Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this)
This part's now fixed; .bz2 failures will not report .7z success on the next run around (but could on the current run while the program's still running).
-- brion vibber (brion @ pobox.com)
If I may make a suggestion/request: one thing I would quite like from the download.wikipedia.org site is an index file somewhere (plain text or XML) that indicates what the latest valid individual dump files are for a given Wikipedia site.
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approximately Aug-17, which looks to be the latest dump where everything reported success; but the later dumps are still useful for most of the files, just not the really large ones). An index file of this type would supersede the http://download.wikipedia.org/enwiki/latest/ directory, and would probably live in http://download.wikipedia.org/enwiki/ .
Given that some dumps sometimes fail, it would be good to make automating the download and processing of dump files easier. That way, dump consumers could run a cron job that, say, once a day fetched the latest index file and downloaded the dump files they wanted, if those had been updated.
It might help here to show a rough mock-up example of the type of file I'm thinking of:
===================================================================
<mediawiki
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://download.wikipedia.org/xml/export-0.1/"
version="0.1" xml:lang="en">
<siteinfo>
<sitename>English Wikipedia</sitename>
</siteinfo>
<dump type="site_stats.sql.gz">
<desc>A few statistics such as the page count.</desc>
<url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql.gz</url>
<size_in_bytes>451</size_in_bytes>
<timestamp>2006-09-24T16:29:01Z</timestamp>
<md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
</dump>
<dump type="pages-articles.xml.bz2">
<desc>Articles, templates, image descriptions, and primary meta-pages.</desc>
<url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-pages-articles.xml.bz2</url>
<size_in_bytes>1710328527</size_in_bytes>
<timestamp>2006-09-24T22:12:24Z</timestamp>
<md5sum>2742b1b4b131d9a28887823da91cf2a5</md5sum>
</dump>
<!-- .... snip various dump entries .... -->
<dump type="pages-meta-history.xml.7z">
<desc>All pages with complete edit history (.7z)</desc>
<url>http://download.wikipedia.org/enwiki/20060816/enwiki-20060816-pages-meta-history.xml.7z</url>
<size_in_bytes>5132097632</size_in_bytes>
<timestamp>2006-08-16T12:55:00Z</timestamp>
<md5sum>24160a71229bee02bb813825bf7413db</md5sum>
</dump>
</mediawiki>
===================================================================
... the above file is probably invalid XML and needs to be tweaked and so forth, but hopefully it illustrates the idea (e.g. the pages-articles.xml.bz2 entry is recent, whereas the pages-meta-history.xml.7z entry is a month older, but both represent the latest valid dump for that type of file). Someone who, for example, only wants the "All pages with complete edit history (.7z)" file could download this index once a day; then, when that entry changes, download the dump file, verify the size in bytes matches, verify the MD5 sum matches, and if everything is good, extract the file, perhaps locally verify it is valid XML, and if it is all still good, process the file in an automated way.

Also, after every individual dump file was successfully created, the index file would probably have to be updated (to ensure it was always current). I think the above information is already on the download.wikipedia.org site, but it's scattered over a number of different places; this would basically unify all that information into one useful data format.
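To make the consumer side concrete, here is a rough Python sketch of the daily check described above. It assumes the hypothetical index format from the mock-up (the <dump>, <url>, <size_in_bytes> and <md5sum> element names are my invention, not anything the site serves today), and it leaves out the actual HTTP fetch:

```python
# Sketch of a dump consumer's verification step, assuming the hypothetical
# index file format mocked up above. Element names are illustrative only.
import hashlib
import xml.etree.ElementTree as ET


def parse_index(index_xml):
    """Map each dump type to its url/size/md5 metadata from the index XML."""
    root = ET.fromstring(index_xml)
    dumps = {}
    for dump in root.iter("dump"):
        dumps[dump.get("type")] = {
            "url": dump.findtext("url"),
            "size_in_bytes": int(dump.findtext("size_in_bytes")),
            "md5sum": dump.findtext("md5sum"),
        }
    return dumps


def verify_download(data, entry):
    """Check a downloaded file's byte count and MD5 sum against its index entry."""
    if len(data) != entry["size_in_bytes"]:
        return False
    return hashlib.md5(data).hexdigest() == entry["md5sum"]
```

A cron job would fetch the index, compare each entry's timestamp (or md5sum) against what it saw yesterday, and only download and `verify_download()` the files whose entries changed.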
All the best,
Nick.