Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this).
This part is now fixed; .bz2 failures will not report .7z success on the next run (but could on the current run, while the program is still running).
-- brion vibber (brion @ pobox.com)
If I may make a suggestion/request: one thing I would very much like from the download.wikipedia.org site is an index file (plain text or XML) that indicates the latest valid individual dump files for a given Wikipedia site.
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded; but the later dumps are still useful for most of the files, just not the really large ones). An index file of this type would supersede the http://download.wikipedia.org/enwiki/latest/ directory, and would probably live in http://download.wikipedia.org/enwiki/ .
Given that some dumps occasionally fail, it would be good to make it easier to automate downloading and processing dump files. Dump consumers could then have a cron job that, say once a day, fetched the latest index file and downloaded the dump files they wanted, if those had been updated.
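Purely as a rough sketch of that daily check (assuming Python, and a hypothetical index URL such as http://download.wikipedia.org/enwiki/index.xml and a made-up cache path; the index format itself is only mocked up below), the cron job's first step might look something like:

===================================================================
#!/usr/bin/env python
# Hedged sketch: fetch the (hypothetical) index file once a day and
# notice whether it has changed since the previous run. The index URL
# and cache path are assumptions for illustration only.
import hashlib
import os
import urllib.request

INDEX_URL = "http://download.wikipedia.org/enwiki/index.xml"  # hypothetical
CACHE_PATH = "/var/cache/wikidumps/enwiki-index.xml"          # hypothetical

def fetch_index():
    new_index = urllib.request.urlopen(INDEX_URL).read()
    old_index = b""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            old_index = f.read()
    changed = hashlib.md5(new_index).digest() != hashlib.md5(old_index).digest()
    if changed:
        with open(CACHE_PATH, "wb") as f:
            f.write(new_index)
    return changed, new_index

if __name__ == "__main__":
    changed, index_xml = fetch_index()
    if changed:
        print("Index updated; check individual <dump> entries for new files.")
===================================================================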
It might help here to show a rough mock-up example of the type of file I'm thinking of:
===================================================================
<mediawiki xsi:schemaLocation="http://download.wikipedia.org/xml/export-0.1/"
           version="0.1" xml:lang="en">
  <siteinfo>
    <sitename>English Wikipedia</sitename>
  </siteinfo>
  <dump type="site_stats.sql.gz">
    <desc>A few statistics such as the page count.</desc>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql...</url>
    <size_in_bytes>451</size_in_bytes>
    <timestamp>2006-09-24T16:29:01Z</timestamp>
    <md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
  </dump>
  <dump type="pages-articles.xml.bz2">
    <desc>Articles, templates, image descriptions, and primary meta-pages.</desc>
    <timestamp>2006-09-24T22:12:24Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-pages-articles...</url>
    <md5sum>2742b1b4b131d9a28887823da91cf2a5</md5sum>
    <size_in_bytes>1710328527</size_in_bytes>
  </dump>
.... snip various dump entries ....
  <dump type="pages-meta-history.xml.7z">
    <desc>All pages with complete edit history (.7z)</desc>
    <timestamp>2006-08-16T12:55:00Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060816/enwiki-20060816-pages-meta-his...</url>
    <md5sum>24160a71229bee02bb813825bf7413db</md5sum>
    <size_in_bytes>5132097632</size_in_bytes>
  </dump>
</mediawiki>
===================================================================
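To make the intent concrete, here is a rough sketch (mine, not an existing tool) of how a consumer might parse such an index with Python's standard xml.etree.ElementTree, assuming the final format ends up as valid, namespace-free XML with the element names used in the mock-up above:

===================================================================
# Sketch only: parse the hypothetical index XML into a dict keyed by
# dump type. Assumes valid XML with the element names from the mock-up.
import xml.etree.ElementTree as ET

def parse_index(index_xml):
    root = ET.fromstring(index_xml)
    dumps = {}
    for dump in root.findall("dump"):
        dumps[dump.get("type")] = {
            "desc": dump.findtext("desc"),
            "url": dump.findtext("url"),
            "timestamp": dump.findtext("timestamp"),
            "md5sum": dump.findtext("md5sum"),
            "size_in_bytes": int(dump.findtext("size_in_bytes")),
        }
    return dumps

# e.g. parse_index(index_xml)["pages-meta-history.xml.7z"]["url"]
===================================================================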
... the above file is probably invalid XML and would need to be tweaked and so forth, but hopefully it illustrates the idea (e.g. the pages-articles.xml.bz2 entry is recent, whereas the pages-meta-history.xml.7z file is a month older, but both represent the latest valid dump for that type of file).

Someone who, for example, only wants the "All pages with complete edit history (.7z)" file could download the index once a day. When that entry changed, they would download the file, verify that the size in bytes matches, verify that the MD5 sum matches, and if everything is good, extract the file, perhaps verify locally that it is valid XML, and if all is still good, process the file in an automated way. Also, after every individual dump file was successfully created, the index file would have to be updated (to ensure it was always current).

I think the above information is already on the download.wikipedia.org site, but it is scattered over a number of different places; this would basically unify all of that information into one nice, useful data format.
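As a hedged illustration of that verification step (again assuming the element names from the mock-up, a dict entry as produced by the parsing sketch above, and a hypothetical local file path), the size and MD5 checks might look like:

===================================================================
# Sketch: verify a downloaded dump against its index entry before
# processing it. The file path and entry dict are hypothetical.
import hashlib
import os

def verify_dump(path, entry):
    # 1. Size check is cheap, so do it first.
    if os.path.getsize(path) != entry["size_in_bytes"]:
        return False
    # 2. MD5 check, streamed so multi-GB dumps need not fit in memory.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    return md5.hexdigest() == entry["md5sum"]
===================================================================

If both checks pass, the consumer would then extract the file, optionally confirm it is well-formed XML, and hand it off to whatever automated processing follows.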
All the best, Nick.