Steve Summit
And in case no one's made the observation: after just a couple of initial hiccups (affecting, not surprisingly, only the biggest wikis), it seems to be working very well, with all dumps successfully up-to-date, on a cycle of just a few days. http://download.wikimedia.org/ is a very pretty picture now. Well done!
Sorry Steve, you are quite mistaken. I'm probably the one person looking at the XML download progress report most often, as I wait impatiently for a good moment to run wikistats after each month has completed.
Dumps for the largest Wikipedias still fail very frequently. There has not been a useful English dump for months, and maybe only a handful in a whole year. Sometimes the dump job reports all is well when it is not (Brion knows this)
I hate to chase Brion because he has a thousand obligations, but the dump process is still pretty unstable.
http://download.wikimedia.org/enwiki/20060925/ is running now, but:
http://download.wikimedia.org/enwiki/20060920/ reports it is still in progress
http://download.wikimedia.org/enwiki/20060911/ reports the 7z file is OK, but it is 36 MB
http://download.wikimedia.org/enwiki/20060906/ reports the 7z file is OK, but it is 19 MB
http://download.wikimedia.org/enwiki/20060905/ reports the 7z file is OK, but it is 98 bytes
http://download.wikimedia.org/enwiki/20060816/ reports the 7z file is OK, and it is 5.1 GB, but I know it is incomplete: it just stops in the middle of an article
http://download.wikimedia.org/enwiki/20060810/ failed
http://download.wikimedia.org/enwiki/20060803/ in progress
http://download.wikimedia.org/enwiki/20060717/ OK
http://download.wikimedia.org/enwiki/20060702/ OK
http://download.wikimedia.org/enwiki/20060619/ in progress
I could go on: 2 or 3 OK in 10 older runs.
Early this year there was no valid en: archive dump for over 4 months.
I proposed doing the largest dumps in incremental steps (say, one job per letter of the alphabet, concatenated at the end), so that a rerun after an error would be less costly, but Brion says there are no disk resources for that.
As other people commented, the current situation helps to prevent forks ;)
So again, I fully appreciate that Brion can't be all things to all people. But please don't suggest the dump process is reliable enough.
Erik Zachte
"Erik Zachte" wrote:
I proposed doing the largest dumps in incremental steps (say, one job per letter of the alphabet, concatenated at the end), so that a rerun after an error would be less costly, but Brion says there are no disk resources for that.
Why not? 26 files, each with 1/26 of the db, would take the same space as one full dump. It may not work out the same if they're stored directly compressed (the full dumps are done twice, once as bz2 and once as 7z, right?), but at least bz2 allows concatenating multiple bz2 streams into one file (bz2 files are in fact stored as independent blocks), though bz2 files are much larger.
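For what it's worth, the concatenation property is easy to demonstrate. A minimal Python sketch (not the actual dump tooling), compressing a few chunks separately and reading the concatenation back as a single archive:
---------------------------------------
# Minimal sketch: each chunk is compressed as its own complete bz2 stream,
# and the byte-wise concatenation still decompresses as one archive.
import bz2

chunks = [b"<page>A...</page>\n", b"<page>B...</page>\n", b"<page>C...</page>\n"]

with open("combined.xml.bz2", "wb") as out:
    for chunk in chunks:                    # e.g. one per-letter dump job each
        out.write(bz2.compress(chunk))

with bz2.open("combined.xml.bz2", "rb") as f:
    assert f.read() == b"".join(chunks)     # all chunks come back, in order
---------------------------------------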
Platonides wrote:
"Erik Zachte" wrote:
I proposed doing the largest dumps in incremental steps (say one job per letter of the alphabet and concat at the end), so that rerun after error would be less costly but Brion says there are no disk resources for that
Why not? 26 files, each with 1/26 of the db, would take the same space as one full dump.
If you were to concatenate multiple bits in a single stream, it would either take a lot more disk space or you'd increase the run time by a few days to recompress everything.
Really, though, multiple chunks have less to do with disk space than with being more or less infinitely harder to manage and work with.
Actual improvements underway include fixing up the text dump runner to recover from database disconnections (the most common problem), made possible by the switch to PHP 5 and catchable exceptions for errors instead of having the script die.
The next run of each wiki _should_ now be able to recover from disconnected or temporarily overloaded databases.
-- brion vibber (brion @ pobox.com)
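To illustrate the recover-instead-of-die pattern Brion describes: the real runner is PHP, but a rough Python sketch of the same idea might look like this (fetch_text_batch() and reconnect() are hypothetical stand-ins for the database calls):
---------------------------------------
# Rough sketch only: retry a text fetch after a dropped or overloaded database
# connection instead of letting the whole dump run die.
import time

def fetch_with_retry(fetch_text_batch, reconnect, max_attempts=5):
    """fetch_text_batch() and reconnect() are hypothetical callables."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_text_batch()       # may raise on a dropped connection
        except ConnectionError:
            if attempt == max_attempts:
                raise                       # give up only after repeated failures
            time.sleep(2 ** attempt)        # back off while the database recovers
            reconnect()                     # reopen the connection, then retry
---------------------------------------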
Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this)
This part's now fixed; .bz2 failures will not report .7z success on the next run around (but could on the current run while the program's still running).
-- brion vibber (brion @ pobox.com)
Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this)
This part's now fixed; .bz2 failures will not report .7z success on the next run around (but could on the current run while the program's still running).
-- brion vibber (brion @ pobox.com)
If I can possibly please make a suggestion / request, one thing I would quite like from the download.wikipedia.org site is an index file somewhere (could be plain text or XML) that indicates what the latest valid individual dump files are for a given Wikipedia site.
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded; but the later dumps are still useful for most of the files, just not the really large ones). An index file of this type would supersede the http://download.wikipedia.org/enwiki/latest/ directory, and would probably live in http://download.wikipedia.org/enwiki/ .
Given that dumps sometimes fail, it would be good to make it easier to automate the downloading and processing of dump files. Dump consumers could then have a cron job that, say once a day, fetched the latest index file and downloaded the dump files they wanted, if those had been updated.
It might help here to show a rough mock-up example of the type of file I'm thinking of:
===================================================================
<mediawiki xsi:schemaLocation="http://download.wikipedia.org/xml/export-0.1/"
           version="0.1" xml:lang="en">
  <siteinfo>
    <sitename>English Wikipedia</sitename>
  </siteinfo>
  <dump type="site_stats.sql.gz">
    <desc>A few statistics such as the page count.</desc>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql...</url>
    <size_in_bytes>451</size_in_bytes>
    <timestamp>2006-09-24T16:29:01Z</timestamp>
    <md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
  </dump>
  <dump type="pages-articles.xml.bz2">
    <desc>Articles, templates, image descriptions, and primary meta-pages.</desc>
    <timestamp>2006-09-24T22:12:24Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-pages-articles...</url>
    <md5sum>2742b1b4b131d9a28887823da91cf2a5</md5sum>
    <size_in_bytes>1710328527</size_in_bytes>
  </dump>
.... snip various dump entries ....
  <dump type="pages-meta-history.xml.7z">
    <desc>All pages with complete edit history (.7z)</desc>
    <timestamp>2006-08-16T12:55:00Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060816/enwiki-20060816-pages-meta-his...</url>
    <md5sum>24160a71229bee02bb813825bf7413db</md5sum>
    <size_in_bytes>5132097632</size_in_bytes>
  </dump>
</mediawiki>
===================================================================
... the above file is probably invalid XML and needs to be tweaked and so forth, but hopefully it illustrates the idea (e.g. the pages-articles.xml.bz2 entry is recent, whereas the pages-meta-history.xml.7z file is a month older, but both represent the latest valid dump for that type of file). Someone who, for example, only wants the "All pages with complete edit history (.7z)" file could download this index once a day; then, when the entry changes, have a script download the file, verify that the size in bytes matches, verify that the MD5 sum matches, and if everything is good, extract the file, maybe locally verify that it's valid XML, and if it's all still good, process the file in an automated way. Also, after every individual dump file was successfully created, the index file would have to be updated (to ensure it was always current). I think the above information is already on the download.wikipedia.org site, but it's scattered over a number of different places; this would basically unify all of it into one useful data format.
All the best, Nick.
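As a rough illustration of the cron-job consumer Nick describes, here is a Python sketch; the index URL and the element names follow the mock-up above and are hypothetical (no such file exists on download.wikipedia.org):
---------------------------------------
# Sketch of a once-a-day consumer of the *proposed* index file; the index URL
# and element names follow the mock-up above and are hypothetical.
import hashlib
import urllib.request
import xml.etree.ElementTree as ET

INDEX_URL = "http://download.wikipedia.org/enwiki/index.xml"   # hypothetical

def latest_entry(index_xml, dump_type):
    """Return (url, size, md5) for the newest valid dump of the given type."""
    root = ET.fromstring(index_xml)
    for dump in root.iter("dump"):
        if dump.get("type") == dump_type:
            return (dump.findtext("url"),
                    int(dump.findtext("size_in_bytes")),
                    dump.findtext("md5sum"))
    return None

def verify(path, size, md5):
    """Check the downloaded file against the size and MD5 sum from the index."""
    h, total = hashlib.md5(), 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
            total += len(block)
    return total == size and h.hexdigest() == md5

def fetch_if_changed(dump_type, last_md5, dest):
    with urllib.request.urlopen(INDEX_URL) as resp:
        entry = latest_entry(resp.read(), dump_type)
    if entry is None or entry[2] == last_md5:
        return False                       # nothing new since the last cron run
    url, size, md5 = entry
    urllib.request.urlretrieve(url, dest)
    if not verify(dest, size, md5):
        raise ValueError("size or MD5 mismatch; dump looks incomplete")
    return True
---------------------------------------
A cron job could then call fetch_if_changed("pages-meta-history.xml.7z", last_md5, local_path) once a day and only act when it returns True.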
Nick Jenkins wrote:
Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this)
This part's now fixed; .bz2 failures will not report .7z success on the next run around (but could on the current run while the program's still running).
-- brion vibber (brion @ pobox.com)
If I can possibly please make a suggestion / request, one thing I would quite like from the download.wikipedia.org site is an index file somewhere (could be plain text or XML) that indicates what the latest valid individual dump files are for a given Wikipedia site.
That's what the latest directory is for.
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded;
Right, for consistency.
-- brion vibber (brion @ pobox.com)
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded;
Right, for consistency.
Yes, but how often does somebody intentionally download and use every single file from a dump? Most people need either one or two of the dump files; the rest are simply irrelevant to them.
The latest directory is using a lowest-common-denominator approach (latest run where everything succeeded). This file would essentially be a highest-common-denominator approach (latest successful version of each individual file). Maybe both have their place.
However, I've realised it would be useful to include for each data type the date on which the dump run was started, e.g.:
---------------------------------------
  <dump type="site_stats.sql.gz">
    <desc>A few statistics such as the page count.</desc>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql...</url>
+   <dump_run>20060925</dump_run>
    <size_in_bytes>451</size_in_bytes>
    <timestamp>2006-09-24T16:29:01Z</timestamp>
    <md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
  </dump>
---------------------------------------
.. that way anyone that needs multiple files can hold off downloading them until all the "dump_run" fields match up, so as to more easily avoid problems of mixing files from different dumps. (It's true that this field can currently be pulled from the directory in the <url> field, but if a different field is used then the url can point just about anywhere, such as potentially using different hostnames for different dumps, or changing directory structure.)
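For example (again against the hypothetical index format above), a consumer needing several files could check whether all of its wanted entries come from the same run before downloading anything:
---------------------------------------
# Sketch: only proceed when every wanted dump type carries the same <dump_run>.
import xml.etree.ElementTree as ET

def common_run(index_xml, wanted_types):
    root = ET.fromstring(index_xml)
    runs = {d.findtext("dump_run")
            for d in root.iter("dump") if d.get("type") in wanted_types}
    return runs.pop() if len(runs) == 1 else None   # run id, or None: wait longer
---------------------------------------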
Anyway, it's just a suggestion, and if you don't like it, well, there's not much I can do about it ;-)
All the best, Nick.
Nick:
.. that way anyone that needs multiple files can hold off downloading them until all the "dump_run" fields match up, so as to more easily avoid problems of mixing files from different dumps. (It's true that this field can currently be pulled from the directory in the <url> field, but if a different field is used then the url can point just about anywhere, such as potentially using different hostnames for different dumps, or changing directory structure.)
Not for /latest. What about making the -latest- files an HTTP redirect to the real ones instead of symlinks? You could grab the -latest- files with the correct names, so you would know the date... The only possible problem I see is broken download utilities that don't properly handle HTTP redirects. However, since there is no system that relies on the -latest- files, I don't think they are much used anyway.
As for the meaning of 'latest', we could split it into latestfile / latestcomplete, covering both senses.
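If the -latest- names were redirects as suggested (today they are symlinks, so this URL behaviour is hypothetical), a client could learn the real date from the response, roughly like this:
---------------------------------------
# Sketch: a HEAD request against a hypothetical redirecting -latest- URL;
# the final URL after the redirect reveals the dated directory and filename.
import urllib.request

req = urllib.request.Request(
    "http://download.wikipedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2",
    method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("latest resolves to:", resp.geturl())
    # e.g. .../enwiki/20060925/enwiki-20060925-pages-articles.xml.bz2
---------------------------------------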
Well, I think Nick's proposal would indeed be a big improvement...
Presently, the Python tool I'm developing for quantitative analysis based on the db dumps has to loop searching for the latest valid dump for any given Wikipedia (trying every possible date in the URL until I find the correct file...).
Besides that, reading Erik's comments I've realized that I should also check the sizes of the dumps, looking for odd values. But... who knows the "correct" size of a given dump? (OK, other than enwiki.)
So info about dates, sizes, and MD5 sums for every valid dump is *really* interesting.
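For what it's worth, the probing loop described above might look roughly like this in Python (the URL pattern is assumed from the links earlier in this thread):
---------------------------------------
# Rough sketch of the workaround: step back one day at a time until a dump
# file for that date actually exists.
import datetime
import urllib.error
import urllib.request

def find_latest(wiki="enwiki", name="pages-articles.xml.bz2", max_days=120):
    day = datetime.date.today()
    for _ in range(max_days):
        stamp = day.strftime("%Y%m%d")
        url = ("http://download.wikimedia.org/%s/%s/%s-%s-%s"
               % (wiki, stamp, wiki, stamp, name))
        try:
            urllib.request.urlopen(urllib.request.Request(url, method="HEAD"))
            return url                             # a dump dated this day exists
        except urllib.error.HTTPError:
            day -= datetime.timedelta(days=1)      # nothing that day; keep looking
    return None
---------------------------------------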