Steve Summit
And in case no one's made the observation: after just a couple of initial hiccups (affecting, not surprisingly, only the biggest wikis), it seems to be working very well, with all dumps successfully up-to-date, on a cycle of just a few days. http://download.wikimedia.org/ is a very pretty picture now. Well done!
Sorry Steve, you are quite mistaken. I'm probably the one person looking at the XML download progress report most often, as I wait impatiently for a good moment to run wikistats after each month has completed.
Dumps for the largest Wikipedias still fail very frequently. There has not been a useful English dump for months, and maybe only a handful in a whole year. Sometimes the dump job reports all is well when it is not (Brion knows this)
I hate to chase Brion because he has a thousand obligations, but the dump process is still pretty unstable.
http://download.wikimedia.org/enwiki/20060925/ is running now, but:
http://download.wikimedia.org/enwiki/20060920/ reports it is still in progress
http://download.wikimedia.org/enwiki/20060911/ reports the 7z file is OK, but it is 36 MB
http://download.wikimedia.org/enwiki/20060906/ reports the 7z file is OK, but it is 19 MB
http://download.wikimedia.org/enwiki/20060905/ reports the 7z file is OK, but it is 98 bytes
http://download.wikimedia.org/enwiki/20060816/ reports the 7z file is OK, and it is 5.1 GB, but I know it is incomplete: it just stops in the middle of an article
http://download.wikimedia.org/enwiki/20060810/ failed
http://download.wikimedia.org/enwiki/20060803/ in progress
http://download.wikimedia.org/enwiki/20060717/ OK
http://download.wikimedia.org/enwiki/20060702/ OK
http://download.wikimedia.org/enwiki/20060619/ in progress
I could go on: 2 or 3 OK in 10 older runs.
Early this year there was no valid en: archive dump for over 4 months.
I proposed doing the largest dumps in incremental steps (say, one job per letter of the alphabet, concatenated at the end), so that a rerun after an error would be less costly, but Brion says there are no disk resources for that.
As other people commented, the current situation helps to prevent forks ;)
So again, I fully appreciate that Brion can't be all things to all people. But please don't suggest the dump process is reliable enough.
Erik Zachte
"Erik Zachte" wrote:
I proposed doing the largest dumps in incremental steps (say, one job per letter of the alphabet, concatenated at the end), so that a rerun after an error would be less costly, but Brion says there are no disk resources for that.
Why not? 26 files, each with 1/26 of the db, would take the same space as one full dump. It may not work out the same if they're stored directly compressed (the full dumps are done twice, once as bz2 and once as 7z, right?), but at least bz2 allows concatenating multiple bz2 streams into one file (bz2 files are in fact stored as independent blocks), though bz2 files are much larger.
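For what it's worth, the concatenation property is easy to demonstrate. A minimal Python sketch (not the actual dump tooling), compressing a few chunks separately and reading the concatenation back as a single archive:
---------------------------------------
# Minimal sketch: each chunk is compressed as its own complete bz2 stream,
# and the byte-wise concatenation still decompresses as one archive.
import bz2

chunks = [b"<page>A...</page>\n", b"<page>B...</page>\n", b"<page>C...</page>\n"]

with open("combined.xml.bz2", "wb") as out:
    for chunk in chunks:                    # e.g. one per-letter dump job each
        out.write(bz2.compress(chunk))

with bz2.open("combined.xml.bz2", "rb") as f:
    assert f.read() == b"".join(chunks)     # all chunks come back, in order
---------------------------------------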
Platonides wrote:
"Erik Zachte" wrote:
I proposed doing the largest dumps in incremental steps (say one job per letter of the alphabet and concat at the end), so that rerun after error would be less costly but Brion says there are no disk resources for that
Why not? 26 files, each with 1/26 of the db, would take the same space as one full dump.
If you were to concatenate multiple bits in a single stream, it would either take a lot more disk space or you'd increase the run time by a few days to recompress everything.
Really, though, multiple chunks have less to do with disk space than with being more or less infinitely harder to manage and work with.
Actual improvements underway include fixing up the text dump runner to recover from database disconnections (the most common problem), made possible by the switch to PHP 5 and catchable exceptions for errors instead of having the script die.
The next run of each wiki _should_ now be able to recover from disconnected or temporarily overloaded databases.
-- brion vibber (brion @ pobox.com)
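To illustrate the recover-instead-of-die pattern Brion describes: the real runner is PHP, but a rough Python sketch of the same idea might look like this (fetch_text_batch() and reconnect() are hypothetical stand-ins for the database calls):
---------------------------------------
# Rough sketch only: retry a text fetch after a dropped or overloaded database
# connection instead of letting the whole dump run die.
import time

def fetch_with_retry(fetch_text_batch, reconnect, max_attempts=5):
    """fetch_text_batch() and reconnect() are hypothetical callables."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_text_batch()       # may raise on a dropped connection
        except ConnectionError:
            if attempt == max_attempts:
                raise                       # give up only after repeated failures
            time.sleep(2 ** attempt)        # back off while the database recovers
            reconnect()                     # reopen the connection, then retry
---------------------------------------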
Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this)
This part's now fixed; .bz2 failures will not report .7z success on the next run around (but could on the current run while the program's still running).
-- brion vibber (brion @ pobox.com)
Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this)
This part's now fixed; .bz2 failures will not report .7z success on the next run around (but could on the current run while the program's still running).
-- brion vibber (brion @ pobox.com)
If I can possibly please make a suggestion / request, one thing I would quite like from the download.wikipedia.org site is an index file somewhere (could be plain text or XML) that indicates what the latest valid individual dump files are for a given Wikipedia site.
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded; but the later dumps are still useful for most of the files, just not the really large ones). An index file of this type would supersede the http://download.wikipedia.org/enwiki/latest/ directory, and would probably live in http://download.wikipedia.org/enwiki/ .
Given that dumps sometimes fail, it would be good to make it easier to automate the downloading and processing of dump files. Dump consumers could then have a cron job that, say once a day, fetched the latest index file and downloaded the dump files they wanted, if those had been updated.
It might help here to show a rough mock-up example of the type of file I'm thinking of:
===================================================================
<mediawiki xsi:schemaLocation="http://download.wikipedia.org/xml/export-0.1/"
           version="0.1" xml:lang="en">
  <siteinfo>
    <sitename>English Wikipedia</sitename>
  </siteinfo>
  <dump type="site_stats.sql.gz">
    <desc>A few statistics such as the page count.</desc>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql...</url>
    <size_in_bytes>451</size_in_bytes>
    <timestamp>2006-09-24T16:29:01Z</timestamp>
    <md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
  </dump>
  <dump type="pages-articles.xml.bz2">
    <desc>Articles, templates, image descriptions, and primary meta-pages.</desc>
    <timestamp>2006-09-24T22:12:24Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-pages-articles...</url>
    <md5sum>2742b1b4b131d9a28887823da91cf2a5</md5sum>
    <size_in_bytes>1710328527</size_in_bytes>
  </dump>
.... snip various dump entries ....
  <dump type="pages-meta-history.xml.7z">
    <desc>All pages with complete edit history (.7z)</desc>
    <timestamp>2006-08-16T12:55:00Z</timestamp>
    <url>http://download.wikipedia.org/enwiki/20060816/enwiki-20060816-pages-meta-his...</url>
    <md5sum>24160a71229bee02bb813825bf7413db</md5sum>
    <size_in_bytes>5132097632</size_in_bytes>
  </dump>
</mediawiki>
===================================================================
... the above file is probably invalid XML and needs to be tweaked and so forth, but hopefully it illustrates the idea (e.g. the pages-articles.xml.bz2 entry is recent, whereas the pages-meta-history.xml.7z file is a month older, but both represent the latest valid dump for that type of file). Someone who, for example, only wants the "All pages with complete edit history (.7z)" file could download this index once a day; then, when the entry changes, have a script download the file, verify that the size in bytes matches, verify that the MD5 sum matches, and if everything is good, extract the file, maybe locally verify that it's valid XML, and if it's all still good, process the file in an automated way. Also, after every individual dump file was successfully created, the index file would have to be updated (to ensure it was always current). I think the above information is already on the download.wikipedia.org site, but it's scattered over a number of different places; this would basically unify all of it into one useful data format.
All the best, Nick.
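As a rough illustration of the cron-job consumer Nick describes, here is a Python sketch; the index URL and the element names follow the mock-up above and are hypothetical (no such file exists on download.wikipedia.org):
---------------------------------------
# Sketch of a once-a-day consumer of the *proposed* index file; the index URL
# and element names follow the mock-up above and are hypothetical.
import hashlib
import urllib.request
import xml.etree.ElementTree as ET

INDEX_URL = "http://download.wikipedia.org/enwiki/index.xml"   # hypothetical

def latest_entry(index_xml, dump_type):
    """Return (url, size, md5) for the newest valid dump of the given type."""
    root = ET.fromstring(index_xml)
    for dump in root.iter("dump"):
        if dump.get("type") == dump_type:
            return (dump.findtext("url"),
                    int(dump.findtext("size_in_bytes")),
                    dump.findtext("md5sum"))
    return None

def verify(path, size, md5):
    """Check the downloaded file against the size and MD5 sum from the index."""
    h, total = hashlib.md5(), 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
            total += len(block)
    return total == size and h.hexdigest() == md5

def fetch_if_changed(dump_type, last_md5, dest):
    with urllib.request.urlopen(INDEX_URL) as resp:
        entry = latest_entry(resp.read(), dump_type)
    if entry is None or entry[2] == last_md5:
        return False                       # nothing new since the last cron run
    url, size, md5 = entry
    urllib.request.urlretrieve(url, dest)
    if not verify(dest, size, md5):
        raise ValueError("size or MD5 mismatch; dump looks incomplete")
    return True
---------------------------------------
A cron job could then call fetch_if_changed("pages-meta-history.xml.7z", last_md5, local_path) once a day and only act when it returns True.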
Nick Jenkins wrote:
Erik Zachte wrote:
Sometimes the dump job reports all is well when it is not (Brion knows this)
This part's now fixed; .bz2 failures will not report .7z success on the next run around (but could on the current run while the program's still running).
-- brion vibber (brion @ pobox.com)
If I can possibly please make a suggestion / request, one thing I would quite like from the download.wikipedia.org site is an index file somewhere (could be plain text or XML) that indicates what the latest valid individual dump files are for a given Wikipedia site.
That's what the latest directory is for.
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded;
Right, for consistency.
-- brion vibber (brion @ pobox.com)
The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded;
Right, for consistency.
Yes, but how often does somebody intentionally download and use every single file from a dump? Most people need either one or two of the dump files; the rest are simply irrelevant to them.
The latest directory is using a lowest-common-denominator approach (latest run where everything succeeded). This file would essentially be a highest-common-denominator approach (latest successful version of each individual file). Maybe both have their place.
However, I've realised it would be useful to include for each data type the date on which the dump run was started, e.g.:
---------------------------------------
  <dump type="site_stats.sql.gz">
    <desc>A few statistics such as the page count.</desc>
    <url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql...</url>
+   <dump_run>20060925</dump_run>
    <size_in_bytes>451</size_in_bytes>
    <timestamp>2006-09-24T16:29:01Z</timestamp>
    <md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
  </dump>
---------------------------------------
.. that way anyone that needs multiple files can hold off downloading them until all the "dump_run" fields match up, so as to more easily avoid problems of mixing files from different dumps. (It's true that this field can currently be pulled from the directory in the <url> field, but if a different field is used then the url can point just about anywhere, such as potentially using different hostnames for different dumps, or changing directory structure.)
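For example (again against the hypothetical index format above), a consumer needing several files could check whether all of its wanted entries come from the same run before downloading anything:
---------------------------------------
# Sketch: only proceed when every wanted dump type carries the same <dump_run>.
import xml.etree.ElementTree as ET

def common_run(index_xml, wanted_types):
    root = ET.fromstring(index_xml)
    runs = {d.findtext("dump_run")
            for d in root.iter("dump") if d.get("type") in wanted_types}
    return runs.pop() if len(runs) == 1 else None   # run id, or None: wait longer
---------------------------------------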
Anyway, it's just a suggestion, and if you don't like it, well, there's not much I can do about it ;-)
All the best, Nick.
Nick:
.. that way anyone that needs multiple files can hold off downloading them until all the "dump_run" fields match up, so as to more easily avoid problems of mixing files from different dumps. (It's true that this field can currently be pulled from the directory in the <url> field, but if a different field is used then the url can point just about anywhere, such as potentially using different hostnames for different dumps, or changing directory structure.)
Not for /latest. What about making the -latest- files an HTTP redirect to the real ones instead of symlinks? You could grab the -latest- files with the correct names, so you would know the date... The only possible problem I see is broken download utilities that don't properly handle HTTP redirects. However, since there is no system that relies on the -latest- files, I don't think they are much used anyway.
As for the meaning of 'latest', we could split it into latestfile / latestcomplete, covering both senses.
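If the -latest- names were redirects as suggested (today they are symlinks, so this URL behaviour is hypothetical), a client could learn the real date from the response, roughly like this:
---------------------------------------
# Sketch: a HEAD request against a hypothetical redirecting -latest- URL;
# the final URL after the redirect reveals the dated directory and filename.
import urllib.request

req = urllib.request.Request(
    "http://download.wikipedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2",
    method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("latest resolves to:", resp.geturl())
    # e.g. .../enwiki/20060925/enwiki-20060925-pages-articles.xml.bz2
---------------------------------------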
Well, I think Nick's proposal would indeed be a big improvement...
Presently, the Python tool I'm developing for quantitative analysis based on the db dumps has to loop searching for the latest valid dump for any given Wikipedia (trying every possible date in the URL until I find the correct file...).
Besides that, reading Erik's comments I've realized that I should also check the sizes of the dumps, looking for odd values. But... who knows the "correct" size of a given dump? (OK, other than enwiki.)
So info about dates, sizes, and MD5 sums for every valid dump is *really* interesting.
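For what it's worth, the probing loop described above might look roughly like this in Python (the URL pattern is assumed from the links earlier in this thread):
---------------------------------------
# Rough sketch of the workaround: step back one day at a time until a dump
# file for that date actually exists.
import datetime
import urllib.error
import urllib.request

def find_latest(wiki="enwiki", name="pages-articles.xml.bz2", max_days=120):
    day = datetime.date.today()
    for _ in range(max_days):
        stamp = day.strftime("%Y%m%d")
        url = ("http://download.wikimedia.org/%s/%s/%s-%s-%s"
               % (wiki, stamp, wiki, stamp, name))
        try:
            urllib.request.urlopen(urllib.request.Request(url, method="HEAD"))
            return url                             # a dump dated this day exists
        except urllib.error.HTTPError:
            day -= datetime.timedelta(days=1)      # nothing that day; keep looking
    return None
---------------------------------------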