Well, I think Nick's proposal would indeed be a big improvement.
At present, the Python tool I'm developing for quantitative analysis of db dumps has to loop to find the latest valid dump for any given Wikipedia (trying every possible date in the URL until I find the correct file...).
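To illustrate the kind of loop I mean, here is a minimal sketch, not my actual tool. It assumes the URL layout seen in the dump links (wiki/date/wiki-date-file); the names candidate_urls, find_latest, the file_suffix parameter and the 60-day window are just placeholders for this example, and the existence check is passed in as a function so that nothing here depends on a particular HTTP library.

```python
from datetime import date, timedelta

BASE_URL = "http://download.wikipedia.org"


def candidate_urls(wiki, file_suffix, start, max_days=60):
    """Yield dump URLs for `wiki`, newest date first, walking back from `start`."""
    for offset in range(max_days):
        stamp = (start - timedelta(days=offset)).strftime("%Y%m%d")
        yield f"{BASE_URL}/{wiki}/{stamp}/{wiki}-{stamp}-{file_suffix}"


def find_latest(wiki, file_suffix, start, exists, max_days=60):
    """Return the first candidate URL for which `exists(url)` is true, else None.

    `exists` is any callable taking a URL and returning a bool
    (for instance a wrapper around an HTTP HEAD request).
    """
    for url in candidate_urls(wiki, file_suffix, start, max_days):
        if exists(url):
            return url
    return None
```

In the worst case this costs one request per day tried for every wiki, which is exactly why a single index file listing the valid dumps would help.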
Besides that, reading Erik's comments I've realized that I should also check the size of each dump, looking for odd values. But... who knows the "correct" size of a given dump? (OK, other than enwiki.)
So info about dates, sizes, and md5 sums for every valid dump would be *really* useful.
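With a published size and md5 sum, checking a downloaded file becomes simple. A rough sketch, assuming the expected values have already been read from such a file (the function names md5_of_file and dump_matches are only for illustration):

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_matches(path, expected_size, expected_md5):
    """True if the file at `path` has the expected byte size and md5 sum."""
    # Cheap check first: a wrong size means there is no need to hash the file.
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

Reading in chunks matters here, since the larger dumps will not fit comfortably in memory.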
Nick Jenkins nickpj@gmail.com wrote:
> > The "latest" directory is not useful for this purpose (e.g. http://download.wikipedia.org/enwiki/latest/ points to files from approx Aug-17, which looks to be the latest dump where everything reported as succeeded;
Right, for consistency.
Yes, but how often does somebody intentionally download and use every single file from a dump? Most people need either one or two of the dump files; the rest are simply irrelevant to them.
The latest directory is using a lowest-common-denominator approach (latest run where everything succeeded). This file would essentially be a highest-common-denominator approach (latest successful version of each individual file). Maybe both have their place.
However, I've realised it would be useful to include for each data type the date on which the dump run was started, e.g.:
---------------------------------------
A few statistics such as the page count.
http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql...
+ 20060925
451
2006-09-24T16:29:01Z
e4defa79c36823c67ed4d937f8f7013c
---------------------------------------
... that way anyone who needs multiple files can hold off downloading them until all the "dump_run" fields match up, which makes it easier to avoid mixing files from different dumps. (It's true that this value can currently be pulled from the directory in the URL field, but if a separate field is used then the URL can point just about anywhere, such as potentially using different hostnames for different dumps, or changing directory structure.)
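The "hold off until the dump_run fields match" check could look roughly like this. It is only a sketch: it assumes the status file has already been parsed into a dict of records, and apart from "dump_run" every name here (same_dump_run, the record keys used in the usage note) is hypothetical.

```python
def same_dump_run(records, wanted):
    """Return the shared dump_run if every wanted file comes from one run, else None.

    `records` maps a file name to a dict holding at least a "dump_run" value;
    `wanted` is the list of file names the caller needs.
    """
    runs = set()
    for name in wanted:
        record = records.get(name)
        if record is None:
            # A file with no successful dump listed cannot be matched yet.
            return None
        runs.add(record["dump_run"])
    return runs.pop() if len(runs) == 1 else None
```

A client would call this each time it re-reads the status file and start downloading only when it gets a run date back instead of None.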
Anyway, it's just a suggestion, and if you don't like it, well, there's not much I can do about it ;-)
All the best, Nick.