The "latest" directory is not useful for this purpose (e.g.
http://download.wikipedia.org/enwiki/latest/ points to files from approx
Aug-17, which looks to be the latest dump where everything reported as
succeeded;
Right, for consistency.
Yes, but how often does somebody intentionally download and use every
single file from a dump? Most people need only one or two of the dump
files; the rest are simply irrelevant to them.
The "latest" directory uses a lowest-common-denominator approach (the
latest run where everything succeeded). This file would essentially be a
highest-common-denominator approach (the latest successful version of
each individual file). Maybe both have their place.
However, I've realised it would be useful to include for each data type the date on
which the dump run was started, e.g.:
---------------------------------------
<dump type="site_stats.sql.gz">
<desc>A few statistics such as the page count.</desc>
<url>http://download.wikipedia.org/enwiki/20060925/enwiki-20060925-site_stats.sql.gz</url>
+ <dump_run>20060925</dump_run>
<size_in_bytes>451</size_in_bytes>
<timestamp>2006-09-24T16:29:01Z</timestamp>
<md5sum>e4defa79c36823c67ed4d937f8f7013c</md5sum>
</dump>
---------------------------------------
.. that way anyone who needs multiple files can hold off downloading them
until all the "dump_run" fields match up, which makes it easier to avoid
mixing files from different dumps. (It's true that this field can
currently be pulled from the directory in the <url> field, but if a
separate field is used then the url can point just about anywhere, such
as potentially using different hostnames for different dumps, or a
changed directory structure.)
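To make the "hold off until the dump_run fields match" idea concrete, here
is a rough Python sketch of what a client could do. It is only an
illustration under assumptions: the <dumps> wrapper element, the second
entry and its run date, and the function names dump_runs / runs_match are
all made up for the example; only the <dump type="...">, and <dump_run>
pieces come from the sample above.

```python
import xml.etree.ElementTree as ET

# Illustrative index only. The <dumps> root element and the second entry
# (including its run date) are assumptions, not real data.
SAMPLE_INDEX = """<dumps>
  <dump type="site_stats.sql.gz">
    <dump_run>20060925</dump_run>
  </dump>
  <dump type="pages-articles.xml.bz2">
    <dump_run>20060816</dump_run>
  </dump>
</dumps>"""


def dump_runs(index_xml, wanted_types):
    """Map each wanted dump type found in the index to its dump_run value."""
    root = ET.fromstring(index_xml)
    runs = {}
    for dump in root.iter("dump"):
        dump_type = dump.get("type")
        if dump_type in wanted_types:
            # findtext returns None if the <dump_run> element is missing.
            runs[dump_type] = dump.findtext("dump_run")
    return runs


def runs_match(index_xml, wanted_types):
    """True only if every wanted type is listed and all share one dump_run."""
    runs = dump_runs(index_xml, wanted_types)
    if set(runs) != set(wanted_types):
        return False  # at least one wanted file is not in the index
    if None in runs.values():
        return False  # an entry lacks the dump_run field
    return len(set(runs.values())) == 1
```

Someone needing both files would poll the index and only start
downloading once runs_match(...) returns True for the types they care
about; someone needing a single file can just take whatever is listed.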
Anyway, it's just a suggestion, and if you don't like it, well, there's
not much I can do about it ;-)
All the best,
Nick.