2015-12-02 15:54 GMT+01:00 Ariel T. Glenn <aglenn(a)wikimedia.org>:
On 01-12-2015 (Tue) at 17:02 +0100, Jérémie Roquet wrote:
If I look at https://dumps.wikimedia.org/enwiki/latest/ right now, I
see both the enwiki-latest-pages-articles.xml.bz2 I'm used to and the
enwiki-latest-pages-articlesX.xml-pYpZ.bz2 you're talking about. Does
it mean that the two will coexist for some time?
For articles and meta-current, we always recombine the pieces. So
you'll have one file for those. For full history, no.
That means no change for most use cases. Great!
What will be the canonical way to perform the same thing? Could we
have an additional file with a *fixed* name which contains the list
of the *variable* names of the small chunks, so that something like
the following is possible?
( wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2.list -O - |
  while read chunk; do
    wget -q http://dumps.wikimedia.org/frwiki/$date/$chunk -O -
  done ) | bunzip2 | customProcess
You can get the names of the files from the md or sha file, looking for
all filenames with 'pages-articles' in them, or whatever page content
dump you like. I would suggest using that as the canonical list of
files.
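For illustration, the approach above could be sketched roughly as
follows. This is only a sketch under assumptions that are not confirmed
in this thread: that the checksum file is named
$wiki-$date-md5sums.txt, that each of its lines has the form
"<checksum>  <filename>", and that the chunk names follow the
pages-articlesX.xml-pYpZ.bz2 pattern mentioned earlier. The function
names list_dump_files and fetch_dump_files are made up for the example.

```shell
#!/bin/sh
# Sketch: treat the checksum file as the list of dump files to fetch.
# Assumed (not confirmed in the thread): the checksum file is called
# "$wiki-$date-md5sums.txt" and each line is "<checksum>  <filename>".

# Read checksum lines on stdin and print the file names (second column)
# that match the extended regular expression given as $1.
list_dump_files() {
    awk '{ print $2 }' | grep -E -- "$1"
}

# Stream every matching file of one dump run to stdout, in the order
# in which the files appear in the checksum file.
# $1 = wiki name, $2 = dump date, $3 = regex on the file name.
fetch_dump_files() {
    wiki=$1
    date=$2
    pattern=$3
    base="https://dumps.wikimedia.org/$wiki/$date"
    wget -q "$base/$wiki-$date-md5sums.txt" -O - |
        list_dump_files "$pattern" |
        while read -r chunk; do
            wget -q "$base/$chunk" -O -
        done
}

# Possible usage (customProcess is the placeholder from the thread):
#   fetch_dump_files frwiki "$date" 'pages-articles[0-9]+\.xml.*\.bz2$' \
#       | bunzip2 | customProcess
```

The regular expression in the usage line requires a digit right after
"pages-articles", so the recombined single file is skipped and only the
chunks are fetched. A plain 'pages-articles' pattern would match both.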
Makes perfect sense to me. Thank you!
--
Jérémie