On Tue, 01-12-2015 at 17:02 +0100, Jérémie Roquet wrote:
2015-12-01 15:54 GMT+01:00 Ariel T. Glenn <aglenn(a)wikimedia.org>:
What does this mean in practice for you, users of
the dumps? It means
that filenames for the page content (article, meta-current and meta
-history) dumps will have pXXXXpYYYY in the names, where XXXX is the
first page id in the file and YYYY is the last page id in the file.
For examples of this you can look at the enwiki page content dumps,
which have been running that way for a few years now.
If I look at https://dumps.wikimedia.org/enwiki/latest/
right now, I
see both the enwiki-latest-pages-articles.xml.bz2 I'm used to and the
enwiki-latest-pages-articlesX.xml-pYpZ.bz2 you're talking about. Does
it mean that the two will coexist for some time?
For articles and meta-current, we always recombine the pieces. So
you'll have one file for those. For full history, no. This
should give you plenty of time to convert your tools to
the new naming scheme. I encourage you to forward this message to
other appropriate people or groups.
I'm currently doing something very simple yet also quite efficient
that looks like:
wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-ar
-O - | bunzip2 | customProcess
What will be the canonical way to perform the same thing? Could we
have an additional file with a *fixed* name which contains the list
the *variable* names of the small chunks, so that something like the
following is possible?
( wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-
-O - | while read chunk; do
wget -q http://dumps.wikimedia.org/frwiki/$date/$chunk
done ) | bunzip2 | customProcess
You can get the names of the files from the md5 or sha1 file, looking for
all filenames with 'pages-articles' in them, or whatever page content
dump you like. I would suggest using that as the canonical list of
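That suggestion can be sketched as a small shell filter. Note this is only
a sketch under assumptions: the checksum file name ("md5sums.txt") and its
"<hash>  <filename>" line format are illustrative, based on typical dump
checksum files, and are not specified in this thread.

```shell
# Assumed checksum line format: "<hash>  <filename>".
# Keep only the page-content chunk names (here: the 'pages-articles' dumps).
chunk_names() {
  awk '{print $2}' | grep 'pages-articles'
}

# Possible usage (network steps commented out in this sketch;
# frwiki-$date-md5sums.txt is an assumed filename):
# wget -q "https://dumps.wikimedia.org/frwiki/$date/frwiki-$date-md5sums.txt" -O - \
#   | chunk_names \
#   | while read -r chunk; do
#       wget -q "https://dumps.wikimedia.org/frwiki/$date/$chunk" -O -
#     done | bunzip2 | customProcess
```

The filter keeps the download loop identical to the one proposed above; only
the source of the chunk list changes from a hypothetical fixed-name manifest
to the checksum file that already exists for each dump run.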