I assume you have all seen https://phabricator.wikimedia.org/T116907
"Explore the possibility of splitting dewiki and frwiki into smaller chunks"
If not, and you ever use frwiki or dewiki page content dumps, go read it now. Or if you know of anyone who uses them, please nag them to go read it.
The upshot is that, most likely starting on January 1st, 2016, we will do all further dump runs of frwiki and dewiki with so-called 'checkpointing'. This change is being made so that if one of these jobs is interrupted for whatever reason, it can be rerun with only the missing page ranges dumped on the second run, saving quite a lot of time. A second reason is to ease the burden on downloaders, who generally prefer downloading several smaller files rather than one large 90 GB file (example size taken from the dewiki history dumps).
What does this mean in practice for you, users of the dumps? It means that filenames for the page content (articles, meta-current and meta-history) dumps will have pXXXXpYYYY in the names, where XXXX is the first page id in the file and YYYY is the last page id in the file. For examples of this you can look at the enwiki page content dumps, which have been running that way for a few years now.
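To make the pattern concrete, here is a hedged illustration; the date, the page-id boundaries, and the exact number formatting below are invented placeholders, not real chunk boundaries:

# Hypothetical filenames under the new scheme (page ranges are made up):
#   frwiki-20160101-pages-meta-history1.xml-p1p4242.bz2
#   frwiki-20160101-pages-meta-history2.xml-p4243p25137.bz2
# A shell glob that matches every full-history chunk of one run:
ls frwiki-20160101-pages-meta-history*.xml-p*p*.bz2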
This notice should give you plenty of time to convert your tools to use the new naming scheme. I encourage you to forward this message to other appropriate people or groups.
Thanks,
Ariel
Dear Ariel,
2015-12-01 15:54 GMT+01:00 Ariel T. Glenn aglenn@wikimedia.org:
What does this mean in practice for you, users of the dumps? It means that filenames for the page content (articles, meta-current and meta-history) dumps will have pXXXXpYYYY in the names, where XXXX is the first page id in the file and YYYY is the last page id in the file. For examples of this you can look at the enwiki page content dumps, which have been running that way for a few years now.
If I look at https://dumps.wikimedia.org/enwiki/latest/ right now, I see both the enwiki-latest-pages-articles.xml.bz2 I'm used to and the enwiki-latest-pages-articlesX.xml-pYpZ.bz2 you're talking about. Does it mean that the two will coexist for some time?
This notice should give you plenty of time to convert your tools to use the new naming scheme. I encourage you to forward this message to other appropriate people or groups.
I'm currently doing something very simple yet also quite efficient that looks like:
wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2 -O - | bunzip2 | customProcess
What will be the canonical way to perform the same thing? Could we have an additional file with a *fixed* name which contains the list of the *variable* names of the small chunks, so that something like the following is possible?
( wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2.list -O - | while read chunk; do wget -q http://dumps.wikimedia.org/frwiki/$date/$chunk -O -; done ) | bunzip2 | customProcess
Thanks!
On Tue, 01-12-2015 at 17:02 +0100, Jérémie Roquet wrote:
Dear Ariel,
2015-12-01 15:54 GMT+01:00 Ariel T. Glenn aglenn@wikimedia.org:
What does this mean in practice for you, users of the dumps? It means that filenames for the page content (articles, meta-current and meta-history) dumps will have pXXXXpYYYY in the names, where XXXX is the first page id in the file and YYYY is the last page id in the file. For examples of this you can look at the enwiki page content dumps, which have been running that way for a few years now.
If I look at https://dumps.wikimedia.org/enwiki/latest/ right now, I see both the enwiki-latest-pages-articles.xml.bz2 I'm used to and the enwiki-latest-pages-articlesX.xml-pYpZ.bz2 you're talking about. Does it mean that the two will coexist for some time?
For articles and meta-current, we always recombine the pieces. So you'll have one file for those. For full history, no.
This notice should give you plenty of time to convert your tools to use the new naming scheme. I encourage you to forward this message to other appropriate people or groups.
I'm currently doing something very simple yet also quite efficient that looks like:
wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2 -O - | bunzip2 | customProcess
What will be the canonical way to perform the same thing? Could we have an additional file with a *fixed* name which contains the list of the *variable* names of the small chunks, so that something like the following is possible?
( wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2.list -O - | while read chunk; do wget -q http://dumps.wikimedia.org/frwiki/$date/$chunk -O -; done ) | bunzip2 | customProcess
Thanks!
You can get the names of the files from the md5sums or sha1sums file, looking for all filenames with 'pages-articles' in them, or whatever page content dump you like. I would suggest using that as the canonical list of files.
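For what it's worth, here is a minimal sketch of that approach, assuming the checksum file is named like frwiki-$date-md5sums.txt with the usual 'hash  filename' layout (both of those are assumptions about the current dump layout; customProcess is the placeholder from the pipeline quoted above):

date=20160101                                  # placeholder run date
base="https://dumps.wikimedia.org/frwiki/$date"
# Fetch the checksum file, keep only the per-range pages-articles chunks,
# then stream each chunk; bunzip2 copes with the concatenated bz2 streams.
( wget -q "$base/frwiki-$date-md5sums.txt" -O - \
    | awk '{print $2}' \
    | grep 'pages-articles[0-9].*\.xml-p.*\.bz2$' \
    | while read -r chunk; do
        wget -q "$base/$chunk" -O -
      done
) | bunzip2 | customProcess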
Ariel
2015-12-02 15:54 GMT+01:00 Ariel T. Glenn aglenn@wikimedia.org:
On Tue, 01-12-2015 at 17:02 +0100, Jérémie Roquet wrote:
If I look at https://dumps.wikimedia.org/enwiki/latest/ right now, I see both the enwiki-latest-pages-articles.xml.bz2 I'm used to and the enwiki-latest-pages-articlesX.xml-pYpZ.bz2 you're talking about. Does it mean that the two will coexist for some time?
For articles and meta-current, we always recombine the pieces. So you'll have one file for those. For full history, no.
That means no change for most use cases. Great!
What will be the canonical way to perform the same thing? Could we have an additional file with a *fixed* name which contains the list of the *variable* names of the small chunks, so that something like the following is possible?
( wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2.list -O - | while read chunk; do wget -q http://dumps.wikimedia.org/frwiki/$date/$chunk -O -; done ) | bunzip2 | customProcess
You can get the names of the files from the md5sums or sha1sums file, looking for all filenames with 'pages-articles' in them, or whatever page content dump you like. I would suggest using that as the canonical list of files.
Makes perfect sense to me. Thank you!