Re: [Xmldatadumps-l] Request for input on de fr wiki breaking change to dumps!

1 Dec 2015

Dear Ariel,

2015-12-01 15:54 GMT+01:00 Ariel T. Glenn <aglenn(a)wikimedia.org>:
...
 What does this mean in practice for you, users of the dumps?  It means
 that filenames for the page content (article, meta-current and
 meta-history) dumps will have pXXXXpYYYY in the names, where XXXX is the
 first page id in the file and YYYY is the last page id in the file.  For
 examples of this you can look at the enwiki page content dumps, which
 have been running that way for a few years now.
If I look at https://dumps.wikimedia.org/enwiki/latest/ right now, I
see both the enwiki-latest-pages-articles.xml.bz2 I'm used to and the
enwiki-latest-pages-articlesX.xml-pYpZ.bz2 you're talking about. Does
it mean that the two will coexist for some time?

...
 This notice should give you plenty of time to convert your tools to use
 the new naming scheme.  I encourage you to forward this message to
 other appropriate people or groups.
I'm currently doing something very simple yet also quite efficient
that looks like:

wget -q http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2 \
  -O - | bunzip2 | customProcess

What will be the canonical way to perform the same thing? Could we
have an additional file with a *fixed* name which contains the list of
the *variable* names of the small chunks, so that something like the
following is possible?

( wget -q \
    http://dumps.wikimedia.org/frwiki/$date/frwiki-$date-pages-articles.xml.bz2… \
    -O - | while read chunk; do
   wget -q http://dumps.wikimedia.org/frwiki/$date/$chunk -O -
 done ) | bunzip2 | customProcess
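
Whatever ends up providing the list of chunk names, the chunks would
presumably have to be fetched in page-id order for the stream to stay in
sequence. Here is a minimal sketch of that ordering step, assuming names
that end in -pXXXXpYYYY.bz2 as described above and arrive one per line
on standard input. The function name sort_chunks is hypothetical, and
lines that do not match the pattern are silently dropped:

```shell
# Sort chunk file names by the first page id in their pXXXXpYYYY suffix.
# Assumption: one name per line on stdin, each ending in -pXXXXpYYYY.bz2.
# Non-matching lines are dropped.
sort_chunks() {
  # Prefix each matching name with its first page id, sort numerically
  # on that key, then strip the key again.
  sed -n 's/^.*-p\([0-9][0-9]*\)p\([0-9][0-9]*\)\.bz2$/\1 &/p' |
    sort -n -k1,1 |
    cut -d' ' -f2-
}
```

It could then sit between the download of the list and the loop, for
example "... -O - | sort_chunks | while read chunk; do ...".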

Thanks!

-- 
Jérémie
