Ariel T. Glenn wrote:
In general we don't recombine the pieces; it is extremely easy for the end user to do so if a single file is really needed. I probably have a shell (bash) script around here that would do it. But people have expressed a preference for more, smaller files, either so that they can process just the piece that contains the pages they care about, or so that they can process the data in parallel.
Bash recipe to create a single XML file from the split ones:

( 7z e -so enwiki-pages-meta-history1.xml.7z | head -n -2 ;
  7z e -so enwiki-pages-meta-history2.xml.7z | tail -n +32 | head -n -2 ;
  ...
  7z e -so enwiki-pages-meta-historyN.xml.7z | tail -n +32 ) > enwiki-pages-meta-history-full.xml
Note: the value 32 varies with the dump version and wiki. It is one more than the line number reported by

7z e -so enwiki-pages-meta-history1.xml.7z 2> /dev/null | grep -n '</siteinfo>'
Still, I would rather use a smarter program than this pile of tail and head filters, which rely on a hard-coded line count and scan millions of lines just to strip a handful.
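Something along these lines might do. It is only a rough sketch, assuming each piece is a complete XML document whose header ends with </siteinfo> and whose only trailing markup is the closing </mediawiki> tag; the piece count N=30 is just a placeholder for whatever the dump actually has:

  #!/bin/bash
  # Sketch: recombine split history dumps by matching on the XML markers
  # instead of a per-dump line count. N is hypothetical; set it to the
  # number of pieces in the dump being reassembled.
  N=30
  {
    # First piece: keep the <mediawiki>/<siteinfo> header, drop only </mediawiki>.
    7z e -so enwiki-pages-meta-history1.xml.7z | sed '/<\/mediawiki>/d'
    # Middle pieces: drop the header through </siteinfo> and the trailing </mediawiki>.
    for i in $(seq 2 $((N - 1))); do
      7z e -so "enwiki-pages-meta-history${i}.xml.7z" | sed '1,/<\/siteinfo>/d; /<\/mediawiki>/d'
    done
    # Last piece: drop the header, keep </mediawiki> so the result is well formed.
    7z e -so "enwiki-pages-meta-history${N}.xml.7z" | sed '1,/<\/siteinfo>/d'
  } > enwiki-pages-meta-history-full.xml

It still streams every line, but it no longer depends on a dump-specific offset, so the same script should work however long the <siteinfo> block happens to be.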
Which brings up a point: a few months back I mentioned that I'd like to produce a large number (~125) of small files for the en wikipedia history dumps, rather than the 30 larger ones we produce now. These files would have the first and last page id of their contents embedded in the filename. Once again I would not plan to recombine these files; recombination adds extra days to the run after the data has already been made available for download. I'd like people's comments on this.
Ariel
That's fine with me.