Ariel T. Glenn wrote:
In general we don't recombine the pieces; it is extremely easy for the
end user to do so if a single file is really needed. I probably have a
shell (bash) script around here that would do it. But people have
expressed a preference for more, smaller files, either so that they can
process a piece that contains the pages they like, or so that they can
process the data in parallel.
Bash recipe to create a single XML file from the split ones:
( 7z e -so enwiki-pages-meta-history1.xml.7z | head -n -2 ;
7z e -so enwiki-pages-meta-history2.xml.7z | tail -n +32 | head -n -2 ;
...
7z e -so enwiki-pages-meta-historyN.xml.7z | tail -n +32 ) > enwiki-pages-meta-history.xml
Note: The value 32 varies depending on the dump version and wiki.
It's one more than the line number given by

7z e -so enwiki-pages-meta-history1.xml.7z 2> /dev/null | grep -n -m 1 '</siteinfo>'

(the header of each piece ends with the </siteinfo> tag).
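For example, if </siteinfo> happens to fall on line 31 (the exact line
number here is only an illustration; it differs per dump), the command
prints

31:  </siteinfo>

and the recipe above would then use tail -n +32.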
Still, I would prefer to use a smarter program instead of so many tail
and head filters, which will analyse millions of lines just to remove a few.
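Something along these lines would do it in one pass per piece without
the magic line count; a sketch, assuming each piece carries a header
ending in </siteinfo> and a closing </mediawiki> tag (the piece count
below is made up):

#!/bin/bash
# Recombine split dump pieces, keeping exactly one header and one footer.
n=30   # number of pieces
for i in $(seq 1 $n); do
    7z e -so "enwiki-pages-meta-history${i}.xml.7z" 2> /dev/null |
    awk -v first=$(( i == 1 )) -v last=$(( i == n )) '
        # Outside the first piece, stay silent until the header
        # (everything up to and including </siteinfo>) has passed.
        !first && !body { if (/<\/siteinfo>/) body = 1; next }
        # Keep the closing </mediawiki> only in the last piece.
        /<\/mediawiki>/ { if (last) print; next }
        { print }
    '
done > enwiki-pages-meta-history.xml

It still reads every line, of course, but there is no hard-coded 32 to
get wrong, and the same script works for any wiki and dump version.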
Which brings up a point: a few months back I mentioned that I'd like to
produce a large number of small files, ~125, for the en wikipedia history
dumps, rather than the 30 larger ones we produce now. These files
would have the first and last page id of their contents embedded in the
filename. Once again I would not plan to recombine these files; it
adds extra days to the run after the data has already been made
available for download. I'd like people's comments on this.
That's fine with me.
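And having the page id range right in the filename means nobody needs
the whole set just to look at a handful of pages; a sketch, assuming a
hypothetical naming scheme like
enwiki-pages-meta-history-p<first>p<last>.xml.7z:

pageid=104500   # page we are after
for f in enwiki-pages-meta-history-p*p*.xml.7z; do
    range=${f##*-p}      # e.g. 000000001p000104998.xml.7z
    first=${range%%p*}   # first page id in the piece
    rest=${range#*p}
    last=${rest%%.*}     # last page id in the piece
    # 10# forces base ten, since the ids may be zero-padded
    if (( 10#$first <= pageid && pageid <= 10#$last )); then
        echo "$f contains page $pageid"
    fi
done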