Ariel T. Glenn wrote:
In general we don't recombine the pieces; it is extremely easy for the
end user to do so if a single file is really needed. I probably have a
shell (bash) script around here that would do it. But people have
expressed a preference for more, smaller files, either so that they can
process a piece that contains the pages they like, or so that they can
process the data in parallel.
Bash recipe to create a single XML file from the split ones:
( 7z e -so enwiki-pages-meta-history1.xml.7z | head -n -2 ;
7z e -so enwiki-pages-meta-history2.xml.7z | tail -n +32 | head -n -2 ;
...
7z e -so enwiki-pages-meta-historyN.xml.7z | tail -n +32 ) > enwiki-pages-meta-history.xml
Note: The value 32 varies depending on the dump version and wiki.
It's one more than the line number given by

7z e -so enwiki-pages-meta-history1.xml.7z 2> /dev/null | grep -n -m 1 '</siteinfo>'

(the header of each piece ends with the </siteinfo> tag).
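For example, if </siteinfo> happens to fall on line 31 (the exact line
number here is only an illustration; it differs per dump), the command
prints

31:  </siteinfo>

and the recipe above would then use tail -n +32.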
Still, I would prefer to use a smarter program instead of so many tail
and head filters, which will analyse millions of lines just to remove a few.
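Something along these lines would do it in one pass per piece without
the magic line count; a sketch, assuming each piece carries a header
ending in </siteinfo> and a closing </mediawiki> tag (the piece count
below is made up):

#!/bin/bash
# Recombine split dump pieces, keeping exactly one header and one footer.
n=30   # number of pieces
for i in $(seq 1 $n); do
    7z e -so "enwiki-pages-meta-history${i}.xml.7z" 2> /dev/null |
    awk -v first=$(( i == 1 )) -v last=$(( i == n )) '
        # Outside the first piece, stay silent until the header
        # (everything up to and including </siteinfo>) has passed.
        !first && !body { if (/<\/siteinfo>/) body = 1; next }
        # Keep the closing </mediawiki> only in the last piece.
        /<\/mediawiki>/ { if (last) print; next }
        { print }
    '
done > enwiki-pages-meta-history.xml

It still reads every line, of course, but there is no hard-coded 32 to
get wrong, and the same script works for any wiki and dump version.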
Which brings up a point: a few months back I mentioned that I'd like to
produce a large number of small files, ~125, for the en wikipedia history
dumps, rather than the 30 larger ones we produce now. These files
would have the first and last page id of their contents embedded in the
filename. Once again I would not plan to recombine these files; it
adds extra days to the run after the data has already been made
available for download. I'd like people's comments on this.
That's fine with me.
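And having the page id range right in the filename means nobody needs
the whole set just to look at a handful of pages; a sketch, assuming a
hypothetical naming scheme like
enwiki-pages-meta-history-p<first>p<last>.xml.7z:

pageid=104500   # page we are after
for f in enwiki-pages-meta-history-p*p*.xml.7z; do
    range=${f##*-p}      # e.g. 000000001p000104998.xml.7z
    first=${range%%p*}   # first page id in the piece
    rest=${range#*p}
    last=${rest%%.*}     # last page id in the piece
    # 10# forces base ten, since the ids may be zero-padded
    if (( 10#$first <= pageid && pageid <= 10#$last )); then
        echo "$f contains page $pageid"
    fi
done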