Ariel T. Glenn wrote:
In general we don't recombine the pieces; it is extremely easy for the end user to do so if a single file is really needed. I probably have a shell (bash) script around here that would do it. But people have expressed a preference for more, smaller files, either so that they can process just the piece that contains the pages they care about, or so that they can process the data in parallel.
Bash recipe to create a single XML file from the split ones:

( 7z e -so enwiki-pages-meta-history1.xml.7z | head -n -2 ;
  7z e -so enwiki-pages-meta-history2.xml.7z | tail -n +32 | head -n -2 ;
  ...
  7z e -so enwiki-pages-meta-historyN.xml.7z | tail -n +32 ) > enwiki-pages-meta-history-full.xml
Note: the value 32 varies with the dump version and wiki. It is one more than the line number reported by

7z e -so enwiki-pages-meta-history1.xml.7z 2> /dev/null | grep -n '</siteinfo>'
Still, I would rather use a smarter program than this pile of tail and head filters, which rely on a hard-coded line count and scan millions of lines just to strip a handful.
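Something along these lines might do. It is only a rough sketch, assuming each piece is a complete XML document whose header ends with </siteinfo> and whose only trailing markup is the closing </mediawiki> tag; the piece count N=30 is just a placeholder for whatever the dump actually has:

  #!/bin/bash
  # Sketch: recombine split history dumps by matching on the XML markers
  # instead of a per-dump line count. N is hypothetical; set it to the
  # number of pieces in the dump being reassembled.
  N=30
  {
    # First piece: keep the <mediawiki>/<siteinfo> header, drop only </mediawiki>.
    7z e -so enwiki-pages-meta-history1.xml.7z | sed '/<\/mediawiki>/d'
    # Middle pieces: drop the header through </siteinfo> and the trailing </mediawiki>.
    for i in $(seq 2 $((N - 1))); do
      7z e -so "enwiki-pages-meta-history${i}.xml.7z" | sed '1,/<\/siteinfo>/d; /<\/mediawiki>/d'
    done
    # Last piece: drop the header, keep </mediawiki> so the result is well formed.
    7z e -so "enwiki-pages-meta-history${N}.xml.7z" | sed '1,/<\/siteinfo>/d'
  } > enwiki-pages-meta-history-full.xml

It still streams every line, but it no longer depends on a dump-specific offset, so the same script should work however long the <siteinfo> block happens to be.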
Which brings up a point: a few months back I mentioned that I'd like to produce a large number (~125) of small files for the en wikipedia history dumps, rather than the 30 larger ones we produce now. These files would have the first and last page id of their contents embedded in the filename. Once again I would not plan to recombine these files; recombination adds extra days to the run after the data has already been made available for download. I'd like people's comments on this.
Ariel
That's fine with me.