Dear All,
I posted at http://trust.cse.ucsc.edu/Code a tiny bit of code that enables you to split a Wikipedia .xml dump into n-page chunks, for a given n. The chunks are then immediately (on the fly) compressed with a compression algorithm you can choose (default: gzip).
We are using this to split a dump so that we can analyze it in more manageable pieces. We hope the code is useful to others as well. (It is a tiny and trivial piece of code, by the way.)
Luca
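Since the page above may not be reachable for everyone (see the reply further down), here is a minimal sketch of what such a splitter could look like in Python. It is not the code Luca posted: the split_dump function, the chunk_NNN.xml.gz naming, and the line-based splitting on </page> are illustrative assumptions, and the sketch does not copy the dump's <mediawiki>/<siteinfo> header into every chunk.

# Hypothetical sketch (not the code posted at trust.cse.ucsc.edu): split a
# MediaWiki XML dump into n-page chunks, compressing each chunk on the fly.
import bz2
import gzip
import sys

# Map algorithm names to stdlib openers; gzip is the default, as in the post.
OPENERS = {"gzip": (gzip.open, ".gz"), "bz2": (bz2.open, ".bz2")}

def split_dump(dump_path, pages_per_chunk=1000, algorithm="gzip"):
    """Stream the dump and write every `pages_per_chunk` <page> elements to a
    separate compressed file: chunk_000.xml.gz, chunk_001.xml.gz, ...

    Naive version: it splits on the literal </page> line and does not
    replicate the <mediawiki>/<siteinfo> header in every chunk."""
    opener, suffix = OPENERS[algorithm]
    chunk_idx = page_count = 0
    out = opener("chunk_%03d.xml%s" % (chunk_idx, suffix), "wt", encoding="utf-8")
    with open(dump_path, encoding="utf-8") as dump:
        for line in dump:
            out.write(line)
            # Article text is XML-escaped in dumps, so a literal </page>
            # reliably marks the end of a page element.
            if "</page>" in line:
                page_count += 1
                if page_count >= pages_per_chunk:
                    out.close()
                    chunk_idx += 1
                    page_count = 0
                    out = opener("chunk_%03d.xml%s" % (chunk_idx, suffix),
                                 "wt", encoding="utf-8")
    out.close()

if __name__ == "__main__":
    # e.g.: python split_dump.py enwiki-pages-articles.xml 1000 gzip
    split_dump(sys.argv[1],
               int(sys.argv[2]) if len(sys.argv) > 2 else 1000,
               sys.argv[3] if len(sys.argv) > 3 else "gzip")

With gzip as the default, each chunk can then be decompressed and processed independently, which is the point of splitting the dump in the first place.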
Luca de Alfaro wrote:
> I posted at http://trust.cse.ucsc.edu/Code a tiny bit of code that enables you to split a Wikipedia .xml dump into n-page chunks, for a given n. The chunks are then immediately (on the fly) compressed with a compression algorithm you can choose (default: gzip).
> We are using this to split a dump so that we can analyze it in more manageable pieces. We hope the code is useful to others as well. (It is a tiny and trivial piece of code, by the way.)
Thanks!
-- brion vibber (brion @ wikimedia.org)
On 10/1/07, Luca de Alfaro luca@soe.ucsc.edu wrote:
> Dear All,
> I posted at http://trust.cse.ucsc.edu/Code a tiny bit of code that enables you to split a Wikipedia .xml dump into n-page chunks, for a given n.
Cool, but I get: "You are not allowed to view this page."