--- On Wed, 25/2/09, Robert Ullmann <rlullmann@gmail.com> wrote:

From: Robert Ullmann <rlullmann@gmail.com>
Subject: Re: [Wikitech-l] Dump processes seem to be dead
To: "Wikimedia developers" <wikitech-l@lists.wikimedia.org>
Date: Wednesday, 25 February 2009, 2:09

> you yourself suggested page id.
> I suggest the history be partitioned into "blocks" by *revision ID*
I've looked at some alternatives for slicing the huge dump files into chunks of a more manageable size. I first thought about dividing the blocks by rev_id, as you suggest, but then realized that this can pose problems for parsers recovering information, since revisions of the same page may end up in different dump files.
Once you have read past the page_id tag, you cannot recover it if the process stops because of an error, unless you save checkpoint information that lets you resume from that point when you restart the process.
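For what it's worth, the checkpointing I have in mind is nothing fancy. A minimal Python sketch (the file name and the saved fields are just placeholders, not anything from the current dump scripts) would be:

import json
import os

CHECKPOINT_FILE = "split.checkpoint"   # hypothetical path

def save_checkpoint(last_page_id, chunk_index):
    # Record the last fully processed page so a crashed run can resume
    # from the next page instead of re-reading the whole dump.
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_page_id": last_page_id, "chunk": chunk_index}, f)

def load_checkpoint():
    # Returns None on a fresh run, otherwise the saved state.
    if not os.path.exists(CHECKPOINT_FILE):
        return None
    with open(CHECKPOINT_FILE) as f:
        return json.load(f)

On restart, the splitter would skip pages with page_id at or below last_page_id and keep appending to the recorded chunk.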
Partitioning by page_id keeps all revisions of the same page in the same block, and it does not disturb algorithms that look up individual revisions.
Yes, the chunks would be slightly bigger, but the difference is small with either 7zip or bzip2, and you gain simplicity in the recovery tools.
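To make the idea concrete, here is a rough Python sketch of cutting a pages-meta-history dump at </page> boundaries, so every revision of a page lands in the same chunk. The chunk size and file names are assumptions, and writing the surrounding <mediawiki>/<siteinfo> wrapper into each chunk is omitted for brevity:

import bz2
import xml.etree.ElementTree as ET

PAGES_PER_CHUNK = 10000   # assumed block size

def split_dump(dump_path, prefix):
    chunk_no = 0
    pages_in_chunk = 0
    out = open("%s-%04d.xml" % (prefix, chunk_no), "wb")
    with bz2.open(dump_path, "rb") as src:
        for _, elem in ET.iterparse(src, events=("end",)):
            # Match </page> regardless of the export namespace.
            if elem.tag == "page" or elem.tag.endswith("}page"):
                out.write(ET.tostring(elem))   # whole page, all its revisions
                elem.clear()                   # free the parsed revisions
                pages_in_chunk += 1
                if pages_in_chunk >= PAGES_PER_CHUNK:
                    out.close()
                    chunk_no += 1
                    pages_in_chunk = 0
                    out = open("%s-%04d.xml" % (prefix, chunk_no), "wb")
    out.close()

Because each chunk only ever ends after a complete <page> element, a recovery tool never has to stitch a page's history back together across files.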
Best,
F.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l