--- On Wed, 2/25/09, Robert Ullmann <rlullmann(a)gmail.com> wrote:
From: Robert Ullmann <rlullmann(a)gmail.com>
Subject: Re: [Wikitech-l] Dump processes seem to be dead
To: "Wikimedia developers" <wikitech-l(a)lists.wikimedia.org>
Date: Wednesday, 25 February, 2009, 2:09
you yourself suggested page id.
I suggest the history be partitioned into "blocks" by *revision ID*
I've checked some alternatives for slicing the huge dump files into chunks of a more
manageable size. I first thought about dividing the blocks by rev_id, as you suggest.
Then I realized that this can pose problems for parsers recovering information, since
revisions corresponding to the same page may fall in different dump files.
Once you have passed the page_id tag, you cannot recover it if the process stops on
some error, unless you save breakpoint information so that you can pick up from that
point when you restart the process.
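To make that concern concrete, here is a rough Python sketch of the extra bookkeeping a
parser would need if revisions of one page could span two chunk files. The checkpoint
file name and its format are just assumptions for illustration, not part of any existing
dump tool:

import json

CHECKPOINT_FILE = "parse_checkpoint.json"   # hypothetical location

def save_checkpoint(page_id, rev_id):
    # Remember the enclosing page and the last revision processed,
    # since the next chunk file will not repeat the page_id tag.
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"page_id": page_id, "rev_id": rev_id}, f)

def load_checkpoint():
    # Return (page_id, rev_id) to resume from, or (None, None) on a fresh start.
    try:
        with open(CHECKPOINT_FILE) as f:
            state = json.load(f)
        return state["page_id"], state["rev_id"]
    except FileNotFoundError:
        return None, None

With page_id-based chunks none of this is needed, because a parser can always restart at
the beginning of the chunk that contains the page.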
Partitioning by page_id, you keep all revisions of the same page in the same block,
without disturbing algorithms that look for individual revisions.
Yes, the chunks would be slightly bigger, but the difference is not that large with either
7zip or bzip2, and it favors simpler recovery tools.
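For illustration, a minimal Python sketch of the page_id-based splitting I have in mind:
a new chunk is only started at a </page> boundary, so all revisions of a page stay in the
same file. The size threshold and file naming are assumptions, and it ignores the
<siteinfo> header and the closing </mediawiki> tag for brevity:

import bz2

TARGET_SIZE = 1 << 30   # ~1 GiB of uncompressed text per chunk (assumed threshold)

def split_dump(dump_path, prefix="chunk"):
    chunk_no, written = 0, 0
    out = open(f"{prefix}-{chunk_no:04d}.xml", "w", encoding="utf-8")
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            out.write(line)
            written += len(line)
            # Rotate files only once the current <page> element is closed,
            # so revisions of one page never span two chunks.
            if written >= TARGET_SIZE and line.strip() == "</page>":
                out.close()
                chunk_no, written = chunk_no + 1, 0
                out = open(f"{prefix}-{chunk_no:04d}.xml", "w", encoding="utf-8")
    out.close()

A rev_id-based splitter would instead have to cut between <revision> elements, which is
exactly what forces the checkpointing shown above.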
Best,
F.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l