On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer <randall@wawd.com> wrote:
Keeping the dumps in a text-based format doesn't make sense, because
that can't be updated efficiently, which is the whole reason for the new dumps.
First, glad to see there's motion here.
It's definitely true that recompressing the entire history to .bz2 or .7z goes very, very slowly. Also, I don't know of an existing tool that lets you just insert new data here and there without compressing all of the unchanged data as well. Those point towards some sort of format change.
I'm not sure a new format has to be sparse or indexed to get around those two big problems.
For full-history dumps, delta coding (or the related idea of long-range redundancy compression) runs faster than bzip2 or 7z and still produces good compression ratios, based on some tests (https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3). (I'm going to focus mostly on full-history dumps here because they're the hard case and the one Ariel said is currently painful--not everything here will apply to latest-revs dumps.)
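To make the delta-coding idea concrete, here's a toy Python sketch, with difflib standing in for a real binary delta coder like xdelta3 (the revision texts and the null-byte separator are made up for illustration):

    import bz2
    import difflib

    def delta_encode(revisions):
        # Keep the first revision in full; store each later revision as a
        # line-level diff against its predecessor.
        deltas = [revisions[0]]
        for prev, curr in zip(revisions, revisions[1:]):
            diff = difflib.unified_diff(prev.splitlines(), curr.splitlines(),
                                        lineterm='')
            deltas.append('\n'.join(diff))
        return deltas

    revs = ["Hello world.\nFirst draft.",
            "Hello world.\nFirst draft, now expanded a bit.",
            "Hello there, world.\nFirst draft, now expanded a bit."]
    # The compressor now sees small diffs instead of near-identical
    # full texts, which is where the speed and ratio wins come from.
    packed = bz2.compress('\x00'.join(delta_encode(revs)).encode('utf-8'))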
For inserting data, you do seemingly need to break the file up into independently-compressed sections, each containing one page's revision history or a fragment of it, so you can add new diff(s) to a page without decompressing and recompressing its previous revisions. (Removing previously-dumped revisions is another story, but it's rarer.) You'd be in new territory just doing that; I don't know of existing compression tools that really allow it.
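To illustrate that kind of layout, here's a toy Python sketch: per-page compressed blocks plus an index, so adding revisions compresses only the new data (the BlockFile class, its in-memory layout, and JSON-encoded revisions are all made up for illustration, not a real dump format):

    import bz2
    import json

    class BlockFile:
        # Each page's revisions live in independently bz2-compressed
        # blocks; adding revisions to one page never touches the
        # compressed bytes of any other page.
        def __init__(self):
            self.data = bytearray()
            self.index = {}  # page_id -> [(offset, length), ...]

        def append_block(self, page_id, revisions):
            block = bz2.compress(json.dumps(revisions).encode('utf-8'))
            self.index.setdefault(page_id, []).append(
                (len(self.data), len(block)))
            self.data.extend(block)

        def read_page(self, page_id):
            out = []
            for offset, length in self.index.get(page_id, []):
                chunk = bytes(self.data[offset:offset + length])
                out.extend(json.loads(bz2.decompress(chunk)))
            return out

    store = BlockFile()
    store.append_block(42, ["rev 1 text", "rev 2 text"])  # initial dump
    store.append_block(42, ["rev 3 text"])  # later: only new data compressed
    print(store.read_page(42))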
You could do those two things, though, while still keeping full-history dumps a once-every-so-often batch process that produces a sorted file. The time to rewrite the file, stripped of the big compression steps, could be bearable--a disk can read or write about 100 MB/s, so just copying the 70G of the .7z enwiki dumps (roughly twelve minutes each to read and write at that rate) is well under an hour; if the part bound by CPU and other steps is smallish, you're OK.
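As a sketch of that batch rewrite: assuming both the old dump and the incoming batch are sorted by (page, revision), a single streaming merge pass does it (toy Python; the record layout is made up, and a real rewrite could copy unchanged compressed blocks through verbatim):

    import heapq

    def rewrite_dump(old_dump, new_revisions):
        # Both inputs are already sorted by (page id, revision id), so
        # one sequential merge pass yields the new sorted dump.
        return heapq.merge(old_dump, new_revisions,
                           key=lambda r: (r['page'], r['rev']))

    old = [{'page': 1, 'rev': 1}, {'page': 1, 'rev': 2}, {'page': 2, 'rev': 1}]
    new = [{'page': 1, 'rev': 3}, {'page': 2, 'rev': 2}]
    print(list(rewrite_dump(iter(old), iter(new))))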
A format like the proposed one, with revisions inserted wherever there's free space when they come in, will also eventually fragment the revision history for one page (I think Ariel alluded to this in some early notes). Unlike sequential reads/writes, seeks are something HDDs are sadly pretty slow at (hence the excitement about solid-state disks); if thousands of revisions are coming in a day, it eventually becomes slow to read things in the old page/revision order, and you need fancy techniques to defrag (maybe a big external-memory sort: http://en.wikipedia.org/wiki/External_sorting) or you need to only read the dump on fast hardware that can handle the seeks. Doing occasional batch jobs that produce sorted files could help avoid the fragmentation question.
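For reference, an external-memory sort in miniature looks something like this (toy Python; run_size and the pickled on-disk run format are arbitrary choices for illustration):

    import heapq
    import pickle
    import tempfile

    def external_sort(records, key, run_size=100000):
        # Sort fixed-size runs in memory, spill each run to a temp file,
        # then stream-merge the sorted runs; all I/O stays sequential.
        runs, buf = [], []

        def spill():
            f = tempfile.TemporaryFile()
            for rec in sorted(buf, key=key):
                pickle.dump(rec, f)
            f.seek(0)
            runs.append(f)
            buf.clear()

        for rec in records:
            buf.append(rec)
            if len(buf) >= run_size:
                spill()
        if buf:
            spill()

        def read_run(f):
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

        return heapq.merge(*(read_run(f) for f in runs), key=key)

    # Toy usage: (page, rev) records arrive fragmented, come out sorted.
    frags = [(2, 1), (1, 2), (1, 1), (2, 2)]
    print(list(external_sort(iter(frags), key=lambda r: r, run_size=2)))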
These are some interesting ideas.
You're right that copying the whole dump is fast enough (it would probably add about an hour to a process that currently takes several days). But it would also pretty much force the use of delta compression. And while I would like to use delta compression, I don't think it's a good idea to be forced to use it, because I might not have the time for it, or it might not be good enough.
Because of that, I decided to stay with my indexed approach.
There's a great quote about the difficulty of "constructing a software design...to make it so simple that there are obviously no deficiencies." (Wikiquote came through with the full text/attribution, of course: http://en.wikiquote.org/wiki/C._A._R._Hoare.) I admit it's tricky and people can disagree about what's simple enough, or even which of two approaches is simpler, but it's something to strive for.
Anyway, I'm wary about going into the technical weeds of other folks' projects, because, hey, it's your project! I'm trying to map out the options in the hope that you could get a product you're happier with and maybe give you more time in a tight three-month schedule to improve on your work and not just complete it. Whatever you do, good luck and I'm interested to see the results!
Feel free to comment more. I am the one implementing the project, but that's all. Input from others is always welcome.
Petr Onderka