Re: [Xmldatadumps-l] [Wikitech-l] Suggested file format of new incremental dumps

10 Jul 2013

On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer <randall(a)wawd.com> wrote:

...
 Keeping the dumps in a text-based format doesn't make sense, because that
 can't be updated efficiently, which is the whole reason for the new dumps.

 First, glad to see there's motion here.

 It's definitely true that recompressing the entire history to .bz2 or .7z
 goes very, very slowly. Also, I don't know of an existing tool that lets
 you just insert new data here and there without compressing all of the
 unchanged data as well. Those point towards some sort of format change.

 I'm not sure a new format has to be sparse or indexed to get around those
 two big problems.

 For full-history dumps, delta coding (or the related idea of long-range
 redundancy compression) runs faster than bzip2 or 7z and produces good
 compression ratios on full-history dumps, based on some tests
 <https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3>. (I'm going to
 focus mostly on full-history dumps here because they're the hard case and
 one Ariel said is currently painful--not everything here will apply to
 latest-revs dumps.)
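
To make the delta-coding idea concrete, here is a minimal sketch that stores a revision as copy/insert operations against the previous revision. It uses Python's difflib rather than xdelta3 or rzip (the tools in the linked tests), so it shows the principle only, not their performance. The names make_delta and apply_delta are hypothetical.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Encode `new` as copy/insert operations against `old`."""
    ops = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged text is stored as a reference into the old revision.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only new text is stored literally.
            ops.append(("insert", new[j1:j2]))
        # "delete" needs no operation: the old span is simply not copied.
    return ops

def apply_delta(old, ops):
    """Rebuild the new revision from the old one plus the delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

rev1 = "The quick brown fox jumps over the lazy dog.\n" * 20
rev2 = rev1.replace("lazy", "sleepy", 1)
delta = make_delta(rev1, rev2)
restored = apply_delta(rev1, delta)
```

Because consecutive revisions of a page usually differ by a small edit, the literal text kept in the delta is tiny compared with the full revision.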

 For inserting data, you do seemingly need to break the file up into
 independently-compressed sections containing just one page's revision
 history or a fragment of it, so you can add new diff(s) to a page's
 revision history without decompressing and recompressing the previous
 revisions. (Removing previously-dumped revisions is another story, but it's
 rarer.) You'd be in new territory just doing that; I don't know of existing
 compression tools that really allow that.
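
One way to picture independently compressed sections is the sketch below. It assumes a made-up record layout (page id, compressed length, zlib payload); this is not the proposed dump format, and append_block and read_page are hypothetical names. The point is that appending a revision only writes a new block and leaves earlier bytes untouched.

```python
import io
import struct
import zlib

# Hypothetical record header: page id and compressed payload length.
HEADER = struct.Struct("<II")

def append_block(f, page_id, text):
    """Compress one revision on its own and append it to the container."""
    payload = zlib.compress(text.encode("utf-8"))
    f.seek(0, io.SEEK_END)
    f.write(HEADER.pack(page_id, len(payload)))
    f.write(payload)

def read_page(f, page_id):
    """Return all revisions stored for one page, in the order written."""
    f.seek(0)
    revisions = []
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            break
        pid, length = HEADER.unpack(header)
        if pid == page_id:
            revisions.append(zlib.decompress(f.read(length)).decode("utf-8"))
        else:
            # Skip other pages' blocks without decompressing them.
            f.seek(length, io.SEEK_CUR)
    return revisions

dump = io.BytesIO()
append_block(dump, 1, "page one, first revision")
append_block(dump, 2, "page two, first revision")
before = dump.getvalue()
append_block(dump, 1, "page one, second revision")
```

A real format would also need an index so a reader does not have to scan every header, and blocks for one page end up scattered through the file.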

 You could do those two things, though, while still keeping full-history
 dumps a once-every-so-often batch process that produces a sorted file. The
 time to rewrite the file, stripped of the big compression steps, could be
 bearable--a disk can read or write about 100 MB/s, so just copying the 70G
 of the .7z enwiki dumps is well under an hour; if the part bound by CPU and
 other steps is smallish, you're OK.
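
The back-of-the-envelope figure above can be checked with a few lines. The function name copy_minutes is hypothetical, and the calculation assumes decimal units (1 GB = 1000 MB) and a sustained sequential rate.

```python
def copy_minutes(size_gb, mb_per_s=100.0):
    """Minutes needed to stream `size_gb` gigabytes at `mb_per_s` MB/s."""
    return size_gb * 1000 / mb_per_s / 60

one_pass = copy_minutes(70)        # read or write 70 GB once
read_and_write = 2 * one_pass      # both passes on the same disk, no overlap
```

At 100 MB/s a single pass over 70 GB takes about 12 minutes, and even reading and writing on the same disk stays under half an hour.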

 A format like the proposed one, with revisions inserted wherever there's
 free space when they come in, will also eventually fragment the revision
 history for one page (I think Ariel alluded to this in some early notes).
 Unlike sequential reads and writes, seeks are something HDDs are sadly pretty
 slow at (hence the excitement about solid-state disks); if thousands of
 revisions are coming in a day, it eventually becomes slow to read things in
 the old page/revision order, and you need fancy techniques to defrag (maybe
 a big external-memory sort <http://en.wikipedia.org/wiki/External_sorting>)
 or you need to only read the dump on fast hardware that can handle the
 seeks. Doing occasional batch jobs that produce sorted files could help
 avoid the fragmentation question.
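
As an illustration of the external-sort idea, the sketch below sorts fixed-size runs and then merges them with heapq.merge. For brevity the runs stay in memory; a real external sort would write each run to disk and stream it back during the merge. The (page_id, rev_id) records and the name external_sort are assumptions made for the example.

```python
import heapq

def external_sort(records, run_size):
    """Sort records by splitting them into sorted runs and merging the runs."""
    runs = []
    for start in range(0, len(records), run_size):
        # Each run is small enough to sort on its own.
        runs.append(sorted(records[start:start + run_size]))
    # heapq.merge walks all runs sequentially, which avoids random seeks.
    return list(heapq.merge(*runs))

# Revisions in arrival order, i.e. fragmented across pages.
arrived = [(3, 10), (1, 11), (2, 12), (1, 13), (3, 14), (1, 15), (2, 16)]
in_page_order = external_sort(arrived, run_size=3)
```

The merged output groups each page's revisions together again, which is the page/revision order the existing dumps use.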

These are some interesting ideas.

You're right that copying the whole dump is fast enough (it would
probably add about an hour to a process that currently takes several days).
But it would also pretty much force the use of delta compression. And while
I would like to use delta compression, I don't think it's a good idea to be
forced to use it, because I might not have the time for it or it might not
be good enough.

Because of that, I decided to stay with my indexed approach.

...
 There's a great quote about the difficulty of "constructing a software
 design...to make it so simple that there are obviously no deficiencies."
 (Wikiquote came through with the full text/attribution, of course
 <http://en.wikiquote.org/wiki/C._A._R._Hoare>.)
 I admit it's tricky and people can disagree about what's simple enough or
 even what approach is simpler of two choices, but it's something to strive
 for.

 Anyway, I'm wary about going into the technical weeds of other folks'
 projects, because, hey, it's your project! I'm trying to map out the
 options in the hope that you could get a product you're happier with and
 maybe give you more time in a tight three-month schedule to improve on your
 work and not just complete it. Whatever you do, good luck and I'm
 interested to see the results!

Feel free to comment more. I am the one implementing the project, but
that's all. Input from others is always welcome.

Petr Onderka
