First, glad to see there's motion here.
It's definitely true that recompressing the entire history to .bz2 or .7z
goes very, very slowly. Also, I don't know of an existing tool that lets
you just insert new data here and there without compressing all of the
unchanged data as well. Those point towards some sort of format change.
I'm not sure a new format has to be sparse or indexed to get around those
two big problems.
For full-history dumps, delta coding (or the related idea of long-range
redundancy compression) runs faster than bzip2 or 7z and still produces good
compression ratios, based on some tests
<https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3>. (I'm going to focus
mostly on full-history dumps here because they're the hard case and the one
Ariel said is currently painful--not everything here will apply to
latest-revs dumps.)
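To make the delta-coding idea concrete, here is a minimal sketch in Python. It is not rzip or xdelta3 (the tools in the tests linked above); it stands in for them by handing the previous revision to zlib as a preset dictionary, so the new revision is stored mostly as back-references. The function names are made up for illustration, and zlib only looks back 32 KiB, so this is far from true long-range compression.

```python
import zlib

WINDOW = 32768  # zlib can only reference the last 32 KiB of a preset dictionary


def encode_delta(previous: bytes, current: bytes) -> bytes:
    """Compress `current`, letting zlib point back into `previous` (assumed non-empty)."""
    compressor = zlib.compressobj(level=9, zdict=previous[-WINDOW:])
    return compressor.compress(current) + compressor.flush()


def decode_delta(previous: bytes, delta: bytes) -> bytes:
    """Rebuild the current revision from the previous one plus the stored delta."""
    decompressor = zlib.decompressobj(zdict=previous[-WINDOW:])
    return decompressor.decompress(delta) + decompressor.flush()


if __name__ == "__main__":
    rev1 = " ".join(str(i * i) for i in range(500)).encode("utf-8")
    rev2 = rev1 + b" plus one small edit at the end"
    delta = encode_delta(rev1, rev2)
    assert decode_delta(rev1, delta) == rev2
    print(len(zlib.compress(rev2, 9)), "bytes standalone vs", len(delta), "bytes as a delta")
```

The point is only that a revision that mostly repeats its predecessor costs very little to store once the compressor can see the predecessor.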
For inserting data, you do seemingly need to break the file up into
independently-compressed sections containing just one page's revision
history or a fragment of it, so you can add new diff(s) to a page's
revision history without decompressing and recompressing the previous
revisions. (Removing previously-dumped revisions is another story, but it's
rarer.) You'd be in new territory just doing that; I don't know of existing
compression tools that really allow that.
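As a rough illustration of independently-compressed sections, here is a Python sketch of an append-only container in which every revision is its own zlib stream and an index records where each page's blocks live. The class name, the in-memory buffer and the index layout are all assumptions for the example, not anything from the proposed format.

```python
import io
import zlib
from collections import defaultdict


class BlockStore:
    """Hypothetical container: one independently-compressed block per revision."""

    def __init__(self):
        self.data = io.BytesIO()         # stands in for the dump file
        self.index = defaultdict(list)   # page_id -> [(offset, length), ...]

    def append(self, page_id: int, revision_text: str) -> None:
        """Add a revision without reading or recompressing any existing block."""
        block = zlib.compress(revision_text.encode("utf-8"))
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(block)
        self.index[page_id].append((offset, len(block)))

    def revisions(self, page_id: int) -> list[str]:
        """Decompress only the blocks that belong to one page."""
        texts = []
        for offset, length in self.index.get(page_id, []):
            self.data.seek(offset)
            texts.append(zlib.decompress(self.data.read(length)).decode("utf-8"))
        return texts


if __name__ == "__main__":
    store = BlockStore()
    store.append(1, "first revision of page 1")
    store.append(2, "first revision of page 2")
    store.append(1, "second revision of page 1")
    print(store.revisions(1))
```

Note that appending at the end scatters a page's blocks across the file, so a real format would still have to decide where new blocks go.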
You could do those two things, though, while still keeping full-history
dumps a once-every-so-often batch process that produces a sorted file. The
time to rewrite the file, stripped of the big compression steps, could be
bearable--a disk can read or write about 100 MB/s, so just copying the 70G
of the .7z enwiki dumps is well under an hour; if the part bound by CPU and
other steps is smallish, you're OK.
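The back-of-the-envelope arithmetic, spelled out in Python using only the two round figures above (70 GB and 100 MB/s, both approximations):

```python
dump_size_bytes = 70 * 10**9           # ~70 GB .7z enwiki dump (figure from the text)
throughput_bytes_per_s = 100 * 10**6   # ~100 MB/s sequential disk speed (figure from the text)

copy_seconds = dump_size_bytes / throughput_bytes_per_s
copy_minutes = copy_seconds / 60

print(f"{copy_seconds:.0f} s, about {copy_minutes:.1f} minutes")
# prints "700 s, about 11.7 minutes"
```

Even if reading and writing share one disk and the real figure is closer to double that, it stays well under an hour.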
A format like the proposed one, with revisions inserted wherever there's
free space when they come in, will also eventually fragment the revision
history for one page (I think Ariel alluded to this in some early notes).
Unlike sequential reads and writes, seeks are something HDDs are sadly pretty
slow at (hence the excitement about solid-state disks); if thousands of
revisions are coming in a day, it eventually becomes slow to read things in
the old page/revision order, and you need fancy techniques to defrag (maybe
a big external-memory sort <http://en.wikipedia.org/wiki/External_sorting>)
or you need to only read the dump on fast hardware that can handle the
seeks. Doing occasional batch jobs that produce sorted files could help
avoid the fragmentation question.
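For reference, the external-memory sort mentioned above can be sketched in a few lines of Python: sort bounded-size chunks, spill each as a sorted run to a temporary file, then do a k-way merge of the runs. The (page_id, rev_id) record shape and the function name are assumptions for the example; a real defrag would carry offsets or revision data along with the keys.

```python
import heapq
import tempfile


def external_sort(records, chunk_size):
    """Yield (page_id, rev_id) pairs in sorted order using bounded memory."""
    runs = []
    chunk = []

    def spill():
        # Write the current chunk to disk as one sorted run.
        run = tempfile.TemporaryFile(mode="w+")
        for page_id, rev_id in sorted(chunk):
            run.write(f"{page_id} {rev_id}\n")
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    def read_run(run):
        # Stream one run back from disk, a line at a time.
        for line in run:
            page_id, rev_id = line.split()
            yield int(page_id), int(rev_id)

    try:
        # k-way merge: only one record per run is held in memory at a time.
        yield from heapq.merge(*(read_run(run) for run in runs))
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    scattered = [(3, 1), (1, 2), (2, 5), (1, 1), (3, 0), (2, 4)]
    print(list(external_sort(scattered, chunk_size=2)))
```

Each pass here is sequential I/O, which is the reason this kind of sort suits spinning disks better than seeking around a fragmented file.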
There's a great quote about the difficulty of "constructing a software
design...to make it so simple that there are obviously no deficiencies."
(Wikiquote came through with the full text/attribution, of
course <http://en.wikiquote.org/wiki/C._A._R._Hoare>.)
I admit it's tricky and people can disagree about what's simple enough or
even what approach is simpler of two choices, but it's something to strive
for.
Anyway, I'm wary about going into the technical weeds of other folks'
projects, because, hey, it's your project! I'm trying to map out the
options in the hope that you could get a product you're happier with and
maybe give you more time in a tight three-month schedule to improve on your
work and not just complete it. Whatever you do, good luck and I'm
interested to see the results!
On Wed, Jul 3, 2013 at 7:04 AM, Petr Onderka <gsvick(a)gmail.com> wrote:
> A reply to all those who basically want to keep the current XML dumps:
>
> I have decided to change the primary way of reading the dumps: it will now
> be a command line application that outputs the data as uncompressed XML, in
> the same format as current dumps.
>
> This way, you should be able to use the new dumps with minimal changes to
> your code.
>
> Keeping the dumps in a text-based format doesn't make sense, because that
> can't be updated efficiently, which is the whole reason for the new dumps.
>
> Petr Onderka
>
>
> On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial(a)vip.cybercity.dk> wrote:
>
>> Hi,
>>
>> As a regular user of dump files I would not want a "fancy" file format
>> with indexes stored as trees etc.
>>
>> I parse all the dump files (both for SQL tables and the XML files) with a
>> one pass parser which inserts the data I want (which sometimes is only a
>> small fraction of the total amount of data in the file) into my local
>> database. I will normally never store uncompressed dump files, but pipe the
>> uncompressed data directly from bunzip or gunzip to my parser to save disk
>> space. Therefore it is important to me that the format is simple enough for
>> a one pass parser.
>>
>> I cannot really imagine who would use a library with object oriented API
>> to read dump files. No matter what it would be inefficient and have fewer
>> features and possibilities than using a real database.
>>
>> I could live with a binary format, but I have doubts about whether it is a
>> good idea. It will be harder to make sure that your parser is working
>> correctly, and you have to consider things like endianness, size of
>> integers, format of floats etc. which give no problems in text formats. The
>> binary files may be smaller uncompressed (which I don't store anyway) but
>> not necessarily when compressed, as the compression will do better on text
>> files.
>>
>> Regards,
>> - Byrial
>>
>>
>> _______________________________________________
>> Xmldatadumps-l mailing list
>> Xmldatadumps-l(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>>