> Keeping the dumps in a text-based format doesn't make sense, because that can't be updated efficiently, which is the whole reason for the new dumps.

First, glad to see there's motion here.

It's definitely true that recompressing the entire history to .bz2 or .7z goes very, very slowly. Also, I don't know of an existing tool that lets you just insert new data here and there without compressing all of the unchanged data as well. Those point towards some sort of format change.

I'm not sure a new format has to be sparse or indexed to get around those two big problems.

For full-history dumps, delta coding (or the related idea of long-range redundancy compression) runs faster than bzip2 or 7z and, based on some tests, still gets good compression ratios. (I'm going to focus mostly on full-history dumps here because they're the hard case and the one Ariel said is currently painful--not everything here will apply to latest-revs dumps.)
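To make that concrete, here's a rough sketch of what I mean by delta coding, in Python with just the standard library; real tools (xdelta3, rzip-style long-range matching) would be much faster, and the function name here is made up for illustration:

    import bz2
    import difflib

    def delta_code_history(revisions):
        """Store the first revision in full and every later one as a
        unified diff against its predecessor, then compress the result."""
        pieces = [revisions[0]]
        for prev, cur in zip(revisions, revisions[1:]):
            diff = difflib.unified_diff(prev.splitlines(keepends=True),
                                        cur.splitlines(keepends=True),
                                        n=0)
            pieces.append("".join(diff))
        # The compressor only sees the first revision plus small diffs,
        # which is a much smaller input than every full revision text.
        return bz2.compress("\x1e".join(pieces).encode("utf-8"))

Treat that as the shape of the idea, not a benchmark.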

For inserting data, you do seemingly need to break the file up into independently-compressed sections, each containing one page's revision history or a fragment of it, so you can add new diffs to a page's history without decompressing and recompressing the previous revisions. (Removing previously-dumped revisions is another story, but it's rarer.) You'd be in new territory just doing that; I don't know of an existing compression tool that really supports it.
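A sketch of what that could look like, assuming a hypothetical layout where each page's revisions live in one or more bz2 blobs and a separate index records where each blob sits (the function names and index format are invented for illustration):

    import bz2

    def append_page_chunk(data_file, index, page_id, new_revisions_xml):
        """Compress only the new revisions and append them; nothing that
        was already written gets decompressed or recompressed."""
        blob = bz2.compress(new_revisions_xml.encode("utf-8"))
        data_file.seek(0, 2)              # jump to the end of the data file
        offset = data_file.tell()
        data_file.write(blob)
        # index: page_id -> list of (offset, length), oldest chunk first
        index.setdefault(page_id, []).append((offset, len(blob)))

    def read_page_history(data_file, index, page_id):
        """Reassemble one page's history by reading only its own chunks."""
        parts = []
        for offset, length in index.get(page_id, []):
            data_file.seek(offset)
            parts.append(bz2.decompress(data_file.read(length)))
        return b"".join(parts).decode("utf-8")

The index could be as dumb as a JSON file rewritten alongside the data; the point is only that old chunks never get touched.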

You could do those two things, though, while still keeping full-history dumps a once-every-so-often batch process that produces a sorted file. The time to rewrite the file, stripped of the big compression steps, could be bearable--a disk can read or write about 100 MB/s, so just copying the 70 GB of the .7z enwiki dumps takes well under an hour; if the CPU-bound part and the other steps are smallish, you're OK.
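Just to make the back-of-the-envelope explicit (the throughput figure is the rough number above, not a measurement):

    size_gb       = 70     # compressed full-history enwiki, roughly
    disk_mb_per_s = 100    # sequential read or write throughput
    one_pass_s    = size_gb * 1000 / disk_mb_per_s
    print(one_pass_s / 60)        # ~12 minutes to read (or write) it once
    print(2 * one_pass_s / 60)    # ~23 minutes for a read-everything,
                                  # write-everything rewrite pass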

A format like the proposed one, with revisions inserted wherever there's free space when they come in, will also eventually fragment the revision history for one page (I think Ariel alluded to this in some early notes). Unlike sequential reads and writes, seeks are something HDDs are sadly pretty slow at (hence the excitement about solid-state disks). If thousands of revisions are coming in a day, it eventually becomes slow to read things in the old page/revision order, and you either need fancy techniques to defrag (maybe a big external-memory sort) or you need to only read the dump on fast hardware that can handle the seeks. Doing occasional batch jobs that produce sorted files could help avoid the fragmentation question.
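The batch job itself doesn't need to be fancy. In sketch form, assuming both inputs already yield (key, record) pairs keyed by (page_id, rev_id) in sorted order--which is what the previous sorted dump plus a freshly sorted batch of new revisions would give you:

    import heapq

    def rewrite_sorted(old_dump, new_revisions, out_file):
        """Merge the previous sorted dump with a sorted batch of new
        revisions into a fresh sorted file, so each page's history stays
        contiguous and reads stay sequential (no seek storm)."""
        merged = heapq.merge(old_dump, new_revisions,
                             key=lambda item: item[0])  # item = (key, record)
        for _key, record in merged:
            out_file.write(record)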

There's a great quote about the difficulty of "constructing a software design...to make it so simple that there are obviously no deficiencies." (Wikiquote came through with the full text/attribution, of course.) I admit it's tricky, and people can disagree about what's simple enough or even which of two approaches is the simpler one, but it's something to strive for.

Anyway, I'm wary about going into the technical weeds of other folks' projects, because, hey, it's your project! I'm just trying to map out the options, in the hope that you end up with a product you're happier with and maybe get more time out of a tight three-month schedule to improve on your work rather than just complete it. Whatever you do, good luck and I'm interested to see the results!


On Wed, Jul 3, 2013 at 7:04 AM, Petr Onderka <gsvick@gmail.com> wrote:
A reply to all those who basically want to keep the current XML dumps:

I have decided to change the primary way of reading the dumps: it will now be a command line application that outputs the data as uncompressed XML, in the same format as current dumps.

This way, you should be able to use the new dumps with minimal changes to your code.

Keeping the dumps in a text-based format doesn't make sense, because that can't be updated efficiently, which is the whole reason for the new dumps.

Petr Onderka


On Mon, Jul 1, 2013 at 11:10 PM, Byrial Jensen <byrial@vip.cybercity.dk> wrote:
Hi,

As a regular user of dump files I would not want a "fancy" file format with indexes stored as trees etc.

I parse all the dump files (both the SQL tables and the XML files) with a one-pass parser which inserts the data I want (which sometimes is only a small fraction of the total amount of data in the file) into my local database. I will normally never store uncompressed dump files, but pipe the uncompressed data directly from bunzip or gunzip to my parser to save disk space. Therefore it is important to me that the format is simple enough for a one-pass parser.

I cannot really imagine who would use a library with an object-oriented API to read dump files. No matter what, it would be inefficient and have fewer features and possibilities than using a real database.

I could live with a binary format, but I have doubts about whether it is a good idea. It will be harder to make sure that your parser is working correctly, and you have to consider things like endianness, size of integers, format of floats etc., which cause no problems in text formats. The binary files may be smaller uncompressed (which I don't store anyway) but not necessarily when compressed, as the compression will do better on text files.

Regards,
- Byrial


_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

