Especially with history dumps, I think it would make a lot of sense to use some kind of delta compression (like git's pack files do).
Delta compression was indeed on my mind when I wrote this description, but the devil is in the details :-)
Spent a little time looking at alternate compressors for revision histories. FWIW:
rzip impressed me. It compressed 15GB of dump to 44MB in three minutes: essentially the same ratio as .7z, 20x as fast (on the test VM, an EC2 c1.small). But, importantly, rzip only reads and writes seekable files (no pipes).
xdelta3 can also be used as a compressor for long-range redundancy in files. It's fast and works with pipes, but doesn't get the ratio of 7z or rzip. It got the 15G sample down to 141M in 2.5 min, and (as xdelta -9 | xz -9) down to 91M in 4 min. I can see it being practical if, say, you have local dump data that you foresee having to edit/recompress every so often.
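To make the delta idea concrete, here is a minimal Python sketch that compresses one revision against the previous one by passing the previous revision to zlib as a preset dictionary. This is only a stand-in for illustration: it is not how xdelta3, rzip, or git pack files work internally, the helper names `compress_against` and `decompress_against` are made up for this example, and deflate only looks at the last 32 KB of the dictionary, so it would not help with large pages.

```python
import zlib


def compress_against(base: bytes, target: bytes) -> bytes:
    """Compress `target`, letting deflate refer back into `base`.

    `base` (the previous revision) is used as a preset dictionary, so text
    shared between the two revisions is encoded as short back-references.
    """
    compressor = zlib.compressobj(level=9, zdict=base)
    return compressor.compress(target) + compressor.flush()


def decompress_against(base: bytes, blob: bytes) -> bytes:
    """Inverse of compress_against; needs the same `base` to decode."""
    decompressor = zlib.decompressobj(zdict=base)
    return decompressor.decompress(blob) + decompressor.flush()
```

When two revisions differ by a small edit, the blob produced against the previous revision should be much smaller than compressing the new revision on its own, at the cost of needing the previous revision in hand before you can read the new one.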
I know you aren't exactly looking at faster compression as the goal here, but thought the info might be useful to you or someone working in this space.
On Tue, Mar 26, 2013 at 11:42 AM, Randall Farmer randall@wawd.com wrote:
Sorry for e-mailing off list initially, and thanks for the reply. Had seen the multistream dumps but didn't know at all about those scripts.
Are you trying to let people update dumps of just the current revisions of pages (pages-articles), or the dumps with full edit histories (pages-meta-history) as well? On first read I thought you meant the former, then I thought the latter, now not sure.
Either way, you raise an interesting point about efficiently handling something like a bulk category update.
On Tue, Mar 26, 2013 at 1:54 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
Ok, my 'reply all' is failing me in this mail user agent. Anyways, third time's a charm...
---------- Forwarded message ----------
From: "Ariel T. Glenn" ariel@wikimedia.org
To: Randall Farmer randall@wawd.com
Cc:
Date: Tue, 26 Mar 2013 09:33:25 +0200
Subject: Re: [Xmldatadumps-l] possible gsoc idea, comments?

Woops, forgot to send this to the list. Also forgot to add the footnote so doing that.
A.
On Tue, 26-03-2013, at 09:18 +0200, Ariel T. Glenn wrote:
On Mon, 25-03-2013, at 23:36 -0700, Randall Farmer wrote:
This isn't exactly what you're looking for, but I've been playing around on my own time with how to keep a dump that's compressed but also allows some random access. Last weekend I ended up writing the attached script, which takes an XML file and makes a simple gzipped, indexed, sort-of-random-access dump:
- Each article is individually gzipped, then the files are concatenated.
- gunzip -c [files] will still stream every page if your tools like that.
- I split the dump into 8 files, matching the core count of the EC2 instance running the job.
- It generated a text index (title, redirect dest., gzip file number, offset, length) you could load into memory or a database.
It took about 90 minutes for the gzipping/indexing, and the result was about 20 GB for enwiki. I used gzip compression level 1, because I was impatient. :)
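The attached script itself is not reproduced in this archive. As a rough illustration of the layout described above (one gzip member per article, members concatenated, plus an offset/length index), a minimal Python sketch might look like the following. The function names and the simplified index (title, offset, length only, one output file, no redirect field) are assumptions made for this example, not the script's actual design.

```python
import gzip


def write_indexed_dump(pages, out):
    """Write each page as its own gzip member and return an index.

    `pages` is an iterable of (title, xml_text) pairs; `out` is a seekable
    binary file. The returned index is a list of (title, offset, length).
    """
    index = []
    for title, text in pages:
        # Level 1 mirrors the fast setting mentioned in the message above.
        member = gzip.compress(text.encode("utf-8"), compresslevel=1)
        index.append((title, out.tell(), len(member)))
        out.write(member)
    return index


def read_page(f, offset, length):
    """Random access: seek to one member and decompress only that page."""
    f.seek(offset)
    return gzip.decompress(f.read(length)).decode("utf-8")
```

Because a file of concatenated gzip members is itself a valid gzip file, decompressing the whole thing still yields every page in order, which is what keeps `gunzip -c` working.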
I can share an EC2 disk snapshot with the actual dump reformatted this way, if that's at all interesting to you.
This was the idea behind the bz2 multistream dumps and the associated python scripts for poking around in them [1]. I made a different choice in the tradeoff between space and performance, compressing 100 pages together.
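For comparison, the batched variant can be sketched in Python along these lines: pages are grouped, each group becomes one independent bz2 stream, and the index records only the offset of the stream holding a page. This is a hedged illustration of the general multistream idea, not the actual dump-generation code or the scripts linked in [1]; `write_multistream` and `read_stream` are hypothetical names.

```python
import bz2


def write_multistream(pages, out, pages_per_stream=100):
    """Compress pages in batches, one bz2 stream per batch.

    `pages` is an iterable of (title, xml_text); `out` is a seekable binary
    file. Returns a list of (title, stream_offset) entries.
    """
    index = []
    batch = []

    def flush():
        if not batch:
            return
        offset = out.tell()
        payload = "".join(text for _, text in batch).encode("utf-8")
        out.write(bz2.compress(payload))
        index.extend((title, offset) for title, _ in batch)
        batch.clear()

    for page in pages:
        batch.append(page)
        if len(batch) == pages_per_stream:
            flush()
    flush()  # final, possibly short, batch
    return index


def read_stream(f, offset, chunk_size=65536):
    """Decompress the single bz2 stream that starts at `offset`."""
    f.seek(offset)
    decompressor = bz2.BZ2Decompressor()
    parts = []
    while not decompressor.eof:
        data = f.read(chunk_size)
        if not data:
            break
        parts.append(decompressor.decompress(data))
    return b"".join(parts).decode("utf-8")
```

Fetching one page means decompressing its whole batch and then scanning for the page, so a larger batch size buys a better ratio at the price of slower single-page lookups.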
It would be nice to have a place we could put community-generated data (new formats, subsets of the data, etc) for other folks to re-use. Maybe we could convince a mirror site to host such things. (Any takers?)
I haven't written anything that tries to update, but:
- You could replace an article by just appending the compressed version somewhere and updating the index. (For convenience, you might want to write a list of 'holes' (byte-ranges for replaced articles) somewhere.)
I assume we need indexing for fast retrieval.
- You could truly delete an old revision by writing zeroes over it, if that's a concern.
Yes, we would have to actually delete content (includes the title and associated information, if the page has been deleted). This means that we won't have anything like 'undeletion', we'll just be adding seemingly new page contents if a page is restored.
- Once you do that, streaming all the XML requires a script smart enough to skip the holes.
Yes, we need a script that will not only skip the holes but write out the pages and revisions 'in canonical order', which, if free blocks are reused, might differ considerably from the order they are stored in this new format.
If we compress multiple items (revision texts) together in order to not completely lose on disk space, consider this scenario: a bot goes through and edits all articles in Category X, moving them to Category Y. If we compress all revisions of a page together, we are going to do a lot of uncompression in order to fold those changes in. This argues for compressing multiple revisions together ordered simply by revision id, to minimize the uncompression and recompression of unaltered material during the update. But this needs more thought.
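A back-of-the-envelope Python sketch of that tradeoff, counting how many unaltered revisions have to be decompressed and recompressed when a bot adds one new revision to each page in a category. It rests on stated assumptions: new revisions always receive the highest revision ids, the per-page layout keeps one compressed block per page, and the by-id layout reopens only the partially filled tail block. Both function names are hypothetical.

```python
def unaltered_recompressed_by_page(revs_per_page, edited_pages):
    """One block per page: every edited page's whole history is rewritten.

    `revs_per_page` maps page id to its existing revision count.
    """
    return sum(revs_per_page[page] for page in set(edited_pages))


def unaltered_recompressed_by_rev_id(total_existing_revs, block_size):
    """Blocks filled in revision-id order: new revisions go at the tail.

    Only the last, partially filled block (if any) has to be reopened; all
    full blocks of older revisions stay untouched.
    """
    return total_existing_revs % block_size
```

With 1000 pages of 50 revisions each and a bot run touching 200 of them, the per-page layout recompresses 10000 old revisions, while the by-id layout with 128 revisions per block recompresses only the 80 sitting in the tail block. The by-id layout pays for this elsewhere, since reading one page's full history then touches many blocks.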
- Once you have that script, you could use it to produce a 'cleaned' (i.e., holeless) copy of the dump periodically.
Like you say, better formats are possible: you could compress articles in batches, reuse holes for new blocks, etc. Updating an article becomes a less trivial operation when you do all of that. The point here is just that a relatively simple format can do a few of the key things you want.
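The simple update scheme from the list above (append the new version, zero out the old bytes, remember the hole, skip holes when streaming) can be sketched in Python roughly as follows. Nothing here comes from an existing script: the `SparseDump` class, its in-memory index, and its method names are assumptions chosen for illustration, and the sketch stores one gzip member per page with no batching or hole reuse.

```python
import gzip
import io


class SparseDump:
    """Append-only page store with zeroed-out holes (illustrative only)."""

    def __init__(self, f):
        self.f = f          # seekable binary file opened for read/write
        self.index = {}     # title -> (offset, length) of the live copy
        self.holes = []     # (offset, length) ranges that were zeroed

    def _zero(self, offset, length):
        # Overwrite the old bytes so the content is truly gone.
        self.f.seek(offset)
        self.f.write(b"\0" * length)
        self.holes.append((offset, length))

    def put(self, title, text):
        """Add or replace a page by appending and updating the index."""
        if title in self.index:
            self._zero(*self.index[title])
        member = gzip.compress(text.encode("utf-8"))
        self.f.seek(0, io.SEEK_END)
        self.index[title] = (self.f.tell(), len(member))
        self.f.write(member)

    def delete(self, title):
        self._zero(*self.index.pop(title))

    def stream(self):
        """Yield live pages in storage order; holes are never read."""
        for offset, length in sorted(self.index.values()):
            self.f.seek(offset)
            yield gzip.decompress(self.f.read(length)).decode("utf-8")
```

Writing the output of `stream()` into a fresh `SparseDump` gives the periodic holeless copy. Note that storage order is not the canonical page/revision order, so a real export script would need to sort by id before writing XML.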
FWIW, I think the really interesting part of your proposal might be how to package daily changes--filling the gaps in adds/changes dumps. Working on the parsing/indexing/compressing script, I sort of wrestled with whether what I was doing was substantially better than just loading the articles into a database as blobs. I'm still not sure. But more complete daily patch info is useful no matter how the data is stored.
Heh yes, I was hoping that we could make use of the adds/changes material somehow to get this going :-)
On the other hand, potentially in favor of making a dynamic dump format, it could be a platform for your "reference implementation" of an incremental update script. That is, Wikimedia publishes some scripts that will apply updates to a dump in a particular format, in hopes that random users can adapt them from updating a dump to updating a MySQL database, MongoDB cluster, or whatever other datastore they use.
This is harder because while we could possibly generate insert statements for the new rows in the page, revision and text tables and maybe even a series of delete statements, we can't do much for the other tables like pagelinks, categorylinks and so on. I haven't even tried to get my head around that problem, that's for further down the road.
Anyway, sorry for the scattered thoughts and hope some of this is useful or at least thought-provoking.
Randall
Keep those scattered thoughts coming!
Ariel
[1] http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000606.html
On Mon, Mar 25, 2013 at 4:22 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
So I was thinking about things I can't undertake, and one of those things is the 'dumps 2.0' which has been rolling around in the back of my mind. The TL;DR version is: sparse compressed archive format that allows folks to add/subtract changes to it random-access (including during generation).
See here:
https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dum...
What do folks think? Workable? Nuts? Low priority? Interested?
Ariel
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l