Sorry for e-mailing off list initially, and thanks for the reply. Had seen the multistream dumps but didn't know at all about those scripts.
Are you trying to let people update dumps of just the current revisions of pages (pages-articles), or the dumps with full edit histories (pages-meta-history) as well? On first read I thought you meant the former, then I thought the latter, now not sure.
Either way, you raise an interesting point about efficiently handling something like a bulk category update.
On Tue, Mar 26, 2013 at 1:54 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
Ok, my 'reply all' is failing me in this mail user agent. Anyways, third time's a charm...
---------- Forwarded message ----------
From: "Ariel T. Glenn" ariel@wikimedia.org
To: Randall Farmer randall@wawd.com
Date: Tue, 26 Mar 2013 09:33:25 +0200
Subject: Re: [Xmldatadumps-l] possible gsoc idea, comments?

Whoops, forgot to send this to the list. Also forgot to add the footnote, so doing that.
A.
On Tue, 26 Mar 2013 at 09:18 +0200, Ariel T. Glenn wrote:
On Mon, 25 Mar 2013 at 23:36 -0700, Randall Farmer wrote:
This isn't exactly what you're looking for, but I've been playing around on my own time with how to keep a dump that's compressed but also allows some random access. Last weekend I ended up writing the attached script, which takes an XML file and makes a simple gzipped, indexed, sort-of-random-access dump:
- Each article is individually gzipped, then the files are concatenated.
- gunzip -c [files] will still stream every page if your tools like that.
- I split the dump into 8 files, matching the core count of the EC2 instance running the job.
- It generated a text index (title, redirect dest., gzip file number, offset, length) you could load into memory or a database.
It took about 90 minutes for the gzipping/indexing, and the result was about 20 GB for enwiki. I used gzip compression level 1, because I was impatient. :)
I can share an EC2 disk snapshot with the actual dump reformatted this way, if that's at all interesting to you.
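For concreteness, the writing and reading sides of that format fit in a few lines of Python. This is just an illustrative sketch, not the attached script: the function names are made up, it writes a single data file rather than eight, and the index is a plain tab-separated text file.

import gzip

def write_indexed_dump(pages, data_path, index_path):
    # pages: iterable of (title, redirect_target, text) tuples; redirect_target may be ''.
    # Each page becomes its own gzip member; members are simply concatenated, so
    # `gunzip -c` on the data file still streams everything in order.
    with open(data_path, 'wb') as data, open(index_path, 'w', encoding='utf-8') as index:
        for title, redirect, text in pages:
            member = gzip.compress(text.encode('utf-8'), compresslevel=1)
            offset = data.tell()
            data.write(member)
            # title, redirect dest., offset, length (gzip file number omitted: one file here)
            index.write('%s\t%s\t%d\t%d\n' % (title, redirect, offset, len(member)))

def read_page(data_path, offset, length):
    # Random access: seek straight to the recorded member and decompress only that page.
    with open(data_path, 'rb') as data:
        data.seek(offset)
        return gzip.decompress(data.read(length)).decode('utf-8')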
This was the idea behind the bz2 multistream dumps and the associated python scripts for poking around in them [1]. I made a different choice in the tradeoff between space and performance, compressing 100 pages together.
It would be nice to have a place we could put community-generated data (new formats, subsets of the data, etc) for other folks to re-use. Maybe we could convince a mirror site to host such things. (Any takers?)
I haven't written anything that tries to update, but:
- You could replace an article by just appending the compressed version somewhere and updating the index. (For convenience, you might want to write a list of 'holes' (byte-ranges for replaced articles) somewhere.)
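In code, that replace-by-append step could look roughly like the sketch below. The assumptions are mine, not the attached script's: the index is held in memory as a dict of title -> (offset, length), and holes is a plain list of byte ranges.

import gzip, io

def replace_page(data_path, index, holes, title, new_text):
    # Append the newly compressed page at the end of the data file,
    # remember the old byte range as a 'hole', and repoint the index.
    member = gzip.compress(new_text.encode('utf-8'), compresslevel=1)
    with open(data_path, 'ab') as data:
        data.seek(0, io.SEEK_END)      # be explicit about appending at EOF
        new_offset = data.tell()
        data.write(member)
    holes.append(index[title])          # old (offset, length) becomes a hole
    index[title] = (new_offset, len(member))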
I assume we need indexing for fast retrieval.
- You could truly delete an old revision by writing zeroes over it, if that's a concern.
Yes, we would have to actually delete content (including the title and associated information, if the page has been deleted). This means that we won't have anything like 'undeletion'; we'll just be adding seemingly new page contents if a page is restored.
- Once you do that, streaming all the XML requires a script smart enough to skip the holes.
Yes, we need a script that will not only skip the holes but also write out the pages and revisions 'in canonical order', which, if free blocks are reused, might differ considerably from the order in which they are stored in this new format.
If we compress multiple items (revision texts) together in order not to completely lose out on disk space, consider this scenario: a bot goes through and edits all articles in Category X, moving them to Category Y. If we compress all revisions of a page together, we are going to do a lot of decompression in order to fold those changes in. This argues for compressing batches of revisions ordered simply by revision id, to minimize the decompression and recompression of unaltered material during the update. But this needs more thought.
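A toy illustration of that grouping-by-revision-id idea (the block size and framing are invented for the example and have nothing to do with the real dumps): a bulk bot run creates revisions with new, consecutive ids, so only the block(s) holding the newest ids would ever need recompressing.

import bz2, struct

REVS_PER_BLOCK = 100   # arbitrary block size for the example

def block_for(rev_id):
    # Revisions are grouped purely by id, not by page, so a burst of bot edits
    # (all new, high ids) lands in the newest block(s) only.
    return rev_id // REVS_PER_BLOCK

def compress_block(revisions):
    # revisions: list of (rev_id, text) whose ids all map to the same block.
    # Frame each revision as (id, length, bytes) so the block can be split apart again.
    parts = []
    for rev_id, text in sorted(revisions):
        raw = text.encode('utf-8')
        parts.append(struct.pack('>QI', rev_id, len(raw)) + raw)
    return bz2.compress(b''.join(parts))

def decompress_block(blob):
    data = bz2.decompress(blob)
    pos, revisions = 0, []
    while pos < len(data):
        rev_id, length = struct.unpack_from('>QI', data, pos)
        pos += struct.calcsize('>QI')
        revisions.append((rev_id, data[pos:pos + length].decode('utf-8')))
        pos += length
    return revisions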
- Once you have that script, you could use it to produce a 'cleaned' (i.e., holeless) copy of the dump periodically.
Like you say, better formats are possible: you could compress articles in batches, reuse holes for new blocks, etc. Updating an article becomes a less trivial operation when you do all of that. The point here is just that a relatively simple format can do a few of the key things you want.
FWIW, I think the really interesting part of your proposal might be how to package daily changes--filling the gaps in adds/changes dumps. Working on the parsing/indexing/compressing script, I sort of wrestled with whether what I was doing was substantially better than just loading the articles into a database as blobs. I'm still not sure. But more complete daily patch info is useful no matter how the data is stored.
Heh yes, I was hoping that we could make use of the adds/changes material somehow to get this going :-)
On the other hand, potentially in favor of making a dynamic dump format, it could be a platform for your "reference implementation" of an incremental update script. That is, Wikimedia publishes some scripts that will apply updates to a dump in a particular format, in hopes that random users can adapt them from updating a dump to updating a MySQL database, MongoDB cluster, or whatever other datastore they use.
This is harder because while we could possibly generate insert statements for the new rows in the page, revision and text tables and maybe even a series of delete statements, we can't do much for the other tables like pagelinks, categorylinks and so on. I haven't even tried to get my head around that problem, that's for further down the road.
Anyway, sorry for the scattered thoughts and hope some of this is useful or at least thought-provoking.
Randall
Keep those scattered thoughts coming!
Ariel
[1]
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000606.html
On Mon, Mar 25, 2013 at 4:22 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
So I was thinking about things I can't undertake, and one of those things is the 'dumps 2.0' idea which has been rolling around in the back of my mind. The TL;DR version is: a sparse compressed archive format that allows folks to add/subtract changes to it random-access (including during generation).
See here:
https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dum...
What do folks think? Workable? Nuts? Low priority? Interested?
Ariel
On Tue, 26 Mar 2013 at 11:42 -0700, Randall Farmer wrote:
Sorry for e-mailing off list initially, and thanks for the reply. Had seen the multistream dumps but didn't know at all about those scripts.
There aren't so many scripts just yet but I'm hoping people will be interested in writing more :-)
Are you trying to let people update dumps of just the current revisions of pages (pages-articles), or the dumps with full edit histories (pages-meta-history) as well? On first read I thought you meant the former, then I thought the latter, now not sure.
I was thinking of the full history dumps, which are the most time-consuming to produce, take forever to download, and are generally just a PITA. If we were going to introduce a new dump format I'd like to do it for those, and anything else could be discussed later.
Either way, you raise an interesting point about efficiently handling something like a bulk category update.
Category updates are a matter of shoveling in the latest category table produced along with all the other dump files, assuming that one's local copy of wp doesn't have editing going on. This isn't optimal but it's also a problem we can put off for another day.
Ariel
Especially with history dumps, I think it would make a lot of sense to use some kind of delta compression (like git's pack files do).
Delta compression was indeed on my mind when I wrote this description, but the devil is in the details :-)
Spent a little time looking at alternate compressors for revision histories. FWIW:
rzip impressed me. It compressed 15GB of dump to 44MB in three minutes: essentially the same ratio as .7z, 20x as fast (on the test VM, an EC2 c1.small). But, importantly, rzip only reads and writes seekable files (no pipes).
xdelta3 can also be used as a compressor for long-range redundancy in files. It's fast and works with pipes, but doesn't get the ratio of 7z or rzip. It got the 15G sample down to 141M in 2.5 min, and (as xdelta -9 | xz -9) down to 91M in 4 min. I can see it being practical if, say, you have local dump data that you foresee having to edit/recompress every so often.
I know you aren't exactly looking at faster compression as the goal here, but thought the info might be useful to you or someone working in this space.
I attached a Python script that compresses/decompresses files in 10MB chunks, and stores info about block boundaries so you can read random parts of the file. It's set up to use rzip or xdelta3 (a different package from xdelta) for compression, so you'll want one or both. It's public domain, with no warranty. No docs either; the command-line syntax tries to be like gzip's.
Some sample invocations of the script with timing and output-size info are below my signature.
The list of caveats could be about as long as the script--performance, hackability/readability, ease of installation, and flexibility are all suboptimal. At a minimum, it's not safe for exchanging files without at least:
1) making the way it reads/writes binary numbers CPU-architecture-independent (Python's array.array('l').tostring() is not),
2) adding a magic number, file format version, and compression-type tag, so the format/algorithm can be upgraded gracefully,
3) better error handling, like for the errors you get when rzip or xdelta3 are not installed,
4) testing.
The -blksdev filename suffix it uses reflects that it's a development format, not production-ready.
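For anyone who doesn't want to dig through the attachment, the core trick is just "compress fixed-size chunks and remember where each compressed chunk starts". Here is a stripped-down sketch of that idea, with bz2 standing in for rzip/xdelta3 and JSON standing in for the real index format, so this is not blks.py itself:

import bz2, json

CHUNK = 10 * 1024 * 1024   # 10MB blocks, as in the attached script

def compress_blocks(in_path, out_path, index_path):
    # Record (uncompressed_offset, compressed_offset, compressed_length) per block.
    index = []
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        uoff = 0
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            blob = bz2.compress(chunk)
            index.append((uoff, dst.tell(), len(blob)))
            dst.write(blob)
            uoff += len(chunk)
    with open(index_path, 'w') as f:
        json.dump(index, f)

def read_range(out_path, index_path, skip, length):
    # Decompress only the blocks that overlap [skip, skip + length).
    with open(index_path) as f:
        index = json.load(f)
    out = bytearray()
    with open(out_path, 'rb') as f:
        for uoff, coff, clen in index:
            if uoff + CHUNK <= skip or uoff >= skip + length:
                continue
            f.seek(coff)
            block = bz2.decompress(f.read(clen))
            out += block[max(0, skip - uoff):skip + length - uoff]
    return bytes(out)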
Last post I said xdelta3 and rzip compressed histories pretty well pretty quickly, but didn't expand (ha!) on that at all. Both programs have a first stage that quickly compresses long repetitions like you'll see in history files, at the cost of completely missing short-range redundancy. Then rzip uses bzip2, and xdelta3 can use its own compressor, for the short-range redundancy. Neither adds much value if your file doesn't have large long-range repetitions, which is why you don't hear about them as general-purpose compressors often.
Honestly don't know if anything down this path will suit your needs. Certainly this exact script doesn't--just seemed like an interesting thing to mess around with.
Best, Randall
The original file is enwiki-20130304-pages-meta-history22.xml-p018183504p018225000, 4.1G of history goodness.
File sizes and compression times:
  50M   rz-blks       2m17s  *
  76M   xd3djw-blks   1m47s  *
  89M   xd3-blks      1m38s
*'s are for the two options I find most interesting. Note you can stream data to/from blks.py, even though you can't stream to/from rzip.
For comparison, times and sizes without blocks:
  39M   7z    16m21s
  39M   rz     1m12s   (1m5s slower and 11M bigger with blocks)
  80M   xd3    1m      (38s slower and 9M bigger with blocks)
Specific command lines and timings:

# Compress using rzip
# -f = force overwrite of any existing file
# -k = keep original
$ time ~/blks.py -fk enwiki-20130304-pages-meta-history22.xml-p018183504p018225000
real    2m17.508s
user    0m58.584s
sys     0m55.983s
# Compress using xdelta3 -S djw
# -p picks your compressor
$ time ~/blks.py -pxd3djw -fk enwiki-20130304-pages-meta-history22.xml-p018183504p018225000
real    1m47.528s
user    0m39.402s
sys     0m44.051s
# Get some content crossing a 10M block boundary and make sure it matches the original
# --skip and --length say what part to read
# -d means to decompress
# -c means write to stdout
$ time ~/blks.py -dc --skip=9900000 --length=200000 enwiki-20130304-pages-meta-history22.xml-p018183504p018225000.rz-blks | md5sum
02122d6b4dc678ca138b680edd3b0067  -
real    0m0.358s

$ head --bytes=10100000 enwiki-20130304-pages-meta-history22.xml-p018183504p018225000 | tail --bytes=200000 | md5sum
02122d6b4dc678ca138b680edd3b0067  -
# Decompress everything to stdout, and md5sum
$ time ~/blks.py -dc enwiki-20130304-pages-meta-history22.xml-p018183504p018225000.rz-blks | md5sum
10fa39684af636b55c4cb1649359ead5  -
real    1m22.579s

$ time md5sum enwiki-20130304-pages-meta-history22.xml-p018183504p018225000
10fa39684af636b55c4cb1649359ead5  enwiki-20130304-pages-meta-history22.xml-p018183504p018225000
real    0m36.323s
To wrap up what I started earlier, here's a slightly tweaked copy of the last script I sent around: basic changes to make it no longer completely unusable (uses rzip only, container format always uses network byte order, header indicating file type). It also has a (slow) Python decompressor for the rzip format that it uses if the binary isn't installed, written while I was trying to understand rzip a bit better.
Just compressing chunks of rev. history like this probably isn't a great substitute for delta coding--even if more implementation quirks were ironed out, there still isn't a way to add revs w/out recompressing unrelated content; also, an efficient delta coder could definitely eat less memory than this (it would only have to keep a couple of revisions in RAM at a time, not 10MB).
This does at least show that some relatively blunt efforts to efficiently compress redundancy between revisions can save time and space (vs. bzip2), or save time spent on compression at relatively little space cost (vs. 7zip). If anybody has questions or ideas, happy to talk or try to help. But, all that said, declaring blks2.py a (kinda fun to work on!) dead end. :)
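For contrast, a bare-bones delta coder over consecutive revision texts can be written with nothing but difflib. This is purely illustrative and is not what blks2.py does:

import difflib

def make_delta(old, new):
    # Encode `new` as a list of ops against `old`: either "copy this slice of old"
    # or "insert this literal text". Only two revisions are ever in memory.
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag == 'equal':
            ops.append(('copy', i1, i2))
        else:
            ops.append(('insert', new[j1:j2]))   # covers replace/insert; delete adds ''
    return ops

def apply_delta(old, ops):
    parts = []
    for op in ops:
        if op[0] == 'copy':
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return ''.join(parts)

# apply_delta(rev_n, make_delta(rev_n, rev_n_plus_1)) == rev_n_plus_1

A real delta coder would store the copy ranges compactly and chain deltas back to periodic full snapshots, but even this toy shows why only a couple of revisions need to be held at once.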
Randall
Randall Farmer, 06/05/2013 08:37:
To wrap up what I started earlier, here's a slightly tweaked copy of the last script I sent around [...] But, all that said, declaring blks2.py a (kinda fun to work on!) dead end. :)
If you're done with it, you may want to drop it on a Wikimedia repo like https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=toys;h=0974d59e573fd5bceb76ec93878471bc11f6430c;hb=119d99131f2cf692819422ad5e516c49d935a504 or whatever, just so that it's not only a mail attachment. I also copied some short info you sent earlier to https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3 for lack of better existing pages (?).
Nemo
Sure.
you may want to drop it on a Wikimedia repo
This is done (https://gerrit.wikimedia.org/r/#/c/63139/). Besides the rzip script, there's another one that does a simple dedupe of lines of text repeated between revs, then gzips. It's slower than rzip at any given compression level, but still faster and smaller than straight bzip (in a test on 4G of enwiki, 50% smaller and 10x faster) and anyone with Python can run it.
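Something in that spirit fits in a dozen lines; in this sketch the framing ('=' prefix for literal lines, '@N' for back-references) is invented for illustration and is not the actual script's format:

import gzip

def dedupe_and_gzip(in_path, out_path):
    # Lines already seen are replaced by a back-reference to their first occurrence,
    # so gzip only has to deal with the (mostly novel) remainder.
    seen = {}
    with open(in_path, encoding='utf-8') as src, \
         gzip.open(out_path, 'wt', encoding='utf-8') as dst:
        for line in src:
            if line in seen:
                dst.write('@%d\n' % seen[line])
            else:
                seen[line] = len(seen)
                dst.write('=' + line)

def gunzip_and_expand(in_path, out_path):
    seen = []
    with gzip.open(in_path, 'rt', encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for token in src:
            if token.startswith('@'):
                line = seen[int(token[1:])]
            else:
                line = token[1:]
                seen.append(line)
            dst.write(line)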
Again, if there's any lesson it's just that there are some gains from making even pretty naive attempts to compress redundancy between revisions. Interested to see what the summer dumps project produces.
I've also expanded the braindump about this stuff on the dbzip2 page you linked to.