Re: [Xmldatadumps-l] [Fwd: Re: possible gsoc idea, comments?]

1 Apr 2013

...
  Especially with history dumps, I think it would make a
lot of sense to
 use some kind of delta compression (like git's pack files do). 
...
  Delta compression was indeed on my mind when I wrote
this description,
 but th devil is in the details :-) 
Spent a little time looking at alternate compressors for revision
histories. FWIW:

rzip impressed me. It compressed 15GB of dump to 44MB in three minutes:
essentially the same ratio as .7z, 20x as fast (on the test VM, an EC2
c1.small). But, importantly, rzip only reads and write seekable files (no
pipes).

xdelta3 can also be used as a compressor for long-range redundancy in
files. It's fast and works with pipes, but doesn't get the ratio of 7z or
rzip. It got the 15G sample down to 141M in 2.5 min, and (as xdelta -9 | xz
-9) down to 91M in 4 min. I can see it being practical if, say, you have
local dump data that you foresee having to edit/recompress every so often.

I know you aren't exactly looking at faster compression as the goal here,
but thought the info might be useful to you or someone working in this
space.

On Tue, Mar 26, 2013 at 11:42 AM, Randall Farmer &lt;randall(a)wawd.com&gt; wrote:

...
  Sorry for e-mailing off list initially, and thanks for
the reply. Had seen
 the multistream dumps but didn't know at all about those scripts.

 Are you trying to let people update dumps of just the current revisions of
 pages (pages-articles), or the dumps with full edit histories
 (pages-meta-history) as well? On first read I thought you meant the former,
 then I thought the latter, now not sure.

 Either way, you raise an interesting point about efficiently handling
 something like a bulk category update.

 On Tue, Mar 26, 2013 at 1:54 AM, Ariel T. Glenn &lt;ariel(a)wikimedia.org&gt;wrote;wrote:

  Ok, my 'reply all' is failing me in this
mail user agent. Anyways, third
 time's a charm...

 ---------- Forwarded message ----------
 From: "Ariel T. Glenn" &lt;ariel(a)wikimedia.org&gt;
 To: Randall Farmer &lt;randall(a)wawd.com&gt;
 Cc:
 Date: Tue, 26 Mar 2013 09:33:25 +0200
 Subject: Re: [Xmldatadumps-l] possible gsoc idea, comments?
 Woops, forgot to send this to the list.  Also forgot to add the footnote
 so doing that.

 A.

 Στις 26-03-2013, ημέρα Τρι, και ώρα 09:18 +0200, ο/η Ariel T. Glenn
 έγραψε:
  Στις 25-03-2013, ημέρα Δευ, και ώρα 23:36 -0700,
ο/η Randall Farmer
 έγραψε: 
   This
isn't exactly what you're looking for, but I've been playing
 around on my own time with how to keep a dump that's compressed but
 also allows some random access. Last weekend I ended up writing the
 attached script, which takes an XML file and makes a simple gzipped,
 indexed, sort-of-random-access dump:

 - Each article is individually gzipped, then the files are
 concatenated.
 - gunzip -c [files] will still stream every page if your tools like
 that.
 - I split the dump into 8 files, matching the core count of the EC2
 instance running the job.
 - It generated a text index (title, redirect dest., gzip file number,
 offset, length) you could load into memory or a database.

 It took about 90 minutes for the gzipping/indexing, and the result was
 about 20 GB for enwiki. I used gzip compression level 1, because I was
 impatient. :)

 I can share an EC2 disk snapshot with the actual dump reformatted this
 way, if that's at all interesting to you. 
 This was the idea behind the bz2 multistream dumps and the associated
 python scripts for poking around in them [1].  I made a different choice
 in the tradeoff between space and performance, compressing 100 pages
 together.

 It would be nice to have a place we could put community-generated data
 (new formats, subsets of the data, etc) for other folks to re-use. Maybe
 we could convince a mirror site to host such things.  (Any takers?)

  I haven't written anything that tries to
update, but:

 - You could replace an article by just appending the compressed
 version somewhere and updating the index.
   (For convenience, you might want to write a list of
 'holes' (byte-ranges for replaced articles) somewhere.) 
 I assume we need indexing for fast rerieval.

  - You could truly delete an old revision by
writing zeroes over it, if
 that's a concern. 
 Yes, we would have to actually delete content (includes the title and
 associated information, if the page has been deleted).  This means that
 we won't have anything like 'undeletion', we'll just be adding seemingly
 new page contents if a page is restored.

  - Once you do that, streaming all the XML
requires a script smart
 enough to skip the holes. 
 Yes, we need a script that will not only skip the holes but write out
 the pages and revisions 'in canonical order', which, if free blocks are
 reused, might differ considerably from the order they are stored in this
 new format.

 If we compress multiple items (revision texts) togther in order to not
 completely lose on disk space, consider this scenario:  a bot goes
 through and edits all aprticles in Category X moving them to Category Y.
 If we compress all revisions of a page together we are going to do a lot
 of uncompression in order to fold those changes in.  This argues for
 compressing multiple revisions together ordered simply by revision id,
 to minimize the uncmpression and recompression of unaltered material
 during the update.  But this needs more though.

  - Once you have that script, you could use it to
produce a
 'cleaned' (i.e., holeless) copy of the dump periodically.

 Like you say, better formats are possible: you could compress articles
 in batches, reuse holes for new blocks, etc. Updating an article
 becomes a less trivial operation when you do all of that. The point
 here is just that a relatively simple format can do a few of the key
 things you want.

 FWIW, I think the really interesting part of your proposal might be
 how to package daily changes--filling the gaps in adds/changes dumps.
 Working on the parsing/indexing/compressing script, I sort of wrestled
 with whether what I was doing was substantially better than just
 loading the articles into a database as blobs. I'm still not sure. But
 more complete daily patch info is useful no matter how the data is
 stored. 
 Heh yes, I was hoping that we could make use of the  adds/changes
 material somehow to get this going :-)

 On the other hand, potentially in favor of making a dynamic dump
 format, it could be a platform for your "reference implementation" of
 an incremental update script. That is, Wikimedia publishes some
 scripts that will apply updates to a dump in a particular format, in
 hopes that random users can adapt them from updating a dump to
 updating a MySQL database, MongoDB cluster, or whatever other
 datastore they use.

 This is harder because while we could possibly generate insert
 statements for the new rows in the page, revision and text tables and
 maybe even a series of delete statements, we can't do much for the other
 tables like pagelinks, categorylinks and so on.  I haven't even tried to
 get my head around that problem, that's for further down the road.

 > Anyway, sorry for the scattered thoughts and hope some of this is
 > useful or at least thought-provoking.
 >
 >
 > Randall 
 Keep those scattered thoughts coming!

 Ariel

 [1]

 http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000606.html

  > On Mon, Mar 25, 2013 at 4:22 AM, Ariel T.
Glenn &lt;ariel(a)wikimedia.org&gt;
 > wrote:
 >         So I was thinking about things I can't undertake, and one of
 >         those
 >         things is the 'dumps 2.0' which has been rolling around in the
 >         back of
 >         my mind.  The TL;DR version is: sparse compressed archive
 >         format that
 >         allows folks to add/subtract changes to it random-access
 >         (including
 >         during generation).
 >
 >         See here:
 >
 > 
https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_du…
  >
 >         What do folks think? Workable? Nuts? Low priority? Interested?
 >
 >         Ariel 

 _______________________________________________
 Xmldatadumps-l mailing list
 Xmldatadumps-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [Xmldatadumps-l] [Fwd: Re: possible gsoc idea, comments?]