Ok, my 'reply all' is failing me in this mail user agent. Anyways, third
time's a charm...
---------- Forwarded message ----------
From: "Ariel T. Glenn" <ariel@wikimedia.org>
To: Randall Farmer <randall@wawd.com>
Cc:
Date: Tue, 26 Mar 2013 09:33:25 +0200
Subject: Re: [Xmldatadumps-l] possible gsoc idea, comments?
Whoops, forgot to send this to the list. Also forgot to add the footnote,
so doing that now.
A.
On Tue, 26-03-2013 at 09:18 +0200, Ariel T. Glenn wrote:
> On Mon, 25-03-2013 at 23:36 -0700, Randall Farmer wrote:
> > This isn't exactly what you're looking for, but I've been playing
> > around on my own time with how to keep a dump that's compressed but
> > also allows some random access. Last weekend I ended up writing the
> > attached script, which takes an XML file and makes a simple gzipped,
> > indexed, sort-of-random-access dump:
> >
> >
> > - Each article is individually gzipped, then the files are
> > concatenated.
> > - gunzip -c [files] will still stream every page if your tools like
> > that.
> > - I split the dump into 8 files, matching the core count of the EC2
> > instance running the job.
> > - It generated a text index (title, redirect dest., gzip file number,
> > offset, length) you could load into memory or a database.
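> >
> > Roughly, reading a single page back is just a seek plus a one-member
> > gunzip. Untested sketch (assume a tab-separated index and data files
> > named pages-N.gz purely for illustration):
> >
> > import gzip
> >
> > def load_index(index_path):
> >     """Index lines: title, redirect dest., gzip file number, offset, length."""
> >     index = {}
> >     with open(index_path, encoding="utf-8") as f:
> >         for line in f:
> >             title, redirect, fileno, offset, length = line.rstrip("\n").split("\t")
> >             index[title] = (redirect, int(fileno), int(offset), int(length))
> >     return index
> >
> > def get_page(index, title, file_pattern="pages-%d.gz"):
> >     redirect, fileno, offset, length = index[title]
> >     if redirect:
> >         return get_page(index, redirect, file_pattern)   # follow the redirect
> >     with open(file_pattern % fileno, "rb") as f:
> >         f.seek(offset)
> >         member = f.read(length)             # one complete gzip member
> >     return gzip.decompress(member).decode("utf-8")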
> >
> >
> >
> > It took about 90 minutes for the gzipping/indexing, and the result was
> > about 20 GB for enwiki. I used gzip compression level 1, because I was
> > impatient. :)
> >
> >
> > I can share an EC2 disk snapshot with the actual dump reformatted this
> > way, if that's at all interesting to you.
>
> This was the idea behind the bz2 multistream dumps and the associated
> python scripts for poking around in them [1]. I made a different choice
> in the tradeoff between space and performance, compressing 100 pages
> together.
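>
> For anyone who hasn't poked at it: the multistream index has one line
> per page, offset:pageid:title, where the offset is the start of the bz2
> stream holding that page and its ~99 neighbours. Getting one stream back
> out is only a few lines of python 3 (rough sketch, not the actual code
> from [1]):
>
> import bz2
>
> def read_stream(dump_path, offset):
>     """Decompress the single bz2 stream starting at byte `offset`."""
>     decomp = bz2.BZ2Decompressor()
>     pieces = []
>     with open(dump_path, "rb") as f:
>         f.seek(offset)
>         while not decomp.eof:
>             chunk = f.read(64 * 1024)
>             if not chunk:
>                 break
>             pieces.append(decomp.decompress(chunk))
>     return b"".join(pieces).decode("utf-8")   # XML for up to 100 <page> elements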
>
> It would be nice to have a place we could put community-generated data
> (new formats, subsets of the data, etc) for other folks to re-use. Maybe
> we could convince a mirror site to host such things. (Any takers?)
>
> > I haven't written anything that tries to update, but:
> >
> >
> > - You could replace an article by just appending the compressed
> > version somewhere and updating the index.
> > (For convenience, you might want to write a list of
> > 'holes' (byte-ranges for replaced articles) somewhere.)
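> >
> > In toy python (untested; say index maps title -> (file number, offset,
> > length) and holes is just a list of freed ranges -- the names are made
> > up for illustration):
> >
> > import gzip
> >
> > def replace_page(index, holes, title, new_text, data_path, fileno):
> >     if title in index:
> >         holes.append(index[title])      # the old (file number, offset, length)
> >     blob = gzip.compress(new_text.encode("utf-8"))
> >     with open(data_path, "ab") as f:
> >         offset = f.seek(0, 2)           # appending, so the end of the file
> >         f.write(blob)
> >     index[title] = (fileno, offset, len(blob))
> >
> > Persisting the index and the holes list after each batch is the boring
> > part I'm waving away here.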
>
> I assume we need indexing for fast retrieval.
>
> > - You could truly delete an old revision by writing zeroes over it, if
> > that's a concern.
>
> Yes, we would have to actually delete content (including the title and
> associated information, if the page has been deleted). This means that
> we won't have anything like 'undeletion'; we'll just be adding seemingly
> new page contents if a page is restored.
>
> > - Once you do that, streaming all the XML requires a script smart
> > enough to skip the holes.
>
> Yes, we need a script that will not only skip the holes but write out
> the pages and revisions 'in canonical order', which, if free blocks are
> reused, might differ considerably from the order they are stored in this
> new format.
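>
> Something like this, I imagine (sketch only; pretend for simplicity that
> each revision is its own gzipped blob and that the index can hand us
> (page id, rev id, offset, length) per revision):
>
> import gzip
>
> def export_canonical(index_entries, data_path, out):
>     """index_entries: iterable of (page_id, rev_id, offset, length)."""
>     with open(data_path, "rb") as f:
>         for page_id, rev_id, offset, length in sorted(index_entries):
>             f.seek(offset)
>             out.write(gzip.decompress(f.read(length)).decode("utf-8"))
>
> Holes never show up in the index, so they get skipped for free; the price
> is the seeking once storage order and canonical order have drifted apart.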
>
> If we compress multiple items (revision texts) together in order not to
> completely lose out on disk space, consider this scenario: a bot goes
> through and edits all articles in Category X, moving them to Category Y.
> If we compress all revisions of a page together, we are going to do a lot
> of uncompression in order to fold those changes in. This argues for
> compressing multiple revisions together ordered simply by revision id,
> to minimize the uncompression and recompression of unaltered material
> during the update. But this needs more thought.
>
> > - Once you have that script, you could use it to produce a
> > 'cleaned' (i.e., holeless) copy of the dump periodically.
> >
> > Like you say, better formats are possible: you could compress articles
> > in batches, reuse holes for new blocks, etc. Updating an article
> > becomes a less trivial operation when you do all of that. The point
> > here is just that a relatively simple format can do a few of the key
> > things you want.
> >
> >
> >
> > FWIW, I think the really interesting part of your proposal might be
> > how to package daily changes--filling the gaps in adds/changes dumps.
> > Working on the parsing/indexing/compressing script, I sort of wrestled
> > with whether what I was doing was substantially better than just
> > loading the articles into a database as blobs. I'm still not sure. But
> > more complete daily patch info is useful no matter how the data is
> > stored.
>
> Heh yes, I was hoping that we could make use of the adds/changes
> material somehow to get this going :-)
>
> >
> > On the other hand, potentially in favor of making a dynamic dump
> > format, it could be a platform for your "reference implementation" of
> > an incremental update script. That is, Wikimedia publishes some
> > scripts that will apply updates to a dump in a particular format, in
> > hopes that random users can adapt them from updating a dump to
> > updating a MySQL database, MongoDB cluster, or whatever other
> > datastore they use.
> >
>
> This is harder: while we could possibly generate insert statements for
> the new rows in the page, revision and text tables, and maybe even a
> series of delete statements, we can't do much for the other tables like
> pagelinks, categorylinks and so on. I haven't even tried to get my head
> around that problem; that's for further down the road.
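>
> For the easy part it might look something like this (toy sketch, columns
> trimmed and escaping hand-waved; the real page/revision/text schema would
> need checking):
>
> def revision_inserts(revs):
>     """revs: iterable of dicts with rev_id, page_id, text_id, timestamp, text."""
>     for r in revs:
>         yield ("INSERT INTO text (old_id, old_text, old_flags) "
>                "VALUES (%(text_id)d, '%(text)s', 'utf-8');" % r)
>         yield ("INSERT INTO revision (rev_id, rev_page, rev_text_id, "
>                "rev_timestamp) VALUES (%(rev_id)d, %(page_id)d, "
>                "%(text_id)d, '%(timestamp)s');" % r)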
>
> > Anyway, sorry for the scattered thoughts and hope some of this is
> > useful or at least thought-provoking.
> >
> >
> > Randall
> Keep those scattered thoughts coming!
> Ariel
[1]
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000606.html
> > On Mon, Mar 25, 2013 at 4:22 AM, Ariel T. Glenn <ariel@wikimedia.org>
> > wrote:
> > So I was thinking about things I can't undertake, and one of those
> > things is the 'dumps 2.0' which has been rolling around in the back of
> > my mind. The TL;DR version is: sparse compressed archive format that
> > allows folks to add/subtract changes to it random-access (including
> > during generation).
> >
> > See here:
> >
> > https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dumps
> >
> > What do folks think? Workable? Nuts? Low priority? Interested?
> >
> > Ariel
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l