Sorry for e-mailing off list initially, and thanks for the reply. Had seen the multistream dumps but didn't know at all about those scripts.

Are you trying to let people update dumps of just the current revisions of pages (pages-articles), or the dumps with full edit histories (pages-meta-history) as well? On first read I thought you meant the former, then I thought the latter, now not sure.

Either way, you raise an interesting point about efficiently handling something like a bulk category update.


On Tue, Mar 26, 2013 at 1:54 AM, Ariel T. Glenn <ariel@wikimedia.org> wrote:
Ok, my 'reply all' is failing me in this mail user agent. Anyways, third
time's a charm...



---------- Forwarded message ----------
From: "Ariel T. Glenn" <ariel@wikimedia.org>
To: Randall Farmer <randall@wawd.com>
Cc: 
Date: Tue, 26 Mar 2013 09:33:25 +0200
Subject: Re: [Xmldatadumps-l] possible gsoc idea, comments?
Woops, forgot to send this to the list.  Also forgot to add the footnote
so doing that.

A.

On Tue, 26-03-2013, at 09:18 +0200, Ariel T. Glenn wrote:
> On Mon, 25-03-2013, at 23:36 -0700, Randall Farmer wrote:
> > This isn't exactly what you're looking for, but I've been playing
> > around on my own time with how to keep a dump that's compressed but
> > also allows some random access. Last weekend I ended up writing the
> > attached script, which takes an XML file and makes a simple gzipped,
> > indexed, sort-of-random-access dump:
> >
> >
> > - Each article is individually gzipped, then the files are
> > concatenated.
> > - gunzip -c [files] will still stream every page if your tools like
> > that.
> > - I split the dump into 8 files, matching the core count of the EC2
> > instance running the job.
> > - It generated a text index (title, redirect dest., gzip file number,
> > offset, length) you could load into memory or a database.
> >
> >
> >
> > It took about 90 minutes for the gzipping/indexing, and the result was
> > about 20 GB for enwiki. I used gzip compression level 1, because I was
> > impatient. :)
> >
> >
> > I can share an EC2 disk snapshot with the actual dump reformatted this
> > way, if that's at all interesting to you.
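
A minimal sketch of the layout described above, not the attached script itself: each page is compressed as its own gzip member, the members are concatenated, and an index records where each one starts. It assumes a single output file and an in-memory index, and it leaves out the redirect and file-number columns; `build_dump` and `read_page` are hypothetical names used only for illustration.

```python
import gzip
import io

def build_dump(pages, level=1):
    """Compress each (title, text) pair as its own gzip member and concatenate.

    Returns the dump bytes and an index mapping title -> (offset, length).
    """
    out = io.BytesIO()
    index = {}
    for title, text in pages:
        member = gzip.compress(text.encode("utf-8"), compresslevel=level)
        index[title] = (out.tell(), len(member))
        out.write(member)
    return out.getvalue(), index

def read_page(dump, index, title):
    """Random access: decompress only the member the index points at."""
    offset, length = index[title]
    return gzip.decompress(dump[offset:offset + length]).decode("utf-8")

pages = [("Apple", "<page>Apple text</page>\n"),
         ("Banana", "<page>Banana text</page>\n")]
dump, index = build_dump(pages)

# One page by title, without touching the rest of the dump.
banana = read_page(dump, index, "Banana")

# The concatenation is still a valid multi-member gzip stream,
# so decompressing the whole thing yields every page in order.
everything = gzip.decompress(dump).decode("utf-8")
```

Because concatenated gzip members form a valid gzip stream, tools that just want to stream the whole dump do not need the index at all.
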
>
> This was the idea behind the bz2 multistream dumps and the associated
> python scripts for poking around in them [1].  I made a different choice
> in the tradeoff between space and performance, compressing 100 pages
> together.
>
> It would be nice to have a place we could put community-generated data
> (new formats, subsets of the data, etc) for other folks to re-use. Maybe
> we could convince a mirror site to host such things.  (Any takers?)
>
> > I haven't written anything that tries to update, but:
> >
> >
> > - You could replace an article by just appending the compressed
> > version somewhere and updating the index.
> >   (For convenience, you might want to write a list of
> > 'holes' (byte-ranges for replaced articles) somewhere.)
>
> I assume we need indexing for fast retrieval.
>
> > - You could truly delete an old revision by writing zeroes over it, if
> > that's a concern.
>
> Yes, we would have to actually delete content (includes the title and
> associated information, if the page has been deleted).  This means that
> we won't have anything like 'undeletion', we'll just be adding seemingly
> new page contents if a page is restored.
>
> > - Once you do that, streaming all the XML requires a script smart
> > enough to skip the holes.
>
> Yes, we need a script that will not only skip the holes but write out
> the pages and revisions 'in canonical order', which, if free blocks are
> reused, might differ considerably from the order they are stored in this
> new format.
>
> If we compress multiple items (revision texts) together in order to not
> completely lose on disk space, consider this scenario:  a bot goes
> through and edits all articles in Category X moving them to Category Y.
> If we compress all revisions of a page together we are going to do a lot
> of uncompression in order to fold those changes in.  This argues for
> compressing multiple revisions together ordered simply by revision id,
> to minimize the uncompression and recompression of unaltered material
> during the update.  But this needs more thought.
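
To make the Category X to Category Y scenario a bit more concrete, here is a rough counting model, not a measurement. It assumes that new revisions always get ids higher than any existing revision, that a block must be fully uncompressed and recompressed whenever a revision is added to it, and that blocks in the revision-id layout hold a fixed number of revisions. The function names and the numbers (1000 pages, 5 revisions each, a bot touching 200 pages, blocks of 100) are made up for illustration.

```python
def rewritten_blocks_by_page(existing, new_revs):
    """One block per page: every edited page's block has to be rewritten."""
    pages_with_blocks = {page_id for page_id, _ in existing}
    edited_pages = {page_id for page_id, _ in new_revs}
    return len(edited_pages & pages_with_blocks)

def rewritten_blocks_by_rev_id(existing, new_revs, batch_size):
    """Blocks filled in revision-id order: new revisions go at the end.

    Only a partially filled final block (if any) needs rewriting;
    everything else lands in brand-new blocks.
    """
    # Assumption: new revision ids are all higher than existing ones.
    assert min(r for _, r in new_revs) > max(r for _, r in existing)
    return 1 if len(existing) % batch_size else 0

# (page_id, rev_id) pairs: 1000 pages with 5 revisions each.
existing = [(page, page * 5 + n) for page in range(1000) for n in range(5)]

# A bot adds one new revision to each of 200 pages.
new_revs = [(page, 5000 + page) for page in range(200)]

by_page = rewritten_blocks_by_page(existing, new_revs)
by_rev_id = rewritten_blocks_by_rev_id(existing, new_revs, batch_size=100)
```

Under these assumptions the per-page layout rewrites 200 existing blocks while the revision-id layout rewrites none. The model ignores read patterns, which may well favour keeping a page's revisions together.
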
>
> > - Once you have that script, you could use it to produce a
> > 'cleaned' (i.e., holeless) copy of the dump periodically.
> >
> > Like you say, better formats are possible: you could compress articles
> > in batches, reuse holes for new blocks, etc. Updating an article
> > becomes a less trivial operation when you do all of that. The point
> > here is just that a relatively simple format can do a few of the key
> > things you want.
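
The update steps in the list above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not an existing tool: the whole dump sits in memory as a `bytearray`, the index is a plain dict of title -> (offset, length), and all function names are hypothetical.

```python
import gzip

def add_page(dump, index, title, text):
    """Append a freshly compressed member and point the index at it."""
    member = gzip.compress(text.encode("utf-8"), compresslevel=1)
    index[title] = (len(dump), len(member))
    dump.extend(member)

def replace_page(dump, index, holes, title, text):
    """Zero the old member, record its byte range as a hole, append the new one."""
    offset, length = index[title]
    dump[offset:offset + length] = bytes(length)  # truly delete old content
    holes.append((offset, length))
    add_page(dump, index, title, text)

def stream_pages(dump, index):
    """Yield page texts in storage order, reading only indexed ranges."""
    for offset, length in sorted(index.values()):
        yield gzip.decompress(bytes(dump[offset:offset + length])).decode("utf-8")

def compact(dump, index):
    """Build a holeless copy by copying only the live members."""
    new_dump, new_index = bytearray(), {}
    for title, (offset, length) in sorted(index.items(), key=lambda kv: kv[1]):
        new_index[title] = (len(new_dump), length)
        new_dump.extend(dump[offset:offset + length])
    return new_dump, new_index

dump, index, holes = bytearray(), {}, []
add_page(dump, index, "Apple", "v1 of Apple")
add_page(dump, index, "Banana", "v1 of Banana")
replace_page(dump, index, holes, "Apple", "v2 of Apple")

live_pages = list(stream_pages(dump, index))
clean, clean_index = compact(dump, index)
```

Note that `stream_pages` walks the dump in storage order, so the replaced page comes out last. A script that has to emit pages in canonical order would need to sort on something like page id instead, which this sketch does not track.
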
> >
> >
> >
> > FWIW, I think the really interesting part of your proposal might be
> > how to package daily changes--filling the gaps in adds/changes dumps.
> > Working on the parsing/indexing/compressing script, I sort of wrestled
> > with whether what I was doing was substantially better than just
> > loading the articles into a database as blobs. I'm still not sure. But
> > more complete daily patch info is useful no matter how the data is
> > stored.
>
> Heh yes, I was hoping that we could make use of the adds/changes
> material somehow to get this going :-)
>
> >
> > On the other hand, potentially in favor of making a dynamic dump
> > format, it could be a platform for your "reference implementation" of
> > an incremental update script. That is, Wikimedia publishes some
> > scripts that will apply updates to a dump in a particular format, in
> > hopes that random users can adapt them from updating a dump to
> > updating a MySQL database, MongoDB cluster, or whatever other
> > datastore they use.
> >
>
> This is harder because while we could possibly generate insert
> statements for the new rows in the page, revision and text tables and
> maybe even a series of delete statements, we can't do much for the other
> tables like pagelinks, categorylinks and so on.  I haven't even tried to
> get my head around that problem, that's for further down the road.
>
> > Anyway, sorry for the scattered thoughts and hope some of this is
> > useful or at least thought-provoking.
> >
> >
> > Randall

Keep those scattered thoughts coming!

Ariel

[1]
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000606.html

> > On Mon, Mar 25, 2013 at 4:22 AM, Ariel T. Glenn <ariel@wikimedia.org>
> > wrote:
> >         So I was thinking about things I can't undertake, and one of
> >         those
> >         things is the 'dumps 2.0' which has been rolling around in the
> >         back of
> >         my mind.  The TL;DR version is: sparse compressed archive
> >         format that
> >         allows folks to add/subtract changes to it random-access
> >         (including
> >         during generation).
> >
> >         See here:
> >
> >         https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dumps
> >
> >         What do folks think? Workable? Nuts? Low priority? Interested?
> >
> >         Ariel



_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l