Ok, my 'reply all' is failing me in this mail user agent. Anyways, third
time's a charm...
---------- Forwarded message ----------
From: "Ariel T. Glenn" <ariel(a)wikimedia.org>
To: Randall Farmer <randall(a)wawd.com>
Date: Tue, 26 Mar 2013 09:33:25 +0200
Subject: Re: [Xmldatadumps-l] possible gsoc idea, comments?
Woops, forgot to send this to the list. Also forgot to add the footnote
so doing that.
On Tue, 26-03-2013 at 09:18 +0200, Ariel T. Glenn wrote:
On Mon, 25-03-2013 at 23:36 -0700, Randall Farmer wrote:
This isn't exactly what you're looking for, but I've been playing
around on my own time with how to keep a dump that's compressed but
also allows some random access. Last weekend I ended up writing the
attached script, which takes an XML file and makes a simple gzipped,
indexed, sort-of-random-access dump:
- Each article is individually gzipped, then the files are
- gunzip -c [files] will still stream every page if your tools like
- I split the dump into 8 files, matching the core count of the EC2
instance running the job.
- It generated a text index (title, redirect dest., gzip file number,
offset, length) you could load into memory or a database.
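
A minimal sketch of the layout described in the list above, not the attached script itself. It assumes a simplified in-memory index of title -> (offset, length) and a single output file; the redirect destination and gzip file number are left out, and the names write_indexed_dump and read_page are made up for illustration.

```python
import gzip
import io

def write_indexed_dump(pages, out):
    """Write each (title, xml_text) as its own gzip member; return the index."""
    index = {}
    for title, xml_text in pages:
        member = gzip.compress(xml_text.encode("utf-8"), compresslevel=1)
        index[title] = (out.tell(), len(member))  # (offset, length) in bytes
        out.write(member)
    return index

def read_page(dump, index, title):
    """Random access: seek to one member and decompress only that member."""
    offset, length = index[title]
    dump.seek(offset)
    return gzip.decompress(dump.read(length)).decode("utf-8")

# Hypothetical sample data standing in for real <page> elements.
pages = [
    ("Apple", "<page><title>Apple</title></page>\n"),
    ("Banana", "<page><title>Banana</title></page>\n"),
]
dump = io.BytesIO()
index = write_indexed_dump(pages, dump)

print(read_page(dump, index, "Banana"))
# Concatenated gzip members still decompress as one stream, like `gunzip -c`.
print(gzip.decompress(dump.getvalue()).decode("utf-8"))
```

Compression level 1 matches the setting mentioned in the message; a real index would more likely live in a text file or database than in a dict.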
It took about 90 minutes for the gzipping/indexing, and the result was
about 20 GB for enwiki. I used gzip compression level 1, because I was
I can share an EC2 disk snapshot with the actual dump reformatted this
way, if that's at all interesting to you.
This was the idea behind the bz2 multistream dumps and the associated
Python scripts for poking around in them. I made a different choice
in the tradeoff between space and performance, compressing 100 pages
It would be nice to have a place we could put community-generated data
(new formats, subsets of the data, etc) for other folks to re-use. Maybe
we could convince a mirror site to host such things. (Any takers?)
I haven't written anything that tries to
- You could replace an article by just appending the compressed
version somewhere and updating the index.
(For convenience, you might want to write a list of
'holes' (byte-ranges for replaced articles) somewhere.)
I assume we need indexing for fast retrieval.
- You could truly delete an old revision by writing zeroes over it, if
that's a concern.
Yes, we would have to actually delete content (includes the title and
associated information, if the page has been deleted). This means that
we won't have anything like 'undeletion', we'll just be adding seemingly
new page contents if a page is restored.
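
The replace-by-append and zero-out ideas from this exchange could look roughly like the following. This is a sketch under stated assumptions: one gzip member per page, an in-memory index of title -> (offset, length), and a plain list of holes. All function names are hypothetical.

```python
import gzip
import io

def append_member(dump, text):
    """Append one gzip member at the end of the dump; return (offset, length)."""
    dump.seek(0, io.SEEK_END)
    offset = dump.tell()
    member = gzip.compress(text.encode("utf-8"), compresslevel=1)
    dump.write(member)
    return offset, len(member)

def replace_page(dump, index, holes, title, new_text):
    """Append the new version, repoint the index, record the old range as a hole."""
    old = index.get(title)
    index[title] = append_member(dump, new_text)
    if old is not None:
        holes.append(old)  # old bytes stay on disk until zeroed or compacted

def delete_page(dump, index, holes, title):
    """Truly delete: overwrite the bytes with zeroes and drop the index entry."""
    offset, length = index.pop(title)
    dump.seek(offset)
    dump.write(b"\0" * length)
    holes.append((offset, length))

def read_page(dump, index, title):
    """Seek to the indexed member and decompress only that member."""
    offset, length = index[title]
    dump.seek(offset)
    return gzip.decompress(dump.read(length)).decode("utf-8")

dump, index, holes = io.BytesIO(), {}, []
replace_page(dump, index, holes, "Apple", "<page>Apple v1</page>")
replace_page(dump, index, holes, "Pear", "<page>Pear v1</page>")
replace_page(dump, index, holes, "Apple", "<page>Apple v2</page>")
delete_page(dump, index, holes, "Pear")

print(read_page(dump, index, "Apple"))
print(holes)
```

Because the title is removed from the index on deletion, a restored page would simply be appended as new content, which matches the "no undeletion" point above.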
- Once you do that, streaming all the XML requires a script smart
enough to skip the holes.
Yes, we need a script that will not only skip the holes but write out
the pages and revisions 'in canonical order', which, if free blocks are
reused, might differ considerably from the order they are stored in this
If we compress multiple items (revision texts) together in order to not
completely lose on disk space, consider this scenario: a bot goes
through and edits all articles in Category X, moving them to Category Y.
If we compress all revisions of a page together, we are going to do a lot
of decompression in order to fold those changes in. This argues for
compressing multiple revisions together ordered simply by revision id,
to minimize the decompression and recompression of unaltered material
during the update. But this needs more thought.
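
To put rough numbers on the bot scenario, here is a small back-of-the-envelope sketch. It assumes fixed-size blocks and that new revisions always get ids higher than every existing revision; the function names and figures are hypothetical, chosen only to illustrate the tradeoff.

```python
def blocks_rewritten_by_page(edited_pages):
    """Per-page blocks: each edited page's block is unpacked and repacked."""
    return len(set(edited_pages))

def blocks_rewritten_by_rev_id(existing_revs, new_revs, block_size):
    """Blocks in revision-id order: new revisions land at the end, so at
    most the partly filled tail block has to be reopened."""
    tail_is_partial = existing_revs % block_size != 0
    return 1 if (new_revs > 0 and tail_is_partial) else 0

edited = range(5000)  # hypothetical: the bot touches 5000 pages in Category X
print(blocks_rewritten_by_page(edited))
print(blocks_rewritten_by_rev_id(1_000_050, len(edited), 100))
```

Under these assumptions the per-page layout reopens one existing block per edited page, while the revision-id layout reopens at most one; the cost moves to reads, since the revisions of a single page are then scattered across blocks.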
- Once you have that script, you could use it to
'cleaned' (i.e., holeless) copy of the dump periodically.
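
A cleaned copy of this kind could be produced along these lines. The sketch assumes an index keyed by page id with (offset, length) values, one gzip member per page, and that "canonical order" means ascending page id; compact and member are made-up names.

```python
import gzip
import io

def compact(dump, index, out):
    """Copy only live members to `out` in ascending key order; return new index."""
    new_index = {}
    for page_id in sorted(index):
        offset, length = index[page_id]
        dump.seek(offset)
        new_index[page_id] = (out.tell(), length)
        out.write(dump.read(length))  # compressed bytes are copied as-is
    return new_index

def member(text):
    """Compress one page as a standalone gzip member."""
    return gzip.compress(text.encode("utf-8"), compresslevel=1)

# Build a small dump where page 1 was replaced, leaving its v1 bytes as a hole.
dump, index = io.BytesIO(), {}
for page_id, text in [
    (2, "<page>two</page>\n"),
    (1, "<page>one v1</page>\n"),
    (1, "<page>one v2</page>\n"),
]:
    data = member(text)
    index[page_id] = (dump.tell(), len(data))  # later entries overwrite earlier
    dump.write(data)

clean = io.BytesIO()
new_index = compact(dump, index, clean)
print(gzip.decompress(clean.getvalue()).decode("utf-8"))
```

Since members are copied without recompression, this pass is mostly I/O; it skips holes implicitly by only following index entries.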
Like you say, better formats are possible: you could compress articles
in batches, reuse holes for new blocks, etc. Updating an article
becomes a less trivial operation when you do all of that. The point
here is just that a relatively simple format can do a few of the key
things you want.
FWIW, I think the really interesting part of your proposal might be
how to package daily changes--filling the gaps in adds/changes dumps.
Working on the parsing/indexing/compressing script, I sort of wrestled
with whether what I was doing was substantially better than just
loading the articles into a database as blobs. I'm still not sure. But
more complete daily patch info is useful no matter how the data is
Heh yes, I was hoping that we could make use of the adds/changes
material somehow to get this going :-)
On the other hand, potentially in favor of making a dynamic dump
format, it could be a platform for your "reference implementation" of
an incremental update script. That is, Wikimedia publishes some
scripts that will apply updates to a dump in a particular format, in
hopes that random users can adapt them from updating a dump to
updating a MySQL database, MongoDB cluster, or whatever other
datastore they use.
This is harder because while we could possibly generate insert
statements for the new rows in the page, revision and text tables and
maybe even a series of delete statements, we can't do much for the other
tables like pagelinks, categorylinks and so on. I haven't even tried to
get my head around that problem, that's for further down the road.
> Anyway, sorry for the scattered thoughts and hope some of this is
> useful or at least thought-provoking.
> On Mon, Mar 25, 2013 at 4:22 AM, Ariel T.
> So I was thinking about things I can't undertake, and one of
> things is the 'dumps 2.0' which has been rolling around in the back of
> my mind. The TL;DR version is: sparse compressed archive format that
> allows folks to add/subtract changes to it random-access
> during generation).
> See here:
> What do folks think? Workable? Nuts? Low priority? Interested?