Re: [Xmldatadumps-l] [Fwd: Re: possible gsoc idea, comments?]

26 Mar 2013

Sorry for e-mailing off list initially, and thanks for the reply. Had seen
the multistream dumps but didn't know at all about those scripts.
Are you trying to let people update dumps of just the current revisions of
pages (pages-articles), or the dumps with full edit histories
(pages-meta-history) as well? On first read I thought you meant the former,
then I thought the latter, now not sure.
Either way, you raise an interesting point about efficiently handling
something like a bulk category update.
On Tue, Mar 26, 2013 at 1:54 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
...
Ok, my 'reply all' is failing me in this mail user agent. Anyways, third
time's a charm...
---------- Forwarded message ----------
From: "Ariel T. Glenn" ariel@wikimedia.org
To: Randall Farmer randall@wawd.com
Cc:
Date: Tue, 26 Mar 2013 09:33:25 +0200
Subject: Re: [Xmldatadumps-l] possible gsoc idea, comments?
Woops, forgot to send this to the list.  Also forgot to add the footnote
so doing that.
A.
On Tue, 26-03-2013 at 09:18 +0200, Ariel T. Glenn wrote:
...
On Mon, 25-03-2013 at 23:36 -0700, Randall Farmer wrote:
...
This isn't exactly what you're looking for, but I've been playing
around on my own time with how to keep a dump that's compressed but
also allows some random access. Last weekend I ended up writing the
attached script, which takes an XML file and makes a simple gzipped,
indexed, sort-of-random-access dump:

* Each article is individually gzipped, then the files are concatenated.
* gunzip -c [files] will still stream every page if your tools like that.
* I split the dump into 8 files, matching the core count of the EC2
  instance running the job.
* It generated a text index (title, redirect dest., gzip file number,
  offset, length) you could load into memory or a database.

It took about 90 minutes for the gzipping/indexing, and the result was
about 20 GB for enwiki. I used gzip compression level 1, because I was
impatient. :)
I can share an EC2 disk snapshot with the actual dump reformatted this
way, if that's at all interesting to you.
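
A minimal sketch of the layout described above, not the attached script itself: each page becomes its own gzip member, the members are concatenated, and an index records where each one starts. The function names and the sample pages are hypothetical, and the index here keeps only title, offset and length (the redirect destination and file number are left out for brevity).

```python
import gzip
import io

def write_indexed_dump(pages, out):
    """Write each (title, xml_text) page as its own gzip member.

    Returns a list of (title, offset, length) index entries.
    """
    index = []
    for title, xml_text in pages:
        # Level 1 favours speed over ratio, as in the run described above.
        blob = gzip.compress(xml_text.encode("utf-8"), compresslevel=1)
        index.append((title, out.tell(), len(blob)))
        out.write(blob)
    return index

def read_page(f, offset, length):
    """Fetch one page by seeking straight to its gzip member."""
    f.seek(offset)
    return gzip.decompress(f.read(length)).decode("utf-8")

# Hypothetical sample data standing in for real <page> elements.
pages = [("Apple", "<page><title>Apple</title></page>"),
         ("Banana", "<page><title>Banana</title></page>")]
dump = io.BytesIO()
index = write_indexed_dump(pages, dump)

title, offset, length = index[1]
banana = read_page(dump, offset, length)

# Concatenated gzip members still decompress as one stream, like gunzip -c.
whole = gzip.decompress(dump.getvalue()).decode("utf-8")
```

The same file therefore serves both access patterns: tools that want a stream read it front to back, and tools that want one article seek to its offset.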
This was the idea behind the bz2 multistream dumps and the associated
python scripts for poking around in them [1].  I made a different choice
in the tradeoff between space and performance, compressing 100 pages
together.
It would be nice to have a place we could put community-generated data
(new formats, subsets of the data, etc) for other folks to re-use. Maybe
we could convince a mirror site to host such things.  (Any takers?)
...
I haven't written anything that tries to update, but:

* You could replace an article by just appending the compressed version
  somewhere and updating the index. (For convenience, you might want to
  write a list of 'holes' (byte-ranges for replaced articles) somewhere.)
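
As a rough illustration of that update step, assuming an in-memory index that maps a title to an (offset, length) pair: the new copy is appended, the index is repointed, and the old byte range is remembered as a hole. `replace_page` and the sample pages are hypothetical names, not part of any existing script.

```python
import gzip
import io

def replace_page(f, index, holes, title, new_xml):
    """Append a new compressed copy of `title` and repoint the index at it."""
    blob = gzip.compress(new_xml.encode("utf-8"), compresslevel=1)
    f.seek(0, io.SEEK_END)
    new_offset = f.tell()
    f.write(blob)
    if title in index:
        # The old (offset, length) range is now dead space.
        holes.append(index[title])
    index[title] = (new_offset, len(blob))

# Build a tiny dump by hand, then replace one page.
dump, index, holes = io.BytesIO(), {}, []
replace_page(dump, index, holes, "Apple", "<page>Apple v1</page>")
replace_page(dump, index, holes, "Banana", "<page>Banana v1</page>")
old_entry = index["Apple"]
replace_page(dump, index, holes, "Apple", "<page>Apple v2</page>")

# Reading through the index returns the new copy.
offset, length = index["Apple"]
dump.seek(offset)
current = gzip.decompress(dump.read(length)).decode("utf-8")
```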
I assume we need indexing for fast retrieval.
...

* You could truly delete an old revision by writing zeroes over it, if
  that's a concern.
Yes, we would have to actually delete content (includes the title and
associated information, if the page has been deleted).  This means that
we won't have anything like 'undeletion', we'll just be adding seemingly
new page contents if a page is restored.
...

* Once you do that, streaming all the XML requires a script smart
  enough to skip the holes.
Yes, we need a script that will not only skip the holes but write out
the pages and revisions 'in canonical order', which, if free blocks are
reused, might differ considerably from the order they are stored in this
new format.
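
One way such a script could look, sketched under the assumption that the index maps a page id to an (offset, length) pair: walking the index in id order gives the canonical order whatever the storage order is, and holes are skipped because no index entry points at them. The helper names are hypothetical.

```python
import gzip
import io

def stream_canonical(f, index):
    """Yield page XML in page-id order, reading only indexed regions."""
    for page_id in sorted(index):
        offset, length = index[page_id]
        f.seek(offset)
        yield gzip.decompress(f.read(length)).decode("utf-8")

def append(f, text):
    """Append one gzip member and return its (offset, length)."""
    blob = gzip.compress(text.encode("utf-8"))
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(blob)
    return offset, len(blob)

dump, index = io.BytesIO(), {}
index[1] = append(dump, "<page>1 v1</page>")
index[2] = append(dump, "<page>2</page>")
hole = index[1]
index[1] = append(dump, "<page>1 v2</page>")  # page 1 now stored after page 2

# Truly delete the old copy by overwriting it with zeroes.
dump.seek(hole[0])
dump.write(b"\0" * hole[1])

out = list(stream_canonical(dump, index))
```

Note that once a range has been zeroed, a plain gunzip -c over the whole file can no longer be relied on, which is why the streaming has to go through the index.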
If we compress multiple items (revision texts) together in order to not
completely lose on disk space, consider this scenario:  a bot goes
through and edits all articles in Category X, moving them to Category Y.
If we compress all revisions of a page together, we are going to do a lot
of uncompression in order to fold those changes in.  This argues for
compressing multiple revisions together ordered simply by revision id,
to minimize the uncompression and recompression of unaltered material
during the update.  But this needs more thought.
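
A back-of-the-envelope version of that argument, with made-up numbers: suppose the bot touches 10,000 pages and a compressed block holds 100 revisions. The model is deliberately crude (it assumes the new revisions start on a fresh block and that each page's latest revisions sit in a single block), so treat it as an illustration rather than a measurement.

```python
import math

def blocks_touched_grouped_by_page(pages_edited):
    # Each edited page has its own block(s); the one holding its newest
    # revisions must be uncompressed and recompressed.
    return pages_edited

def blocks_touched_ordered_by_rev_id(pages_edited, revs_per_block):
    # New revisions get the highest ids, so they land together at the tail
    # and only freshly written blocks are involved.
    return math.ceil(pages_edited / revs_per_block)

pages_edited = 10_000      # hypothetical size of Category X
revs_per_block = 100       # hypothetical block size

by_page = blocks_touched_grouped_by_page(pages_edited)
by_rev = blocks_touched_ordered_by_rev_id(pages_edited, revs_per_block)
```

Under these assumptions, grouping by page reopens 10,000 existing blocks, while ordering by revision id writes 100 new ones and leaves old material alone.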
...

* Once you have that script, you could use it to produce a
  'cleaned' (i.e., holeless) copy of the dump periodically.
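
A sketch of that cleanup pass, again assuming an index of key to (offset, length): live gzip members are copied byte-for-byte into a new file, so nothing is recompressed and the result is once more a plain concatenation that streams end to end. `compact` and the sample data are hypothetical.

```python
import gzip
import io

def compact(src, index, dst):
    """Copy only live blobs from `src` to `dst`; return the new index."""
    new_index = {}
    for key in sorted(index):
        offset, length = index[key]
        src.seek(offset)
        new_index[key] = (dst.tell(), length)
        dst.write(src.read(length))  # copied as-is, no recompression
    return new_index

# A source file with a zeroed hole between two live pages.
a = gzip.compress(b"<page>A</page>")
b = gzip.compress(b"<page>B</page>")
src = io.BytesIO()
src.write(a)
src.write(b"\0" * 17)
src.write(b)
index = {"A": (0, len(a)), "B": (len(a) + 17, len(b))}

dst = io.BytesIO()
new_index = compact(src, index, dst)

# The cleaned copy decompresses as one stream again.
streamed = gzip.decompress(dst.getvalue())
```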
Like you say, better formats are possible: you could compress articles
in batches, reuse holes for new blocks, etc. Updating an article
becomes a less trivial operation when you do all of that. The point
here is just that a relatively simple format can do a few of the key
things you want.
FWIW, I think the really interesting part of your proposal might be
how to package daily changes--filling the gaps in adds/changes dumps.
Working on the parsing/indexing/compressing script, I sort of wrestled
with whether what I was doing was substantially better than just
loading the articles into a database as blobs. I'm still not sure. But
more complete daily patch info is useful no matter how the data is
stored.
Heh yes, I was hoping that we could make use of the adds/changes
material somehow to get this going :-)
...
On the other hand, potentially in favor of making a dynamic dump
format, it could be a platform for your "reference implementation" of
an incremental update script. That is, Wikimedia publishes some
scripts that will apply updates to a dump in a particular format, in
hopes that random users can adapt them from updating a dump to
updating a MySQL database, MongoDB cluster, or whatever other
datastore they use.
This is harder because while we could possibly generate insert
statements for the new rows in the page, revision and text tables and
maybe even a series of delete statements, we can't do much for the other
tables like pagelinks, categorylinks and so on.  I haven't even tried to
get my head around that problem, that's for further down the road.
...
Anyway, sorry for the scattered thoughts and hope some of this is
useful or at least thought-provoking.
Randall
Keep those scattered thoughts coming!
Ariel
[1]
http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000606.html
...
...
On Mon, Mar 25, 2013 at 4:22 AM, Ariel T. Glenn ariel@wikimedia.org
wrote:
    So I was thinking about things I can't undertake, and one of those
    things is the 'dumps 2.0' which has been rolling around in the back of
    my mind.  The TL;DR version is: sparse compressed archive format that
    allows folks to add/subtract changes to it random-access (including
    during generation).
    See here:

https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dum...
...
...
    What do folks think? Workable? Nuts? Low priority? Interested?

    Ariel

Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
