On Mon, 25-03-2013, at 23:40 +0100, Petr Onderka wrote:
This sounds really interesting to me (as in, I would
consider applying for this project).
Do you think most of this should be written in PHP (since
dumpBackup.php is currently in PHP)?
Or could it be written in another language (most likely Python)?
Well, my thought was that to the extent it could take output from
existing files (adds/changes dumps) it could be written in Python or
another language. At least a first take at it would be as a separate
toolset and not a part of MediaWiki core. Performance would be an issue
at least in part; we want users to be able to do routine things with the
new format without taking much of a speed hit as compared to the old
The description talks about "smart choice for compression of multiple
items together", how would that work with deleting items?
Especially with history dumps, I think it would make a lot of sense to
use some kind of delta compression (like git's pack files do).
But this would cause problems with deleting revisions that other
revisions use as a base for their delta (though certainly not
I guess figuring this out would be a part of the project.
Yes, that's exactly right. Having a list of 'free blocks' which have
been zeroed out and are reclaimable on the next round of writes,
deciding whether or not a sort of 'defrag' would be needed, etc, these
are things that would have to be figured out in the course of
development and by testing with real-world data. Though I wouldn't
suggest testing with en wikipedia right away, there are other projects
with plenty of bot activity and regular editors that would work quite
well for that.
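
To make the 'free blocks' idea a bit more concrete, here is a rough
sketch in Python. Everything in it is an assumption for illustration
only: the class name SparseArchive, the in-memory bytearray standing in
for the dump file, the first-fit placement, and the wasted_fraction()
measure as one possible input to a 'defrag' decision. None of it is an
existing dumps tool or a settled design.

```python
class SparseArchive:
    """Toy archive: items live in one byte buffer, deletions leave reusable holes."""

    def __init__(self):
        self.data = bytearray()  # stands in for the on-disk archive (assumption)
        self.index = {}          # item_id -> (offset, length)
        self.free = []           # zeroed, reclaimable blocks as (offset, length)

    def write(self, item_id, payload: bytes):
        # First-fit: reuse the first free block that is large enough.
        for i, (off, length) in enumerate(self.free):
            if length >= len(payload):
                self.data[off:off + len(payload)] = payload
                leftover = length - len(payload)
                if leftover:
                    # Keep the unused tail of the block on the free list.
                    self.free[i] = (off + len(payload), leftover)
                else:
                    del self.free[i]
                break
        else:
            # No free block fits, so append at the end of the archive.
            off = len(self.data)
            self.data.extend(payload)
        self.index[item_id] = (off, len(payload))

    def read(self, item_id) -> bytes:
        off, length = self.index[item_id]
        return bytes(self.data[off:off + length])

    def delete(self, item_id):
        # Zero the block out and remember it for the next round of writes.
        off, length = self.index.pop(item_id)
        self.data[off:off + length] = bytes(length)
        self.free.append((off, length))

    def wasted_fraction(self) -> float:
        # Share of the archive sitting in free blocks; a threshold on this
        # could be one trigger for a defrag pass.
        if not self.data:
            return 0.0
        return sum(length for _, length in self.free) / len(self.data)


archive = SparseArchive()
archive.write(1, b"aaaa")
archive.write(2, b"bbbbbb")
archive.write(3, b"cc")
archive.delete(2)            # leaves a 6-byte zeroed hole at offset 4
archive.write(4, b"dddd")    # lands in the hole, 2 bytes stay free
```

Note that this sketch does not merge adjacent free blocks, and it ignores
compression entirely; both would matter with real-world data.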
Delta compression was indeed on my mind when I wrote this description,
but the devil is in the details :-)
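
As an illustration of the deletion problem with deltas, here is a
minimal sketch in Python. It is not how git's pack files work and not a
proposed format: the DeltaStore class, the copy/insert delta encoding
built on difflib, and the 're-encode the dependents against the deleted
revision's own base' policy are all assumptions picked to keep the
example short.

```python
import difflib


def make_delta(base: str, target: str):
    """Encode target as copy/insert operations against base."""
    ops = []
    matcher = difflib.SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: that part of base is just not copied.
    return ops


def apply_delta(base: str, ops) -> str:
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(base[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


class DeltaStore:
    def __init__(self):
        # rev_id -> ("full", text) or ("delta", base_id, ops)
        self.entries = {}

    def add(self, rev_id, text, base_id=None):
        if base_id is None:
            self.entries[rev_id] = ("full", text)
        else:
            ops = make_delta(self.get(base_id), text)
            self.entries[rev_id] = ("delta", base_id, ops)

    def get(self, rev_id) -> str:
        entry = self.entries[rev_id]
        if entry[0] == "full":
            return entry[1]
        _, base_id, ops = entry
        return apply_delta(self.get(base_id), ops)

    def delete(self, rev_id):
        doomed = self.entries[rev_id]
        # Revisions whose delta is expressed against the one being deleted.
        dependents = [r for r, e in self.entries.items()
                      if e[0] == "delta" and e[1] == rev_id]
        # Reconstruct their text while the base still exists.
        texts = {r: self.get(r) for r in dependents}
        # Rebase onto the deleted revision's base, or store in full if none.
        new_base = doomed[1] if doomed[0] == "delta" else None
        del self.entries[rev_id]
        for r in dependents:
            self.add(r, texts[r], base_id=new_base)


store = DeltaStore()
store.add(1, "The quick brown fox.")
store.add(2, "The quick brown fox jumps.", base_id=1)
store.add(3, "The quick brown fox jumps over the dog.", base_id=2)
store.delete(2)   # revision 3 is re-encoded against revision 1
```

The cost shows up in delete(): every dependent has to be read and
rewritten, so how long the delta chains are allowed to get is one of the
details that would need testing.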
On Mon, Mar 25, 2013 at 12:22 PM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> So I was thinking about things I can't undertake, and one of those
> things is the 'dumps 2.0' which has been rolling around in the back of
> my mind. The TL;DR version is: sparse compressed archive format that
> allows folks to add/subtract changes to it random-access (including
> during generation).
> See here:
> What do folks think? Workable? Nuts? Low priority? Interested?
> Xmldatadumps-l mailing list