So I was thinking about things I can't undertake, and one of those things is the 'dumps 2.0' which has been rolling around in the back of my mind. The TL;DR version is: sparse compressed archive format that allows folks to add/subtract changes to it random-access (including during generation).
See here:
https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dum...
What do folks think? Workable? Nuts? Low priority? Interested?
Ariel
This sounds really interesting to me (as in, I would seriously consider applying for this project).
Few questions: Do you think most of this should be written in PHP (since dumpBackup.php is currently in PHP)? Or could it be written in another language (most likely Python)?
The description talks about "smart choice for compression of multiple items together", how would that work with deleting items? Especially with history dumps, I think it would make a lot of sense to use some kind of delta compression (like git's pack files do). But this would cause problems with deleting revisions that other revisions use as a base for their delta (though certainly not unsolvable problems). I guess figuring this out would be a part of the project.
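A minimal sketch of the deletion problem described above, assuming each revision is stored either in full or as a delta against one other revision. The names (`DeltaStore`, `make_delta`, `apply_delta`) are hypothetical, and the delta encoding uses Python's `difflib` purely for illustration; it is not git's pack format. Deleting a revision re-encodes its dependents against the deleted revision's own base, which is one possible policy among several.

```python
import difflib


def make_delta(base: str, target: str) -> list:
    """Encode target as copy/insert operations against base."""
    ops = []
    matcher = difflib.SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no op: that part of the base is simply not copied
    return ops


def apply_delta(base: str, ops: list) -> str:
    """Rebuild the target text from base plus copy/insert operations."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.append(base[op[1]:op[2]])
        else:
            out.append(op[1])
    return "".join(out)


class DeltaStore:
    """Hypothetical store: rev_id -> (base_id or None, payload)."""

    def __init__(self):
        # payload is the full text when base_id is None, else a delta
        self.entries = {}

    def add(self, rev_id, text, base_id=None):
        if base_id is None:
            self.entries[rev_id] = (None, text)
        else:
            delta = make_delta(self.get(base_id), text)
            self.entries[rev_id] = (base_id, delta)

    def get(self, rev_id):
        base_id, payload = self.entries[rev_id]
        if base_id is None:
            return payload
        return apply_delta(self.get(base_id), payload)

    def delete(self, rev_id):
        # Revisions that use rev_id as a base must be re-encoded first
        new_base, _ = self.entries[rev_id]
        dependents = [r for r, (b, _) in self.entries.items() if b == rev_id]
        texts = {r: self.get(r) for r in dependents}
        del self.entries[rev_id]
        for r, text in texts.items():
            # Rebase onto the deleted revision's base, or store in full
            self.add(r, text, base_id=new_base)


store = DeltaStore()
store.add(1, "Hello world")
store.add(2, "Hello wiki world", base_id=1)
store.add(3, "Hello wiki world!", base_id=2)
store.delete(2)      # revision 3 is rebased onto revision 1
print(store.get(3))
```

The cost of `delete` grows with the number of dependents, so how delta chains are laid out affects how expensive removals become.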
Petr Onderka [[en:User:Svick]]
On Mon, Mar 25, 2013 at 12:22 PM, Ariel T. Glenn ariel@wikimedia.org wrote:
So I was thinking about things I can't undertake, and one of those things is the 'dumps 2.0' which has been rolling around in the back of my mind. The TL;DR version is: sparse compressed archive format that allows folks to add/subtract changes to it random-access (including during generation).
See here:
https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dum...
What do folks think? Workable? Nuts? Low priority? Interested?
Ariel
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
On 25-03-2013, Mon, at 23:40 +0100, Petr Onderka wrote:
This sounds really interesting to me (as in, I would seriously consider applying for this project).
Few questions: Do you think most of this should be written in PHP (since dumpBackup.php is currently in PHP)? Or could it be written in another language (most likely Python)?
Well, my thought was that, to the extent it could take output from existing files (adds/changes dumps), it could be written in Python or another language. At least a first take at it would be a separate toolset and not a part of MediaWiki core. Performance would be an issue, at least in part; we want users to be able to do routine things with the new format without taking much of a speed hit compared to the old format.
The description talks about "smart choice for compression of multiple items together", how would that work with deleting items? Especially with history dumps, I think it would make a lot of sense to use some kind of delta compression (like git's pack files do). But this would cause problems with deleting revisions that other revisions use as a base for their delta (though certainly not unsolvable problems). I guess figuring this out would be a part of the project.
Yes, that's exactly right. Having a list of 'free blocks' which have been zeroed out and are reclaimable on the next round of writes, deciding whether or not a sort of 'defrag' would be needed, and so on: these are things that would have to be figured out in the course of development and by testing with real-world data. I wouldn't suggest testing with en wikipedia right away, though; there are other projects with plenty of bot activity and regular editors that would work quite well for that.
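The free-block idea can be sketched roughly as follows, assuming a fixed block size and an in-memory index. `BlockArchive`, `BLOCK_SIZE`, and `free_ratio` are hypothetical names, compression of the payload is left out, and the ratio of free blocks is only one possible signal for deciding when a 'defrag' is worthwhile.

```python
import io

BLOCK_SIZE = 16  # unrealistically small, for illustration only


class BlockArchive:
    """Hypothetical block-based archive with a list of reclaimable blocks."""

    def __init__(self, fileobj):
        self.f = fileobj
        self.num_blocks = 0
        self.free = []    # indices of zeroed, reclaimable blocks
        self.index = {}   # item_id -> (list of block indices, byte length)

    def _write_block(self, block_no, chunk):
        # Pad with zero bytes so every block has the same size on disk
        self.f.seek(block_no * BLOCK_SIZE)
        self.f.write(chunk.ljust(BLOCK_SIZE, b"\0"))

    def put(self, item_id, data: bytes):
        if item_id in self.index:
            self.remove(item_id)
        blocks = []
        for off in range(0, len(data), BLOCK_SIZE):
            if self.free:
                block_no = self.free.pop()   # reclaim a freed block first
            else:
                block_no = self.num_blocks   # otherwise grow the file
                self.num_blocks += 1
            self._write_block(block_no, data[off:off + BLOCK_SIZE])
            blocks.append(block_no)
        self.index[item_id] = (blocks, len(data))

    def get(self, item_id) -> bytes:
        blocks, length = self.index[item_id]
        out = bytearray()
        for block_no in blocks:
            self.f.seek(block_no * BLOCK_SIZE)
            out += self.f.read(BLOCK_SIZE)
        return bytes(out[:length])

    def remove(self, item_id):
        blocks, _ = self.index.pop(item_id)
        for block_no in blocks:
            self._write_block(block_no, b"")  # zero the block out
            self.free.append(block_no)

    def free_ratio(self) -> float:
        # Share of allocated blocks that currently hold no data
        return len(self.free) / self.num_blocks if self.num_blocks else 0.0


arc = BlockArchive(io.BytesIO())
arc.put("a", b"x" * 40)   # occupies three blocks
arc.put("b", b"y" * 10)   # occupies one block
arc.remove("a")           # three blocks become reclaimable
arc.put("c", b"z" * 20)   # reuses two of the freed blocks
print(arc.free_ratio())
```

Because reclaimed blocks are handed out in whatever order they were freed, an item's blocks can end up scattered across the file, which is the situation a 'defrag' pass would address.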
Delta compression was indeed on my mind when I wrote this description, but the devil is in the details :-)
Ariel