Hello!
While working on my improvements to MediaWiki Import&Export, I've discovered a feature that is totally new to me: the two-phase backup dump. That is, the first-pass dumper creates an XML file without page texts, and the second-pass dumper adds the page texts.
I have several questions about it: what is it intended for? Is it a sort of optimisation for large databases, and why was this method of optimisation chosen?
Also, does anyone use it? (does Wikimedia use it?)
On 11/21/12 1:54 PM, vitalif@yourcmc.ru wrote:
While working on my improvements to MediaWiki Import&Export, I've discovered a feature that is totally new to me: the two-phase backup dump. That is, the first-pass dumper creates an XML file without page texts, and the second-pass dumper adds the page texts.
I have several questions about it: what is it intended for? Is it a sort of optimisation for large databases, and why was this method of optimisation chosen?
Also, does anyone use it? (does Wikimedia use it?)
I'm not sure if this is the reason it was created, but one useful outcome is that Wikimedia can make the output of both passes available at dumps.wikimedia.org. This can be useful for researchers (myself included), because the metadata-only (pass 1) dump is sufficient for doing some kinds of analyses, while being *much* smaller than the full dump.
-Mark
On Wed, Nov 21, 2012 at 4:54 AM, vitalif@yourcmc.ru wrote:
Hello!
While working on my improvements to MediaWiki Import&Export, I've discovered a feature that is totally new to me: the two-phase backup dump. That is, the first-pass dumper creates an XML file without page texts, and the second-pass dumper adds the page texts.
I have several questions about it: what is it intended for? Is it a sort of optimisation for large databases, and why was this method of optimisation chosen?
While generating a full dump, we're holding the database connection open.... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
* the DB server needs to maintain a consistent snapshot of the data as of when we started the connection, so it's doing extra work to keep old data around
* the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! you just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs while fetching actual text, which is immutable (thus no worries about consistency in the second pass).
We definitely use this system for Wikimedia's data dumps!
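Roughly, the two passes are run like this from the maintenance scripts (I'm writing the options from memory, so the exact names may differ a bit between versions):

  # Pass 1: write a "stub" dump with page/revision metadata only, so the
  # consistent snapshot is finished and the DB connection released quickly.
  php maintenance/dumpBackup.php --full --stub --quiet > stub.xml

  # Pass 2: fill in the revision text for each <revision> in the stub.
  # This step can crash and be restarted without redoing the snapshot.
  php maintenance/dumpTextPass.php --stub=file:stub.xml --quiet > pages-full.xml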
-- brion
Brion Vibber wrote 2012-11-21 23:20:
While generating a full dump, we're holding the database connection open.... for a long, long time. Hours, days, or weeks in the case of English Wikipedia.
There are two issues with this:
* the DB server needs to maintain a consistent snapshot of the data as of when we started the connection, so it's doing extra work to keep old data around
* the DB connection needs to actually remain open; if the DB goes down or the dump process crashes, whoops! you just lost all your work.
So, grabbing just the page and revision metadata lets us generate a file with a consistent snapshot as quickly as possible. We get to let the databases go, and the second pass can die and restart as many times as it needs while fetching actual text, which is immutable (thus no worries about consistency in the second pass).
We definitely use this system for Wikimedia's data dumps!
Oh, thanks, now I understand! But revisions are also immutable - wouldn't it be simpler just to select the maximum revision ID at the beginning of the dump and discard newer page and image revisions during dump generation?
Also, I have the same question about the 'spawn' feature of backupTextPass.inc :) What is it intended for? :)
On Wed, Nov 21, 2012 at 12:31 PM, vitalif@yourcmc.ru wrote:
Oh, thanks, now I understand! But revisions are also immutable - wouldn't it be simpler just to select the maximum revision ID at the beginning of the dump and discard newer page and image revisions during dump generation?
Page history structure isn't quite immutable; revisions may be added or deleted, pages may be renamed, etc etc.
Also, I have the same question about the 'spawn' feature of backupTextPass.inc :) What is it intended for? :)
Shelling out to an external process means when that process dies due to a dead database connection etc, we can restart it cleanly.
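Again roughly, and treating the exact option names as approximate, the text pass with a spawned fetcher looks like:

  # --spawn makes dumpTextPass.php load revision text through a separate PHP
  # subprocess; if that helper dies (lost DB connection etc.) it is simply
  # respawned and the dump keeps going instead of starting over.
  php maintenance/dumpTextPass.php --stub=file:stub.xml --spawn=/usr/bin/php \
      --quiet > pages-full.xml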
-- brion
Page history structure isn't quite immutable; revisions may be added or deleted, pages may be renamed, etc etc.
Shelling out to an external process means when that process dies due to a dead database connection etc, we can restart it cleanly.
Brion, thanks for clarifying it.
Also, I want to ask you and the other developers about the idea of packing the export XML file, along with all exported uploads, into a ZIP archive (instead of embedding them in the XML as base64) - what do you think about it? We use it in our MediaWiki installations ("mediawiki4intranet") and find it quite convenient. Actually, ZIP was Tim Starling's idea; before ZIP we used rather strange "multipart/related" archives (I don't know why we did that :))
I want to try to get this change reviewed at last... What do you think about it?
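To illustrate the idea (the file names here are just an example, not necessarily the exact layout we use in mediawiki4intranet), the archive could look something like this:

  # Illustrative layout of such an export archive:
  #   Revisions.xml            - the usual export XML; <upload> entries point
  #                              at the files below instead of inline base64
  #   files/Example_image.png
  #   files/Some_document.pdf
  zip -r wiki-export.zip Revisions.xml files/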
Other improvements include advanced page selection (based on namespaces, categories, dates, imagelinks, templatelinks and pagelinks) and an advanced import report (including some sort of "conflict detection"). Should I split them into separate patches in Gerrit for ease of review?
Also, do all the archiving methods (7z) really need to be built into Export.php as dump filters (especially when using ZIP)? With simple XML dumps you could just pipe the output to a compressor.
Or are they really needed to save temporary disk space during export? I ask because my version of import/export does not build the archive on the fly - it puts all the contents into a temporary directory and then archives the whole thing. Is that an acceptable approach?
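For comparison, here is what I mean by piping versus the built-in output sinks (option names from memory, so treat them as approximate):

  # Using the built-in dump filter / output sink:
  php maintenance/dumpBackup.php --full --output=bzip2:dump.xml.bz2

  # Piping the plain XML to an external compressor instead:
  php maintenance/dumpBackup.php --full --quiet | 7za a -si dump.xml.7z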
On 11/25/12 22:16, vitalif@yourcmc.ru wrote:
Also, I want to ask you and the other developers about the idea of packing the export XML file, along with all exported uploads, into a ZIP archive (instead of embedding them in the XML as base64) - what do you think about it? We use it in our MediaWiki installations ("mediawiki4intranet") and find it quite convenient. Actually, ZIP was Tim Starling's idea; before ZIP we used rather strange "multipart/related" archives (I don't know why we did that :))
I want to try to get this change reviewed at last... What do you think about it?
Looks like a better solution than base64 files. :)
Other improvements include advanced page selection (based on namespaces, categories, dates, imagelinks, templatelinks and pagelinks) and an advanced import report (including some sort of "conflict detection"). Should I split them into separate patches in Gerrit for ease of review?
I don't see a need to split e.g. templatelinks selection and pagelinks selection. But if you provide a 64K patch, you may have a hard time getting people to review it :) I would probably generate a couple of patches, one with the selection parameters and the other with the advanced report. Depending on how big those changes are, YMMV.
Also, do all the archiving methods (7z) really need to be built into Export.php as dump filters (especially when using ZIP)? With simple XML dumps you could just pipe the output to a compressor.
Or are they really needed to save temporary disk space during export? I ask because my version of import/export does not build the archive on the fly - it puts all the contents into a temporary directory and then archives the whole thing. Is that an acceptable approach?
Probably not the best method, but a suboptimal implementation that works is better than no implementation at all. So go ahead and submit it. We can be picky later, with the actual code in front of us :)
Regards