I see that the latest dump of the English Wikipedia failed (I mean the dump of all the page histories). As part of some other work I am doing, I have efficient code that can "take apart" a dump into its single component pages, and out of that it would be possible to fashion code that "stitches together" various partial dumps. This would allow breaking up a single dump process into multiple, shorter processes, in which for example one dumps only one month's worth, or one week's worth, of revisions to the English Wikipedia. Breaking up the dump process would increase the probability that each of the smaller dumps succeeds. For instance, one could produce all the partial dumps, then launch the stitching process, which produces a single dump, removing duplicate revisions.
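To make the stitching idea concrete, the core of it could look roughly like the sketch below (illustrative only: it ignores the XML namespace used in real dumps and does not regroup revisions under their <page> elements, both of which a real tool would have to handle):

    import xml.etree.ElementTree as ET

    def stitch(partial_dumps, out_path):
        """Merge several partial history dumps, writing each revision only once."""
        seen = set()                                  # revision IDs already written
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("<mediawiki>\n")
            for dump in partial_dumps:
                # Stream each file so we never hold a whole dump in memory.
                for _, elem in ET.iterparse(dump, events=("end",)):
                    if elem.tag != "revision":        # simplified: real dumps use namespaced tags
                        continue
                    rev_id = elem.findtext("id")
                    if rev_id not in seen:
                        seen.add(rev_id)
                        out.write(ET.tostring(elem, encoding="unicode"))
                    elem.clear()                      # free the element as we go
            out.write("</mediawiki>\n")

    # stitch(["enwiki-part1.xml", "enwiki-part2.xml"], "enwiki-full.xml")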
At UCSC, where I work, there are various Master's students looking for projects... and some may be interested in doing work that is concretely useful to Wikipedia. Should I try to get them interested in writing a proper dump stitching tool, and some code to do partial dumps?
Can Brion or Tim give us more detail on why the dumps are failing? Are they already doing partial dumps? Is there already a dump stitching tool? Is there anything that could be done to help the process? I could help by looking for database students in search of a project and giving them my code as a starting point...
Best,
Luca
On 9/30/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Can Brion or Tim give us more detail on why the dumps are failing?
That's the key question. Without knowing exactly what the problem is, it's very hard to come up with solutions.
Yes, surely, to fix the problem of the failing dumps, the details and, if possible, the source of the problem need to be known.
But what Luca proposed may be interesting for other reasons too.
The idea of merging different parts together raised the question for me of whether it is possible, in that way, not to do the full dump every time, but to use previous dumps (or, more reasonably, parts of them) to create the new one.
Unfortunately I do not know much about the dump process, so the following are only sparse considerations.
The easier case is that of pages that have not been modified. Can the old dump be reused for those? The biggest advantage would come if the pages were split into sets, each set dumped to a separate file, and I knew (how?) that all the pages of a particular set were unmodified: in that case the old file could be reused for the whole set.
But even if a page has been edited, can its old revisions be taken from an old dump (or from a partial file of a previous dump)?
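In rough pseudocode, what I have in mind is something like this (only a sketch; how the list of touched pages and the per-page re-dump would actually be obtained on the servers, I do not know):

    def incremental_full_dump(
        old_dump_pages,   # iterable of (page_id, page_xml) read from the previous dump
        touched,          # set of page_ids edited or deleted since that dump was taken
        dump_page,        # function that re-dumps one page's full history from the database
        out,              # output file object for the new full dump
    ):
        """Reuse the previous dump for untouched pages; hit the database only for the rest."""
        seen = set()
        for page_id, old_xml in old_dump_pages:
            seen.add(page_id)
            out.write(dump_page(page_id) if page_id in touched else old_xml)
        for page_id in touched - seen:        # pages created after the previous dump
            out.write(dump_page(page_id))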
And the reason this raised my curiosity is not just to improve and speed up the dumping process on the wiki servers, but also to find a way to reduce the size of what has to be downloaded.
While what has been considered so far is to create partial dumps and then merge them into a full dump, one could instead try to find a way for users to download, instead of the full dump, just the modified set and, going further down this path, just the diff, if that is possible.
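On the user's side the merge itself would be simple, something like the sketch below (assuming revisions are plain records keyed by revision ID; it sidesteps the deletion problem mentioned further down):

    def apply_increment(old_revisions, new_revisions):
        """Merge last month's dump with a small increment file; newer data wins."""
        merged = {rev["id"]: rev for rev in old_revisions}        # revisions the user already has
        merged.update({rev["id"]: rev for rev in new_revisions})  # revisions added since then
        # Deleted revisions are the hard part: the increment would also need to
        # carry a list of revision IDs to drop, which this sketch does not handle.
        return [merged[rev_id] for rev_id in sorted(merged)]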
And I am not speaking only of the full dump of "All pages with complete edit history". Other dumps can also be rather large to download. For instance, the en wikipedia dump of the current versions of "Articles, templates, image descriptions, and primary meta-pages" is now at 2.8 GB: not at all a small file, especially for someone with limited internet access.
Of course I fully understand that none of this is easy to implement (for instance, dealing with deleted revisions may not be easy), but before discussing the difficulties, I would like to know whether you consider these objectives interesting.
Regards AnyFile
The idea of merging different parts together raised the question for me of whether it is possible, in that way, not to do the full dump every time, but to use previous dumps (or, more reasonably, parts of them) to create the new one.
I believe that's how the backups are done. A full backup is taken every day, or whatever, and in between backups the details of every change to the database are stored; then, if something goes wrong, you can restore the last backup and roll forward using the stored changes. I don't know if it's possible to use a similar system for dumps, but it might be worth considering.