On 9/30/07, Thomas Dalton <thomas.dalton@gmail.com> wrote:
> Can Brion or Tim give us more detail on why the dumps are failing?
> That's the key question. Without knowing exactly what the problem is, it's very hard to come up with solutions.
Yes, certainly: to fix the failing dumps we need to know the details of the problem and, if possible, its cause.
But what Luca proposed may be interesting for other reasons too.
The idea of merging separately produced parts made me wonder whether we could avoid doing the full dump every time, and instead reuse previous dumps (or, more reasonably, parts of them) to create the new one.
Unfortunately I do not know much about the dump process, so what follows are only scattered considerations.
The easiest case is that of pages that have not been modified. Can the old dump be reused for them? The biggest gain would come if the pages were grouped into sets, each set dumped to a separate file, and I could know (how?) that no page in a particular set had been modified: the old file could then be reused for the whole set.
But even if a page was edited, can its old revisions be taken from an old dump (or from a partial file of a previous dump)?
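To make the "reuse the old file for an untouched set" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is my assumption, not how the real dump scripts work: pages are grouped into sets by page id (the hypothetical chunk_size), and for each page we somehow know the time of its latest edit. It also ignores pages deleted since the previous dump, which would need separate handling.

```python
from datetime import datetime

def chunk_of(page_id, chunk_size):
    # Hypothetical grouping: pages 0..999 go to set 0, 1000..1999 to set 1, ...
    return page_id // chunk_size

def plan_chunks(last_touched, prev_dump_time, chunk_size=1000):
    """Decide, per set of pages, whether the old file can be reused.

    last_touched maps page_id -> datetime of the page's latest edit
    (an assumed input; where it comes from is an open question).
    """
    plan = {}
    for page_id, touched in last_touched.items():
        chunk = chunk_of(page_id, chunk_size)
        dirty = touched > prev_dump_time
        # One modified page is enough to force the whole set to be redumped.
        if dirty or plan.get(chunk) == "redump":
            plan[chunk] = "redump"
        else:
            plan[chunk] = "reuse"
    return plan

if __name__ == "__main__":
    prev = datetime(2007, 9, 10)
    pages = {
        1: datetime(2007, 9, 1),     # untouched since the previous dump
        2: datetime(2007, 9, 20),    # edited after the previous dump
        1500: datetime(2007, 8, 1),  # untouched, lives in another set
    }
    print(plan_chunks(pages, prev))
```

How large a set should be is a trade-off I cannot judge: small sets are reused more often, but mean many more files to manage.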
What made me curious about this is not just improving and speeding up the dump process on the wiki server, but also finding a way to reduce how much has to be downloaded.
So far the discussion has been about creating partial dumps and then merging them into a full dump. One could also look for a way to let users download just the modified sets instead of the full dump and, going further along this path, perhaps just a diff.
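On the download side, one way this could look is sketched below in Python. It assumes something that does not exist today as far as I know: that each dump is published as several files together with a manifest mapping file name to checksum. The file names and checksums in the example are made up.

```python
def files_to_fetch(local, published):
    """Names of files that are new or whose checksum changed.

    Both arguments are hypothetical manifests: dict of file name -> checksum.
    """
    return sorted(
        name for name, digest in published.items()
        if local.get(name) != digest
    )

def files_to_drop(local, published):
    """Names of local files that are no longer part of the published dump."""
    return sorted(name for name in local if name not in published)

if __name__ == "__main__":
    # What the user already has from the previous dump.
    local = {"part-000.xml.bz2": "aaa", "part-001.xml.bz2": "bbb"}
    # What the server offers now: part-001 changed, part-002 is new.
    published = {
        "part-000.xml.bz2": "aaa",
        "part-001.xml.bz2": "ccc",
        "part-002.xml.bz2": "ddd",
    }
    print(files_to_fetch(local, published))
    print(files_to_drop(local, published))
```

This only covers "download just the modified sets". A real diff within a file would save more, but is a harder problem.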
And I am not speaking only of the full "All pages with complete edit history" dump. Other dumps can also be rather large to download. For instance, the en wikipedia dump of the current versions of "Articles, templates, image descriptions, and primary meta-pages" is now 2.8 GB. That is not at all a small file, especially for someone with limited internet access.
Of course I fully understand that none of this is easy to implement (for instance, dealing with deleted revisions may be tricky), but before discussing the difficulties I would like to know whether you consider these objectives interesting.
Regards,
AnyFile