Tim Starling wrote:
Di (rut) wrote:
Dear all, especially Anthony and Platonides,
I hope the heroic duo will give their blessing to this post.
You're welcome to kick in. :)
I'm not techie, so why hasn't it been possible to produce a non-corrupt dump that includes history in such a long time? A professor of mine asked whether the problem could be man(person)-power, and whether it would be interesting/useful to have the university help out with a programmer to get the dump done.
In my opinion, it would be a lot easier to generate a full dump if it were split into multiple XML files per wiki. Then the job could be checkpointed at the file level. Checkpoint/resume is quite difficult with the current single-file architecture.
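To make the idea concrete, here is a minimal sketch of file-level checkpointing; dump_pages(), the chunk size and the file naming are all made up for illustration, not taken from the real dump scripts:

#!/usr/bin/env python
"""Sketch of file-level checkpoint/resume for a split dump.
dump_pages() stands in for whatever actually serializes pages to XML."""
import os

CHUNK_SIZE = 100000  # pages per chunk; arbitrary for this sketch

def dump_pages(wiki, start_id, end_id, path):
    """Placeholder: write pages start_id..end_id of `wiki` as XML to `path`."""
    raise NotImplementedError

def dump_wiki(wiki, max_page_id, out_dir):
    for start in range(1, max_page_id + 1, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE - 1, max_page_id)
        final = os.path.join(out_dir, "%s-pages-%d-%d.xml" % (wiki, start, end))
        if os.path.exists(final):
            continue  # chunk finished in an earlier run: this is the resume point
        tmp = final + ".tmp"
        dump_pages(wiki, start, end, tmp)
        os.rename(tmp, final)  # atomic rename marks the chunk as complete

If the job dies, restarting it only redoes the chunk that was in progress, instead of the whole wiki.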
I made a proposal along those lines last month: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/34547 You're also welcome to comment on it ;) Although the main open point seems to be whether the compression of the split files is good enough... The acceptable compression level varies with things like the disk space the WMF has available for dumps and the need to have a better dump system.
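The size/CPU trade-off is easy to measure on a sample chunk; a rough illustration (the sample file name is hypothetical, and this uses plain bzip2 levels, not the actual dump pipeline):

"""Compare bzip2 compression levels on one sample chunk: size vs. time."""
import bz2, time

def try_levels(path):
    data = open(path, 'rb').read()
    for level in (1, 6, 9):
        start = time.time()
        size = len(bz2.compress(data, level))
        print("level %d: %d bytes, %.1fs" % (level, size, time.time() - start))

try_levels('sample-pages.xml')  # hypothetical sample chunk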
Anthony wrote:
So if the files are ordered by title and then by revision time, there should be a whole lot of chunks which don't need to be uncompressed/recompressed every dump, and from what I've read compression is the current bottleneck.
The backup is based on having the pages sorted by id. Moreover, even if you changed that (i.e. rewrote most of the code), you would still need to insert into the middle of the file whenever a page gets a new revision.
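For what it's worth, the reuse only works at some chunk granularity; with made-up helper names, the decision logic would look roughly like this, and any chunk whose page range got a new revision still has to be rebuilt and recompressed:

"""Sketch of the chunk-reuse idea; Chunk and the two helpers are invented,
only the decision logic matters here."""
from collections import namedtuple

# One entry per compressed chunk of the previous dump, covering a
# contiguous range of the sort order.
Chunk = namedtuple('Chunk', ['first_page', 'last_page', 'latest_rev_time'])

def reuse_compressed(chunk):
    pass  # placeholder: copy the old compressed file unchanged

def recompress(chunk):
    pass  # placeholder: re-serialize and recompress the whole range

def build_dump(chunks, last_dump_time):
    for chunk in chunks:
        if chunk.latest_rev_time <= last_dump_time:
            reuse_compressed(chunk)  # no new revisions in this range
        else:
            # A new revision lands in the middle of the sorted stream,
            # so this chunk changes and must be rebuilt from scratch.
            recompress(chunk)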