On Nov 18, 2007 3:33 PM, Platonides <Platonides@gmail.com> wrote:
> Anthony wrote:
> > So if the files are ordered by title and then by revision time, there should be a whole lot of chunks which don't need to be uncompressed and recompressed every dump, and from what I've read, compression is the current bottleneck.
>
> The backup is based on having it sorted by id. Moreover, even changing that (i.e., rewriting most of the code), you'd need to insert in the middle whenever a page gets a new revision.
It's sorted by page_id, so it's fine. It would probably benefit from rewriting the code, though, at least porting it to C.
You'd rewrite the entire file, just not recompress all of it. Partial chunks (such as the one at the end of a page_id that gained new revisions) would have to be decompressed and recompressed, but fortunately the bzip2 spec allows for small chunks.
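Here's a minimal sketch of the idea, assuming each page (or group of pages) is compressed as its own bzip2 stream and the streams are simply concatenated, which standard bunzip2 accepts as one file. The function and cache names are hypothetical, not anything in the current dump code:

```python
import bz2

def build_dump(pages, prev_chunks):
    """Concatenate per-page bzip2 streams, reusing unchanged ones.

    pages       -- iterable of (page_id, latest_rev_id, xml_bytes),
                   ordered by page_id (names are illustrative)
    prev_chunks -- dict: page_id -> (latest_rev_id, compressed bytes)
                   saved from the previous dump run
    """
    out = bytearray()
    new_chunks = {}
    for page_id, rev_id, xml in pages:
        prev = prev_chunks.get(page_id)
        if prev is not None and prev[0] == rev_id:
            # No new revisions: copy the old compressed stream verbatim,
            # skipping compression entirely for this page.
            chunk = prev[1]
        else:
            # New or changed page: only this page gets recompressed.
            chunk = bz2.compress(xml)
        new_chunks[page_id] = (rev_id, chunk)
        out += chunk
    return bytes(out), new_chunks
```

The trade-off is that many small independent streams compress a bit worse than one big stream, which is where the "small chunks" point comes in.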
If I have free time some weekend, I'll throw together a proof of concept. But for now, I think the more pressing issue is allowing resumption of broken dumps.
As for rsync, I don't see the point. The HTTP protocol already allows random file access via Range requests.
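For example, a minimal sketch of grabbing an arbitrary byte range over HTTP, assuming the server honors Range headers (the URL below is purely illustrative):

```python
import urllib.request

def fetch_range(url, start, end):
    """Fetch bytes [start, end] of a remote file via an HTTP Range request."""
    req = urllib.request.Request(
        url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        # 206 Partial Content means the server honored the Range header;
        # a 200 would mean it sent the whole file instead.
        assert resp.status == 206, "server ignored the Range header"
        return resp.read()

# e.g. the first 4 KiB of a dump file (hypothetical URL):
# data = fetch_range("https://example.org/pages-meta-history.xml.bz2", 0, 4095)
```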