Yuvi Panda wrote:
Hi, I'm Yuvi, a student looking forward to working with MediaWiki via this year's GSoC.
I want to work on something dump-related, and have been bugging apergos (Ariel) for a while now. One of the things that popped into my head is moving the dump process to another language (say, C# or Java, or be very macho and do C++ or C). This would give the dump process quite a speed boost (the profiling I did[1] seems to indicate that the DB is not the bottleneck, though I might be wrong), and it could also be done in a way that makes running distributed dumps easier and more elegant.
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
P.S. I'm just looking for ideas, so if you have specific improvements to the dumping process in mind, please respond with those too. I already have DistributedBZip2 and Incremental Dumps in mind :)
Thanks :)
An idea I have been pondering is to pass the compressor the offset of the previous revision, so that it would have far less searching to do within its compression window. You would need something like 7z/xz, so that the window can be big enough to contain at least the latest revision (its compression factor is quite impressive, too: 1TB down to 2.31GB). Note that I haven't checked how feasible such a modification to the compressor would be.
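To illustrate the window-size point (not the offset-passing idea itself, which would need changes inside the compressor), here is a rough Python sketch using the standard lzma module. The two "revisions" are synthetic random bytes, and the sizes, the make_revisions and xz_size helpers, and the two dictionary sizes are all assumptions picked for the demo, not anything taken from the real dump code.

```python
import lzma
import random


def make_revisions(size=200_000, seed=0):
    """Build two near-identical 'revisions' from synthetic, incompressible data."""
    rng = random.Random(seed)
    first = rng.getrandbits(8 * size).to_bytes(size, "little")
    # The second revision is the first one plus a small edit at the end.
    second = first + b" a small edit"
    return first, second


def xz_size(data, dict_size):
    """Return the size of data compressed as .xz with the given LZMA2 dictionary size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


first, second = make_revisions()
stream = first + second  # two consecutive revisions, as in a full-history dump

# Window (dictionary) smaller than one revision: the previous copy is out of reach.
small_window = xz_size(stream, 64 * 1024)
# Window larger than one revision: the previous copy can be referenced directly.
large_window = xz_size(stream, 1024 * 1024)

print(f"raw stream:   {len(stream)} bytes")
print(f"64 KiB dict:  {small_window} bytes")
print(f"1 MiB dict:   {large_window} bytes")
```

With a dictionary smaller than a revision, the second copy has to be stored more or less in full; once the dictionary covers the previous revision, it should shrink to little more than the edit. Real wikitext is compressible on its own, so the gap on actual dumps will look different from this synthetic case.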