Tomasz Finc wrote:
I've started drafting some new ideas at http://wikitech.wikimedia.org/view/Data_dump_redesign
of the various problems that we're facing and what kind of job management we can put around it. We're taking this on as a full "should have been done 2 years ago" project and I'm going to be shepherding this along.
Right now I'm collecting stats about the throughput of the components to see how much in parallel this could be farmed out in a job management system.
This is a large project that has some distinct problem areas that we'll be isolating and welcoming help on.
--tomasz
Quite interesting. Can the images at office.wikimedia.org be moved to somewhere public?
Decompression takes as long as compression with bzip2
I think decompression is *faster* than compression; see http://tukaani.org/lzma/benchmarks
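This should also be easy to check locally. Below is a rough sketch, assuming Python's standard bz2 module and a synthetic, highly repetitive sample. The names time_bzip2 and SAMPLE are made up for illustration, and real dump XML will give different numbers, so treat the output as a sanity check rather than a benchmark.

```python
import bz2
import time


def time_bzip2(data: bytes, level: int = 9):
    """Return (compress_seconds, decompress_seconds, compressed_size) for data."""
    start = time.perf_counter()
    compressed = bz2.compress(data, compresslevel=level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = bz2.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    # Guard against timing a broken round trip.
    if restored != data:
        raise ValueError("round trip did not reproduce the input")
    return compress_seconds, decompress_seconds, len(compressed)


# Hypothetical stand-in for dump content; swap in a slice of a real dump file.
SAMPLE = b"<page><title>Example</title><text>Some wiki text here.</text></page>\n" * 20000

if __name__ == "__main__":
    c, d, size = time_bzip2(SAMPLE)
    print(f"input:      {len(SAMPLE)} bytes")
    print(f"compressed: {size} bytes")
    print(f"compress:   {c:.3f} s")
    print(f"decompress: {d:.3f} s")
```

Running it against a chunk of an actual pages dump, rather than the repeated sample, would give numbers that are more relevant to the redesign.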
Let me know if I can help with anything.