Hi Brian, Brion once explained to me that the post-processing of the dump is
the main bottleneck.
Compressing articles with tens of thousands of revisions is a major resource
drain.
Right now every dump is even compressed twice: into bzip2 (for wider
platform compatibility) and into 7zip format (for roughly 20 times smaller
downloads). This may no longer be needed, as 7zip has presumably gained
better support on major platforms over the years.
Apart from that the job could gain from parallelization and better error
recovery.
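
For illustration only, here is a minimal sketch of what per-chunk parallel
compression could look like, assuming the dump were already split into chunk
files. The filenames, pool size, and output naming below are hypothetical
assumptions, not how the actual dump scripts work:

# Sketch: compress pre-split dump chunks in parallel with bzip2.
# Chunk filenames and worker count are illustrative assumptions.
import bz2
import glob
from multiprocessing import Pool

def compress_chunk(path):
    """Read one uncompressed chunk and write a .bz2 copy next to it."""
    with open(path, "rb") as src, bz2.open(path + ".bz2", "wb") as dst:
        for block in iter(lambda: src.read(1 << 20), b""):  # 1 MiB blocks
            dst.write(block)
    return path

if __name__ == "__main__":
    chunks = sorted(glob.glob("enwiki-pages-meta-history.chunk*.xml"))
    with Pool(processes=4) as pool:  # e.g. one worker per core
        for done in pool.imap_unordered(compress_chunk, chunks):
            print("compressed", done)

Splitting the work per chunk would also help error recovery, since a failed
chunk could be redone on its own instead of restarting the whole dump.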
Erik Zachte
________________________________________
I am still quite shocked at the amount of time the English Wikipedia takes
to dump, especially since we seem to have close links to folks who work at
MySQL. To me it seems that one of two things must be the case:
1. Wikipedia has outgrown MySQL, in the sense that while we can put data
in, we cannot get it all back out.
2. Despite aggressive hardware purchases over the years, the correct
hardware has still not been purchased.
I wonder which of these is the case. Presumably #2?
Cheers,
Brian