Knowing little about the current dump generation process, but something about terabyte-scale data handling (actually, we here are well into the petabyte range by now ;-), how about this:

* Set up the usual MySQL replication slave
* At a chosen point in time, disconnect it from the MySQL master, but leave it running in read-only mode
* Use that as the dump base
This should result in a single point-in-time snapshot. It will also reduce the load on the rest of the system. Not sure whether IDs will change internally, though.
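To make the freezing step concrete, here is a rough Python sketch; the host name, the credentials, and the use of the pymysql connector are all assumptions on my part, not a description of our actual setup:

    import pymysql

    # Freeze the replication slave so every table reflects the same point
    # in time, then refuse writes while the dump runs.
    conn = pymysql.connect(host="dump-slave.example", user="dumpadmin",
                           password="secret", autocommit=True)
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")                # stop applying updates from the master
        cur.execute("SET GLOBAL read_only = 1")  # reject writes for the duration of the dump
    conn.close()

Bringing the slave back afterwards is just a START SLAVE; it catches up with the master on its own.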
Independently of that:

* Run several parallel processes on several servers (assuming we have several)
* Each process generates the complete history dump of a single article, or of a small group of them, bzip2-compressed to save intermediate disk space
* Success/failure is checked, so each process can be rerun if needed (see the sketch after this list)
* At the end, all these files are appended into a single bzip2/7zip file
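A minimal sketch of the worker side, again in Python; dump_article_history() below is just a stand-in for whatever actually pulls the revisions of one article off the frozen slave:

    import bz2
    import os
    from concurrent.futures import ProcessPoolExecutor

    def dump_article_history(article_id):
        # Placeholder: yield the XML of each revision of one article.
        yield "<page id='%d'>...</page>\n" % article_id

    def dump_one(article_id):
        # Write the complete history of one article into its own .bz2 chunk.
        path = "chunks/%d.xml.bz2" % article_id
        tmp = path + ".tmp"
        try:
            with bz2.open(tmp, "wt", encoding="utf-8") as out:
                for revision_xml in dump_article_history(article_id):
                    out.write(revision_xml)
            os.rename(tmp, path)        # only a finished chunk gets its final name
            return article_id, True
        except Exception:
            return article_id, False    # the caller simply reruns this article id

    def dump_batch(article_ids, workers=8):
        os.makedirs("chunks", exist_ok=True)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(dump_one, article_ids))
        return [aid for aid, ok in results if not ok]   # ids to rerun

Because a chunk is written under a temporary name and only renamed on success, a rerun can skip every article id whose final file already exists.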
This will need more disk space while the whole thing is running, as small text files compress less well than larger ones. It also eats more CPU cycles, both for starting all these processes and for re-bzip2ing the intermediate files at the end.
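The append step at the end can be as cheap as concatenating the chunk streams: bunzip2 happily decompresses concatenated .bz2 streams, so the re-bzip2 pass is only needed if we want one tighter, single-stream archive. A sketch, with made-up paths:

    import glob
    import shutil

    def concatenate_chunks(pattern="chunks/*.xml.bz2",
                           out_path="full-history.xml.bz2"):
        # bzip2 streams can be appended as raw bytes; the result
        # decompresses as one long file.
        with open(out_path, "wb") as out:
            for chunk in sorted(glob.glob(pattern)):
                with open(chunk, "rb") as src:
                    shutil.copyfileobj(src, out)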
On the other hand, it is a lot less error-prone (if a process or a bunch of them fails, just restart them), and it scales better (just throw more machines at it to make it faster, or use the Apaches during low-traffic hours). Individual processes should also be less memory-intensive, so several of them can run on the same machine.
My 2c
Magnus