Knowing little about the current dump generation process, but some
about terabyte-scale data handling (actually, we here are well into
the petabyte range by now;-), how about this:
* Set up the usual MySQL replication slave
* At a chosen point in time, disconnect it from the MySQL master, but
leave it running in read-only mode
* Use that as the dump base
This should result in a single-point-in-time snapshot. It will also
reduce the load on the rest of the system. Not sure if IDs will change
internally, though.
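The freeze step could be sketched roughly like this (the host name and
the plain `mysql` client invocation are assumptions; the statements
themselves are standard MySQL, but the details depend on the actual
setup):

```python
# Minimal sketch of freezing a replica to use as a dump base.
# Host and credentials are hypothetical; adapt to the real servers.
FREEZE_STATEMENTS = [
    "STOP SLAVE;",                 # stop applying updates from the master
    "SET GLOBAL read_only = ON;",  # reject writes from ordinary clients
]

def freeze_command(host):
    """Build a mysql CLI invocation that freezes the replica on `host`."""
    return ["mysql", "--host", host, "-e", " ".join(FREEZE_STATEMENTS)]

# e.g. hand freeze_command("db-replica") to subprocess.run()
```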
Independently of that:
* Run several parallel processes on several servers (assuming we have several)
* Each process generates the complete history dump of a single
article, or a small group of them, bzip2-compressed to save
intermediate disk space
* Success/failure is checked, so each process can be rerun if needed
* At the end, all these files are appended into a single bzip2/7zip file
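A per-article worker along those lines, sketched in Python (the
`fetch_history` stand-in is hypothetical; the real thing would read
every revision from the frozen replica):

```python
import bz2
import os
from concurrent.futures import ProcessPoolExecutor

def fetch_history(title):
    # Hypothetical stand-in: the real version would pull all revisions
    # of `title` from the read-only replica.
    return ("== %s ==\n...all revisions...\n" % title).encode("utf-8")

def dump_article(title, outdir="dumps"):
    """Dump one article's full history to its own bzip2 file.

    Idempotent, so a failed title can simply be handed to a worker again.
    """
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, title + ".bz2")
    data = fetch_history(title)
    with bz2.open(path, "wb") as f:
        f.write(data)
    # Success check: re-read and compare before declaring victory
    with bz2.open(path, "rb") as f:
        return f.read() == data

def dump_all(titles):
    """Run the workers in parallel; return the titles needing a rerun."""
    with ProcessPoolExecutor() as pool:
        results = zip(titles, pool.map(dump_article, titles))
    return [title for title, ok in results if not ok]
```

Any titles that come back failed just go through `dump_all` again until
the list is empty; the surviving .bz2 files then get appended into the
final archive.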
This will need more disk space while the entire thing is running, as
small text files compress less well than large ones. It also eats
more CPU cycles: first for starting all these processes, and then for
re-bzip2ing the intermediate files.
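One mitigating detail for the bzip2 case: independently compressed
bzip2 files can be byte-wise concatenated into a valid multi-stream
archive, so the final merge need not re-compress at all (7zip would
still need a real re-pack). A quick demonstration:

```python
import bz2

# Two articles compressed independently, as the per-article workers would do
part_a = bz2.compress(b"history of article A\n")
part_b = bz2.compress(b"history of article B\n")

# Plain concatenation is itself a valid bzip2 file: decompressors
# (bzcat, Python's bz2, ...) read multi-stream input transparently
combined = part_a + part_b
assert bz2.decompress(combined) == b"history of article A\nhistory of article B\n"
```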
But it is a lot less error-prone (if a process, or a bunch of them,
fails, just restart it), and it scales better (just throw more
machines at it to make it faster, or use the apaches during low-traffic
hours). Individual processes should be less memory-intensive, so
several of them can run on the same machine.
My 2c
Magnus