I've spent more time than I care to admit loading, dumping, reloading, transforming, testing, reloading... various Wikipedia databases before settling on what I think the new format will be, but I made a discovery along the way that might be useful:
The 05/20 database dump from Wikipedia weighs in at close to 600 MB. It turns out that almost 200 MB of that is cache. In the new system, I'll write a function specifically for doing database dumps, but in the meantime I'd suggest that the next time you dump a tarball, clear the cache first (and don't forget to be careful of the timestamps when you do).
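As a rough illustration of the clear-the-cache-first step, here is a minimal Python sketch run against an in-memory SQLite database. The table and column names (articles, cache, timestamp) and the helpers clear_cache and demo are made up for the example; they are not the actual Wikipedia schema or any existing script. The point is only that the cache gets blanked while the stored timestamps are left exactly as they were.

```python
import sqlite3


def clear_cache(conn):
    """Blank the hypothetical `cache` column and leave `timestamp` untouched.

    Assigning `timestamp = timestamp` is a no-op in SQLite. In MySQL, explicitly
    setting an auto-updating TIMESTAMP column to its current value is what stops
    it from being bumped by the UPDATE. (That the real schema has such a column
    is an assumption here.)
    """
    cur = conn.execute(
        "UPDATE articles SET cache = '', timestamp = timestamp WHERE cache <> ''"
    )
    conn.commit()
    return cur.rowcount  # number of rows whose cache was cleared


def demo():
    # Hypothetical schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles ("
        "title TEXT PRIMARY KEY, text TEXT, cache TEXT, timestamp TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?, ?, ?)",
        [
            ("Alpha", "raw text", "<p>rendered</p>", "20020501120000"),
            ("Beta", "more text", "<p>more</p>", "20020510093000"),
            ("Gamma", "uncached", "", "20020519180000"),
        ],
    )
    query_ts = "SELECT title, timestamp FROM articles ORDER BY title"
    before = conn.execute(query_ts).fetchall()
    cleared = clear_cache(conn)
    after = conn.execute(query_ts).fetchall()
    caches = [row[0] for row in conn.execute("SELECT cache FROM articles")]
    conn.close()
    return cleared, before, after, caches


if __name__ == "__main__":
    cleared, before, after, caches = demo()
    print("rows cleared:", cleared)
    print("timestamps unchanged:", before == after)
```

Against a real database you would run the equivalent UPDATE first and only then take the dump, after checking that the timestamps still match what they were beforehand.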
wikitech-l@lists.wikimedia.org