On Thu, May 17, 2012 at 8:22 AM, emijrp emijrp@gmail.com wrote:
They are XML dumps. Why did you say they are semi-useless?
Because they are XML dumps, mainly. The data in the WMF database is compressed in a format which can be easily randomly accessed. The dump procedure is to uncompress it, convert it to XML, and then recompress it in a format which can't be easily randomly accessed. The import procedure is to uncompress the "dump", convert it from XML, and then recompress it in a format which can once again be easily randomly accessed.
There are some hacks to get around this with the bz2 version of the "dump", but they are far less efficient than working with the format the data is already in before the "dump" process takes place.
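
To illustrate the kind of hack I mean, here is a minimal sketch. It assumes the trick is to treat the bz2 file as a series of independently compressed streams and to keep a side index of byte offsets, so you can seek to one stream and decompress only that. The helper names (build_multistream, read_stream_at) and the tiny sample pages are made up for the example; this is not the actual dump tooling.

```python
import bz2
import io


def build_multistream(chunks):
    """Compress each chunk as its own bz2 stream.

    Returns the concatenated bytes plus the byte offset at which each
    stream starts. The offsets play the role of an external index.
    """
    blob = io.BytesIO()
    offsets = []
    for chunk in chunks:
        offsets.append(blob.tell())
        blob.write(bz2.compress(chunk))
    return blob.getvalue(), offsets


def read_stream_at(blob, offset):
    """Decompress only the single bz2 stream that starts at `offset`."""
    decompressor = bz2.BZ2Decompressor()
    # The decompressor stops at the end of the first stream; whatever
    # follows is left untouched in decompressor.unused_data.
    return decompressor.decompress(blob[offset:])


# Hypothetical sample data standing in for pages in a dump.
pages = [b"<page>A</page>", b"<page>B</page>", b"<page>C</page>"]
blob, offsets = build_multistream(pages)
second_page = read_stream_at(blob, offsets[1])
```

Even with such an index you still have to decompress a whole stream and parse the XML inside it to reach a single revision, which is why this stays a workaround.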
I'm not sure if all the MediaWiki revision table parameters are available in the XML dumps, but most of them are.
The main problem is that they are compressed in a format which is terrible for actual use. The missing information (mostly indexes) is a secondary problem.