[Wikimedia-l] Wikimedia sites not easy to archive (Was Re: Knol is closing tomorrow )

Anthony wikimail at inbox.org
Thu May 17 12:32:10 UTC 2012


On Thu, May 17, 2012 at 8:22 AM, emijrp <emijrp at gmail.com> wrote:
> They are XML dumps. Why did you say they are semi-useless?

Because they are XML dumps, mainly.  The data in the WMF database is
compressed in a format which can be easily randomly accessed.  The
dump procedure is to uncompress it, convert it to XML, and then
recompress it, in a format which can't be easily randomly accessed.
The import procedure is to uncompress the "dump", convert it from XML,
and then recompress it in a format which is easily randomly accessed.
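
To make the difference concrete, here is a minimal sketch.  It is a toy
model, not the actual WMF storage layout or dump schema: the names
`revisions`, `store`, `dump`, `read_from_store` and `read_from_dump` are all
made up for illustration, per-record zlib compression stands in for "a
format which can be easily randomly accessed", and a single bz2 stream of
XML-like text stands in for the "dump".

```python
import bz2
import zlib

# Hypothetical toy data: revision id -> text.
revisions = {i: f"text of revision {i} " * 50 for i in range(1000)}

# Random-access layout: each record is compressed on its own, keyed by id.
store = {rev_id: zlib.compress(text.encode()) for rev_id, text in revisions.items()}

def read_from_store(rev_id):
    """Decompress only the one record that was asked for."""
    return zlib.decompress(store[rev_id]).decode()

# Dump-like layout: every record serialized into one compressed stream.
dump = bz2.compress("".join(
    f"<rev id='{rev_id}'>{text}</rev>\n" for rev_id, text in revisions.items()
).encode())

def read_from_dump(rev_id):
    """Decompress from the start of the stream until the record shows up.

    Returns the text and the number of bytes that had to be decompressed.
    """
    decomp = bz2.BZ2Decompressor()
    marker = f"<rev id='{rev_id}'>".encode()
    buf = b""
    bytes_out = 0
    for pos in range(0, len(dump), 4096):
        chunk = decomp.decompress(dump[pos:pos + 4096])
        bytes_out += len(chunk)
        buf += chunk
        start = buf.find(marker)
        end = buf.find(b"</rev>", start) if start != -1 else -1
        if end != -1:
            return buf[start + len(marker):end].decode(), bytes_out
    raise KeyError(rev_id)
```

Both functions return the same text, but `read_from_dump(999)` has to
decompress everything that precedes revision 999 in the stream, while
`read_from_store(999)` touches one record.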

There are some hacks to get around this with the bz2 version of the
"dump", but they are far less efficient than the format which the data
is already in before the "dump" process takes place.
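
One general shape such a workaround can take, sketched under assumptions:
write many small bz2 streams back to back and keep a separate index of byte
offsets, so a reader can seek to one stream instead of starting from the
beginning.  This is an illustration of the idea only, not a description of
any particular tool or of the real dump files; `pages`, `archive`, `index`
and `read_page` are hypothetical names.

```python
import bz2

# Hypothetical toy pages: title -> text.
pages = {f"Page {i}": f"wikitext for page {i}\n" * 20 for i in range(100)}

# Write each page as its own bz2 stream and remember where it starts.
# Concatenated bz2 streams still form a valid .bz2 file.
archive = bytearray()
index = {}  # title -> (byte offset, compressed length)
for title, text in pages.items():
    blob = bz2.compress(text.encode())
    index[title] = (len(archive), len(blob))
    archive += blob

def read_page(title):
    """Jump to the page's stream and decompress only that stream."""
    offset, length = index[title]
    return bz2.decompress(bytes(archive[offset:offset + length])).decode()
```

The cost is visible even in the toy version: the index has to be built and
shipped alongside the data, and compressing small pieces separately
generally gives a worse ratio than compressing everything as one stream.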

> I'm not sure if all the MediaWiki revision table parameters are available in
> the XML dumps, but most of them are.

The main problem is that they are compressed in a format which is
terrible for actual use.  The missing information (mostly indexes) is
a secondary problem.


