They are XML dumps. Why did you say they are semi-useless?
I'm not sure if all the MediaWiki revision table parameters are available in the XML dumps, but most of them are.
2012/5/17 Anthony wikimail@inbox.org
On Thu, May 17, 2012 at 8:01 AM, emijrp emijrp@gmail.com wrote:
We at WikiTeam are uploading wiki dumps to Internet Archive, and recently some official mirrors of Wikimedia dumps (articles + images) are being created around the globe (currently in 3 different locations).
Are these actual database dumps, or are they those semi-useless XML dumps?
On Thu, May 17, 2012 at 8:22 AM, emijrp emijrp@gmail.com wrote:
They are XML dumps. Why did you say they are semi-useless?
Because they are XML dumps, mainly. The data in the WMF database is compressed in a format which can be easily randomly accessed. The dump procedure is to uncompress it, convert it to XML, and then recompress it in a format which can't be easily randomly accessed. The import procedure is to uncompress the "dump", convert it from XML, and then recompress it in a format which is easily randomly accessed.
There are some hacks to get around this with the bz2 version of the "dump", but this is far less efficient than the format which the data already is in before the "dump" process takes place.
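To make the random-access point concrete, here is a minimal sketch of one way such a workaround can work. It assumes each page is compressed as its own independent bz2 stream and that a separate offset index is kept. The function names, the in-memory index, and the sample pages are hypothetical and for illustration only; this is not a description of how the actual dump tooling works.

```python
import bz2


def build_multistream(pages):
    """Compress every page as an independent bz2 stream.

    Returns the concatenated bytes plus an index that maps each title
    to the (offset, length) of its compressed stream. The index layout
    is an assumption made for this sketch.
    """
    blob = bytearray()
    index = {}
    for title, text in pages.items():
        stream = bz2.compress(text.encode("utf-8"))
        index[title] = (len(blob), len(stream))
        blob += stream
    return bytes(blob), index


def read_page(blob, index, title):
    """Decompress only the stream that holds the requested page."""
    offset, length = index[title]
    return bz2.decompress(blob[offset:offset + length]).decode("utf-8")


# Hypothetical sample data standing in for page XML.
pages = {
    "Alpha": "<page><title>Alpha</title></page>",
    "Beta": "<page><title>Beta</title></page>",
    "Gamma": "<page><title>Gamma</title></page>",
}

blob, index = build_multistream(pages)

# Random access: jump straight to one page without touching the others.
print(read_page(blob, index, "Beta"))
```

Without an index of this kind, a reader has to decompress the stream from the start to reach a given page. Compressing pages separately also tends to give a worse compression ratio than one large stream, which is one reason this is less efficient than a storage format designed for random access.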
I'm not sure if all the MediaWiki revision table parameters are available in the XML dumps, but most of them are.
The main problem is that they are compressed in a format which is terrible for actual use. The missing information (mostly indexes) is a secondary problem, however.
On 17 May 2012 13:32, Anthony wikimail@inbox.org wrote:
Because they are XML dumps, mainly. The data in the WMF database is compressed in a format which can be easily randomly accessed.
It's a dump. It's not supposed to be randomly accessed. We're talking about archives, not mirrors.
On Thu, May 17, 2012 at 8:38 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
On 17 May 2012 13:32, Anthony wikimail@inbox.org wrote:
Because they are XML dumps, mainly. The data in the WMF database is compressed in a format which can be easily randomly accessed.
It's a dump.
Not really. Yes, it's called that. And historically, it was that, but the XML "dumps" aren't really dumps at all.
It's not supposed to be randomly accessed. We're talking about archives, not mirrors.
That's why I said they're semi-useless (i.e. half-useless), not useless.
wikimedia-l@lists.wikimedia.org