Brion Vibber wrote:
Dorożyński Janusz wrote:
So, is there any chance that people can get .sql dumps from download.wikimedia.org? XML dumps are completely useless for them.
No you can't get SQL dumps. :)
- The schema and compression formats keep changing, which breaks
things for people trying to get at the data.
- There is no longer any equivalent to the "cur table" for
current-revision-only SQL dumps.
[...snip...]
If you like you can use the mwdumper tool to convert the XML dumps to local-import-friendly SQL instead of using importDump.php (which as you note needs a bug fix).
Can I please make a suggestion? Can the XML format be run through the mwdumper (or equivalent), and the result SQL _of that process_ be compressed and uploaded to the database dump site? That way everything can change from a MediaWiki perspective, and it won't make any difference to whether or not the SQL dumps can be created (as long as the XML dumps can be created, the SQL ones can too). Please spare a thought for those of us who don't care for XML religion, and who simply want to get the data into a database.
Also, can we please have back the "is_redirect" field in the XML (and XML->SQL) output, that used to be in the cur SQL dump? ( Yes, I know I can generate it myself, but it is useful data, and may well be useful to many people - making each and every one of those users independently generate this info seems counterproductive).
A diff of a page's XML might look like this:
==================================================
  <page>
    <title>AccessibleComputing</title>
    <id>10</id>
    <revision>
      <id>15898945</id>
      <timestamp>2003-04-25T22:18:38Z</timestamp>
      <contributor>
        <username>Ams80</username>
        <id>7543</id>
      </contributor>
      <minor />
+     <redirect />
      <comment>Fixing redirect</comment>
      <text xml:space="preserve">#REDIRECT [[Accessible_computing]]</text>
    </revision>
  </page>
==================================================
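For anyone who needs the flag before it is in the dumps, here is a rough sketch of generating it locally. It is only an illustration, not what MediaWiki or mwdumper actually do: the function names (is_redirect, redirect_flags) are made up for this example, it assumes one revision per page (as in the current-revision dumps), and it only recognises the English "#REDIRECT" keyword, not any localised aliases.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a redirect page's text starts with "#REDIRECT" (any case),
# optionally preceded by whitespace. Localised keywords are not handled.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)


def is_redirect(wikitext):
    """Return True if the wikitext looks like a redirect."""
    return bool(REDIRECT_RE.match(wikitext or ""))


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def redirect_flags(xml_source):
    """Yield (title, is_redirect) for each <page> in a dump.

    xml_source may be a filename or a binary file object. The dump is
    streamed, so the whole file is never held in memory at once.
    """
    title = None
    text = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # With several revisions per page, the last one seen wins.
            text = elem.text
        elif name == "page":
            yield title, is_redirect(text)
            title = text = None
            elem.clear()  # free the memory used by the finished page
```

Usage would be something like "for title, flag in redirect_flags('20050909_pages_current.xml'): ...", writing the flag into whatever local table you import into.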
Jakob Voss wrote:
When I tried to parse the current German XML dump I discovered the following malformed sequence (in [[de:India]]):
[[got:��...
I got similar errors on EN running "xmllint 20050909_pages_current.xml" on Debian Linux. Xmllint seems to be a quick way to test the validity of the XML dump.
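If xmllint is not at hand, roughly the same well-formedness check can be sketched in Python with the standard library's expat bindings. This is only a sketch: the function name first_xml_error and the chunk size are my own choices, and like a plain xmllint run it checks well-formedness (including invalid byte sequences), not validity against a schema.

```python
import xml.parsers.expat


def first_xml_error(stream, chunk_size=1 << 20):
    """Return (line, column, message) for the first well-formedness
    error in a binary stream, or None if the whole stream parses.

    The stream is fed to expat in chunks, so a multi-gigabyte dump
    does not need to fit in memory.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.Parse(chunk, False)
        parser.Parse(b"", True)  # signal end of input
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, err.offset, str(err)
    return None
```

Usage: open the dump with open(path, "rb") and pass the file object in; a None result means expat read it to the end without complaint.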
All the best, Nick. (aka EN user:Nickj).
Nick Jenkins wrote:
Brion Vibber wrote:
If you like you can use the mwdumper tool to convert the XML dumps to local-import-friendly SQL instead of using importDump.php (which as you note needs a bug fix).
Can I please make a suggestion? Can the XML format be run through the mwdumper (or equivalent), and the result SQL _of that process_ be compressed and uploaded to the database dump site?
This would triple the disk space requirements for the data dumps (quadruple after the next major upgrade, quintuple the time after that...), and maybe a couple people might use some of them every once in a while.
One of the many reasons behind moving to the new dump format is so we don't _have_ to do that; you can transform to whatever local format you need. (And we provide software for you to do that if you like.)
Also, can we please have back the "is_redirect" field in the XML (and XML->SQL) output, that used to be in the cur SQL dump? ( Yes, I know I can generate it myself, but it is useful data, and may well be useful to many people - making each and every one of those users independently generate this info seems counterproductive).
Hmm, can probably do that yeah.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org