The updated gfdl-wikititle program has been posted to:
http://meta.wikimedia.org/wiki/Gfdl-wikititle
This program is released under the GNU GPL version 3. This release corrects failures with importDump.php by removing or repairing corrupted article titles in the XML dumps provided by the Foundation. At present, two articles in the enwiki-20070206 dump cause importDump.php to fail with NULL title errors in MediaWiki 1.9.3, which prevents the dump from being imported. The problem appears to be related to titles which A) exceed 256 bytes in length and B) also contain multi-part paths. The same bug shows up in MediaWiki 1.7.3 through 1.9.3 with the enwiki-20070206 XML dumps.
This tool allows the XML dumps to be imported into MediaWiki 1.9.3 (an import which routinely fails on most Linux distros) by removing corrupted titles. The tool also supports inserting interwiki links back to the originating Wikipedia language site for each dump, which provides a GFDL-compliant link back to the original article, its edit history, and its authors. I have also updated
http://meta.wikimedia.org/wiki/Data_dumps
with instructions on how to get around the problems with importing the dumps.
There were two titles which caused the failures:
Article number 2698248: <title>Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Greatest Hits Volume One (The Byrds)</title>
(I have censored part of the text in the second article title below and replaced it with X characters, as it is inappropriate language to post to a public mailing list.)
Article Number 4443711: <title>Wikipedia:Miscellany for deletion/Former (XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX) by Ashley Giles and Bradley Hogg.</title>
It may be prudent to consider a filter, applied when the dumps are made, that removes titles longer than 256 characters which contain multi-part paths. There were two such titles in the last dumps, and both cause importDump.php to fail with NULL title errors.
Jeff
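A rough sketch of the kind of filter suggested above, written in Python for illustration only. It is not taken from gfdl-wikititle or from the dump software; the function name, the regular expression, and the default limit are all assumptions. It scans dump lines for <title> elements and reports those that are over the limit and contain a "/" path separator.

```python
import html
import re

# Assumed limit in bytes; pass a different value if your schema differs.
DEFAULT_LIMIT = 255

# Hypothetical pattern: assumes each <title> element sits on a single line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def find_long_titles(lines, limit=DEFAULT_LIMIT):
    """Yield (line_number, byte_length, title) for over-long multi-part titles."""
    for lineno, line in enumerate(lines, start=1):
        match = TITLE_RE.search(line)
        if not match:
            continue
        # Undo XML escaping (&amp; etc.) so the byte count reflects the real title.
        title = html.unescape(match.group(1))
        size = len(title.encode("utf-8"))
        # Only report titles that are both too long and contain a path separator.
        if size > limit and "/" in title:
            yield lineno, size, title
```

Since an open file is an iterable of lines, the function can be fed a dump file handle directly and will stream through it without loading the whole dump into memory.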
Jeffrey V. Merkey wrote:
The updated gfdl-wikititle program has been posted to:
http://meta.wikimedia.org/wiki/Gfdl-wikititle
This program is released under the GNU GPL version 3.
GNU GPL version 3 has not been released yet. Please don't use draft licenses.
The problem appears to be related to titles which A) exceed 256 bytes in length
The page_title field is limited to 255 bytes; you should probably check if your namespaces are set up properly. If they are not, the 'Wikipedia:' portion will be injected into the main title instead of being used in the namespace as designed.
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
The problem appears to be related to titles which A) exceed 256 bytes in length
The page_title field is limited to 255 bytes; you should probably check if your namespaces are set up properly. If they are not, the 'Wikipedia:' portion will be injected into the main title instead of being used in the namespace as designed.
(Of course it shouldn't be crashing anyway; it should just crop or reject the title. I'll try and get that fixed.)
-- brion vibber (brion @ wikimedia.org)
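To illustrate the point about the 255-byte page_title limit and the namespace prefix, here is a small Python sketch. It is a simplification and not MediaWiki's actual title parsing (which also normalizes case, spaces, and underscores); the function names are hypothetical. The sample title is the first one quoted in this thread, with seven repetitions of the prefix.

```python
PAGE_TITLE_MAX_BYTES = 255  # limit quoted in the message above


def split_title(full_title, namespaces):
    """Split 'Prefix:Rest' into (namespace, page_title) if Prefix is a known namespace."""
    prefix, sep, rest = full_title.partition(":")
    if sep and prefix in namespaces:
        return prefix, rest
    # Unknown prefix: the whole string ends up in the main-namespace title.
    return "", full_title


def fits(full_title, namespaces):
    """Return True if the page_title part fits within the byte limit."""
    _, page_title = split_title(full_title, namespaces)
    return len(page_title.encode("utf-8")) <= PAGE_TITLE_MAX_BYTES


title = "Wikipedia:Articles for deletion/" * 7 + "Greatest Hits Volume One (The Byrds)"

print(fits(title, {"Wikipedia"}))  # True: prefix stripped, 250 bytes remain
print(fits(title, set()))          # False: prefix kept, 260 bytes
```

Under these assumptions the title only overflows when "Wikipedia:" is not recognized as a namespace, which is consistent with the namespace-setup explanation.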
Brion Vibber wrote:
Jeffrey V. Merkey wrote:
The updated gfdl-wikititle program has been posted to:
http://meta.wikimedia.org/wiki/Gfdl-wikititle
This program is released under the GNU GPL version 3.
GNU GPL version 3 has not been released yet. Please don't use draft licenses.
GNU GPL 3 (even the draft) reverts to a GPL 2 license since the language is "any past or future version". GPL 2 is fine for now as well. I have a problem with the GPL 2 language as written. I can release under the generic term "GPL license" and I think that addresses it. GPL 3 as worded is retroactive.
The problem appears to be related to titles which A) exceed 256 bytes in length
The page_title field is limited to 255 bytes; you should probably check if your namespaces are set up properly. If they are not, the 'Wikipedia:' portion will be injected into the main title instead of being used in the namespace as designed.
You should consider fixing the dump program to only output title names which are less than 256 bytes. The longer ones are causing the dumps not to work. I have not changed the namespace settings from the MediaWiki defaults. When I got into this, Brion, my whole premise was to use the MediaWiki software "as is" and get the dumps and all of it working "plug and play". I think I've achieved that goal now (though it's a lot of work and steps to actually replicate Wikipedia).
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Brion Vibber wrote:
The page_title field is limited to 255 bytes; you should probably check if your namespaces are set up properly. If they are not, the 'Wikipedia:' portion will be injected into the main title instead of being used in the namespace as designed.
Having WikiGadugi as ns:project may be making trouble here. Shouldn't it be translated into proper namespace id per the namespaces tag? Hmm, you seem to have (the same) content both on WikiGadugi:* and Wikipedia:*...
Platonides wrote:
Brion Vibber wrote:
The page_title field is limited to 255 bytes; you should probably check if your namespaces are set up properly. If they are not, the 'Wikipedia:' portion will be injected into the main title instead of being used in the namespace as designed.
Having WikiGadugi as ns:project may be making trouble here. Shouldn't it be translated into proper namespace id per the namespaces tag? Hmm, you seem to have (the same) content both on WikiGadugi:* and Wikipedia:*...
Ok, I think I found why I couldn't reproduce the bug; 'wikipedia' is set up as an interwiki prefix on my test wiki, which leads to somewhat funny behavior on import. (The prefix is stripped.) Using the actual namespace selection also produces the correct behavior (correct namespace interpolation).
A quick note on namespaces and import: currently mwdumper interprets namespace prefixes and numbers purely according to the <siteinfo> namespace list; importDump.php and Special:Import interpret them according to the local wiki's configuration.
In theory it would be possible to do a match-by-number for the wiki's import as well, though it's not clear that's always desirable. (For instance for custom namespaces it could produce garbage, with utterly mismatched namespaces.)
Swapping the name to a same-length one which is not a registered prefix, I do get the Null failure. Will poke about to work around that.
-- brion vibber (brion @ wikimedia.org)
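The difference between the two namespace-resolution strategies described in the note on namespaces and import can be sketched as follows. This is illustrative Python, not code from mwdumper, importDump.php, or Special:Import; the `resolve` function and both mapping tables are hypothetical. The only facts relied on are that the main namespace is number 0 and the project namespace is number 4 in MediaWiki, and that the local wiki in this thread names its project namespace WikiGadugi.

```python
NS_MAIN = 0  # MediaWiki's main (article) namespace


def resolve(title, name_to_id):
    """Map 'Prefix:Rest' to (namespace_id, page_title) using the given prefix table."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in name_to_id:
        return name_to_id[prefix], rest
    # Unrecognized prefix: treated as part of a main-namespace title.
    return NS_MAIN, title


# Prefix table as it might be read from the dump's <siteinfo> block.
dump_siteinfo = {"Talk": 1, "Wikipedia": 4}
# Prefix table of a local wiki whose project namespace has a different name.
local_config = {"Talk": 1, "WikiGadugi": 4}

print(resolve("Wikipedia:Articles for deletion/Foo", dump_siteinfo))
# (4, 'Articles for deletion/Foo')
print(resolve("Wikipedia:Articles for deletion/Foo", local_config))
# (0, 'Wikipedia:Articles for deletion/Foo')
```

With the dump's own table the prefix is consumed and the page lands in namespace 4; with the local table the prefix stays in the title, which makes the stored title ten bytes longer.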
Brion Vibber wrote:
Swapping the name to a same-length one which is not a registered prefix, I do get the Null failure. Will poke about to work around that.
As of r20828 on trunk, invalid titles are now skipped over with a warning during import.
-- brion vibber (brion @ wikimedia.org)
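The skip-with-a-warning behaviour described above could look roughly like the following. This is a Python sketch of the general idea only, not the r20828 change itself (which is PHP and is not reproduced here); `import_pages` and the validity callback are hypothetical names.

```python
import logging


def import_pages(pages, is_valid_title):
    """Return the (title, text) pairs that pass validation, warning about the rest."""
    imported = []
    for title, text in pages:
        if not is_valid_title(title):
            # Skip the page instead of aborting the whole import run.
            logging.warning("Skipping invalid title: %r", title)
            continue
        imported.append((title, text))
    return imported
```

The validity check is passed in as a callback so that the length limit and namespace rules stay outside the import loop; one bad record then costs a log line rather than the rest of the dump.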
Brion Vibber wrote:
Brion Vibber wrote:
Swapping the name to a same-length one which is not a registered prefix, I do get the Null failure. Will poke about to work around that.
As of r20828 on trunk, invalid titles are now skipped over with a warning during import.
Brion,
That's great. You may want to post an incremental update release of MediaWiki 1.9.4 (?) so we can advertise the dumps work "as is" with MediaWiki. I will update the Data Dumps page after I download the trunk changes and RTQA (regression test) the changes.
Jeff
Jeffrey V. Merkey wrote:
Article number 2698248:
<title>Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Greatest Hits Volume One (The Byrds)</title>
This title does not appear to fail on import with current code.
-- brion vibber (brion @ wikimedia.org)
Brion Vibber wrote:
Jeffrey V. Merkey wrote:
Article number 2698248:
<title>Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Greatest Hits Volume One (The Byrds)</title>
This title does not appear to fail on import with current code.
It fails on FC5. I have plenty of logs if you wish to review them. I can also grant you an account on the target system so you can watch it fail in real time if you like.
Jeff
Jeffrey V. Merkey wrote:
Brion Vibber wrote:
Jeffrey V. Merkey wrote:
Article number 2698248:
<title>Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Greatest Hits Volume One (The Byrds)</title>
This title does not appear to fail on import with current code.
It fails on FC5. I have plenty of logs if you wish to review them. I can also grant you an account on the target system and you can watch it fail real-time if you like.
Jeff
And the title is over 256 bytes. I suspect it's related to the namespace settings you described. I am using the defaults in MediaWiki.
Jeff