[Foundation-l] GFDL-wikititle updated (resolves importDump.php failures with enwiki-20070206 XML Dumps)
Jeffrey V. Merkey
jmerkey at wolfmountaingroup.com
Thu Mar 29 09:28:17 UTC 2007
The updated gfdl-wikititle program has been posted to:
http://meta.wikimedia.org/wiki/Gfdl-wikititle
This program is released under the GNU GPL version 3. This release
corrects failures with importDump.php and will remove and fix
corrupted article titles in the XML dumps provided by the Foundation.
At present, two articles in the enwiki-20070206 dumps cause
importDump.php to fail with NULL title errors in MediaWiki 1.9.3, which
prevents the XML dumps from being imported. The problem appears to be
related to titles which A) exceed 256 bytes in length and B) also
contain multi-part paths. The same bug manifests in MediaWiki versions
1.7.3 through 1.9.3 with the enwiki-20070206 XML dumps.
This tool allows the XML dumps (which routinely fail to import on most
Linux distros) to be imported into MediaWiki 1.9.3 by removing corrupted
titles. This tool also supports insertion of interwiki links back into
the originating Wikipedia language site for each dump to provide a GFDL
compliant link back to the original article and its edit history and
authors. I have also updated
http://meta.wikimedia.org/wiki/Data_dumps
with instructions on how to get around the problems with importing the
dumps.
There were two titles which caused the failures:
Article number 2698248:
<title>Wikipedia:Articles for deletion/Wikipedia:Articles for
deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for
deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for
deletion/Wikipedia:Articles for deletion/Greatest Hits Volume One (The
Byrds)</title>
(I have censored the actual text in the following article title and
replaced it with X characters, as it is inappropriate language to post
to a public mailing list.)
Article Number 4443711:
<title>Wikipedia:Miscellany for deletion/Former
(XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX) by Ashley
Giles and Bradley Hogg.</title>
It may be prudent to consider a filter, applied when the dumps are made,
that removes titles which are longer than 256 characters and contain
multi-part paths.
There were two titles in the last dumps and they both cause
importDump.php to fail with NULL title errors.
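As a rough illustration of the kind of filter suggested above, here is a
minimal Python sketch. It is not the gfdl-wikititle program and not part
of MediaWiki. The function names, the regular expression, and the
reading of "multi-part path" as "contains a '/' separator" are all
assumptions made for this example, as is the assumption that each
<title> element sits on a single line of the dump.

```python
import html
import re

# Assumption: each <title> element is on one line of the XML dump.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

# Threshold taken from the description above (256 bytes).
MAX_TITLE_BYTES = 256


def is_suspect_title(title, max_bytes=MAX_TITLE_BYTES):
    """Return True if the title is over max_bytes in UTF-8 and contains a '/'."""
    too_long = len(title.encode("utf-8")) > max_bytes
    # Assumption: a "multi-part path" is a title with a '/' separator.
    multi_part = "/" in title
    return too_long and multi_part


def find_suspect_titles(lines):
    """Yield (line_number, title) for each suspect title in an iterable of lines."""
    for number, line in enumerate(lines, start=1):
        match = TITLE_RE.search(line)
        if match:
            # Decode XML entities such as &amp; before measuring the length.
            title = html.unescape(match.group(1))
            if is_suspect_title(title):
                yield number, title
```

Passing a dump file opened in text mode to find_suspect_titles() would
list the candidate titles and their line numbers, so they can be reviewed
before anything is removed.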
Jeff