[Foundation-l] GFDL-wikititle updated (resolves importDump.php failures with enwiki-20070206 XML Dumps)

Jeffrey V. Merkey jmerkey at wolfmountaingroup.com
Thu Mar 29 09:28:17 UTC 2007


The updated gfdl-wikititle program has been posted to:

http://meta.wikimedia.org/wiki/Gfdl-wikititle

This program is released under the GNU GPL version 3.  This release 
corrects failures with importDump.php by removing and fixing
corrupted article titles in the XML dumps provided by the Foundation.  
At present, two articles in the enwiki-20070206 dumps cause
importDump.php in MediaWiki 1.9.3 to fail with NULL title errors, 
which prevents the XML dumps from being imported into MediaWiki.
The problem appears to be related to titles which A) exceed 
256 bytes in length and B) also contain multi-part paths.  The same 
bug manifests in MediaWiki versions 1.7.3 through 1.9.3 
with the enwiki-20070206 XML dumps.

This tool allows the XML dumps to be imported into MediaWiki 1.9.3 
(which routinely fail to import on most Linux distros) by removing 
corrupted titles.  This tool also supports insertion of interwiki links 
back to the originating Wikipedia language site for each dump, providing 
a GFDL-compliant link back to the original article and its edit history 
and authors.  I have also updated

http://meta.wikimedia.org/wiki/Data_dumps

with instructions on how to get around the problems with importing the 
dumps. 

There were two titles which caused the failures:

Article number 2698248:
<title>Wikipedia:Articles for deletion/Wikipedia:Articles for 
deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for 
deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for 
deletion/Wikipedia:Articles for deletion/Greatest Hits Volume One (The 
Byrds)</title>

(I have censored the actual text in the following article title and 
replaced it with X characters, as it is inappropriate language to post 
to a public mailing list.)

Article Number 4443711:
<title>Wikipedia:Miscellany for deletion/Former 
(XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX) by Ashley 
Giles and Bradley Hogg.</title>

It may be prudent to consider a filter, applied when the dumps are 
made, that removes titles which are longer than 256 characters and 
contain multi-part paths.  There were two such titles in the last dumps, 
and both cause importDump.php to fail with NULL title errors.
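
As a rough illustration of the kind of filter suggested above, here is a
minimal Python sketch.  It is not the gfdl-wikititle code.  The names
is_suspect_title and find_suspect_titles are hypothetical, and the sketch
assumes that "multi-part path" means the title contains a "/" separator,
that length is measured in UTF-8 bytes, and that each <title> element
sits on a single line of the dump.

```python
import re

# Threshold taken from the message above; measuring it in UTF-8 bytes
# is an assumption.
MAX_TITLE_BYTES = 256

# Assumes each <title>...</title> element sits on a single line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def is_suspect_title(title, max_bytes=MAX_TITLE_BYTES):
    """Return True if the title exceeds max_bytes and contains a '/'.

    Treating '/' as the marker of a multi-part path is an assumption.
    """
    too_long = len(title.encode("utf-8")) > max_bytes
    multi_part = "/" in title
    return too_long and multi_part


def find_suspect_titles(lines):
    """Yield every suspect title found in an iterable of dump lines."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match and is_suspect_title(match.group(1)):
            yield match.group(1)
```

Passing an open file object for the uncompressed dump to
find_suspect_titles would list candidate titles for review before
running importDump.php.  The sketch only reports titles; it does not
rewrite the dump.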

Jeff



