Hi Brion,
I have completed my analysis of the latest breakage in the XML dumps. importDump fails on the enwiki-20070206 XML dump, and the failure appears to be related to improper rendering of titles that contain wiki text markup (<sup></sup>) when the dumps are converted from SQL to XML on the Wikimedia sites.
Here is an example of the article title in question, which uses characters outside of $wgLegalTitleChars as defined by default in MediaWiki 1.9.3:
Article Link:
http://en.wikipedia.org/wiki/P%C2%B2-irreducible
Wiki Text:
In [[mathematics]], a [[3-manifold]] is '''P<sup>2</sup>-irreducible''' if it is [[irreducible (mathematics)|irreducible]] and contains no [[2-sided]] <math>\mathbb RP^2</math> ([[real projective plane]]).
Article Title in enwiki-20070206 XML Dump File:
2698249:Image:Tripitaka storage2.jpg
2698250:Natalie Golda
2698251:Image:Big Passage outside Ampleforth College Library.jpg
2698252:Canine infectious hepatitis
2698253:P²-irreducible <-- article title causes importDump.php to fail
2698254:Image:Zebra sideview.jpg
2698255:Amy Freed
2698256:El cubano libre
2698257:Aujezky's disease virus
2698258:Wikipedia:Administrators' noticeboard/IncidentArchive92
2698259:Aujezky's disease
I have worked around this by allowing any character at all to appear in titles, i.e.
$wgLegalTitleChars = "\x0-\xFF";
In the future it may be a good idea to implement a title filter that renders characters not included in the default MediaWiki setup as UTF-8 strings. That would sidestep the perpetual breakage of importDump.php with the standard dumps provided by the foundation.
I will update the pages on meta with this information on how to get around all of the problems with importDump.php, based on the current state of the XML dumps. One good thing came out of it: I managed to get some very comprehensive C-based tools written that can analyze all of these errors and determine where in the XML dumps the problems seem to originate.
Jeff
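For readers who want to produce a numbered title listing like the one above, here is a minimal Python sketch. It is not the C tooling mentioned in this message. It streams <title> elements from an export file and numbers them in order of appearance. The function name numbered_titles, the sample data, and the export-0.3 namespace in the sample are assumptions for illustration, and the numbering need not match the article numbers quoted in this thread.

```python
import io
import xml.etree.ElementTree as ET


def numbered_titles(stream):
    """Yield (sequence_number, title) for each <title> element in the stream.

    The XML namespace is stripped from tag names, so the sketch does not
    depend on a particular export schema version.
    """
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "title":
            count += 1
            yield count, elem.text or ""
        elif local_name == "page":
            # Drop the finished page's children to keep memory use low.
            elem.clear()


# Hypothetical two-page sample; a real dump would be opened as a file.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    b"<page><title>Canine infectious hepatitis</title></page>"
    b"<page><title>P\xc2\xb2-irreducible</title></page>"
    b"</mediawiki>"
)

if __name__ == "__main__":
    for number, title in numbered_titles(io.BytesIO(SAMPLE)):
        print(f"{number}:{title}")
```

Because the parser works on the stream incrementally, the same function can be pointed at a multi-gigabyte dump file opened in binary mode.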
The default in 1.9.x is:
$wgLegalTitleChars = " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+";
This includes all multibyte characters due to the \x80-\xFF range near the end, including your example "²". The value used on Wikimedia is identical to the default except for the order of characters in the class:
$wgLegalTitleChars = "+ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF";
Did you perhaps accidentally remove the \x80-\xFF range at some stage?
-- Tim Starling
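The point about the \x80-\xFF range can be checked with a small Python sketch. It applies the character class quoted above to the UTF-8 bytes of a title. This is only an approximation: MediaWiki itself evaluates the class in PHP and performs further title checks, and the names LEGAL_CLASS and is_legal_title are made up for this example.

```python
import re

# The character class quoted above, written as a Python bytes pattern.
LEGAL_CLASS = rb""" %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+"""
LEGAL_TITLE = re.compile(b"[" + LEGAL_CLASS + b"]+")


def is_legal_title(title: str) -> bool:
    """Return True if every UTF-8 byte of the title is in the class."""
    return LEGAL_TITLE.fullmatch(title.encode("utf-8")) is not None


if __name__ == "__main__":
    for candidate in ("P\u00b2-irreducible", "P<sup>2</sup>-irreducible"):
        print(candidate.encode("utf-8"), is_legal_title(candidate))
```

The superscript two encodes to the two bytes C2 B2 (matching the %C2%B2 in the article link), and both fall inside \x80-\xFF, so the title passes. A title containing literal <sup> markup does not pass, because < and > are not in the class.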
No, I did not remove it. I am re-running importDump with debug logging enabled and under a debugger. The problem appears to be more involved than previously reported, which is why I have held off on posting the fix to the data dumps page on meta. Modifying the title chars fixed one title, only for another to crash further down in the dump, so I am re-running the program to debug further. It takes several hours to reach the point in the dump where I am seeing the corruption and the error. It should crash again in another 30 minutes or so, and then I can post-mortem it again.
Jeff
Confirmed the precise location of the failure. The number on the left-hand side is the article number.
2698244:Dog adenovirus
2698245:David A. Caputo
2698246:Famous Detective Conan (Case Closed
2698247:William Hughes Mulligan
THIS TITLE PRODUCES THE IMPORT DUMP FAILURE.
2698248:Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Wikipedia:Articles for deletion/Greatest Hits Volume One (The Byrds)
2698249:Image:Tripitaka storage2.jpg
2698250:Natalie Golda
2698251:Image:Big Passage outside Ampleforth College Library.jpg
2698252:Canine infectious hepatitis
2698253:P²-irreducible
2698254:Image:Zebra sideview.jpg
2698255:Amy Freed
Jeff
pretty long title, huh?
-- chris
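To put a number on "pretty long", the following Python sketch measures the title in UTF-8 bytes, both in full and with the first namespace prefix removed. The 255-byte limit is an assumption about the title column in the MediaWiki schema and should be checked against the version in use. Nothing in this thread establishes that length is the cause of the failure. The function name title_byte_lengths is made up for this example.

```python
# The title reported as 2698248, rebuilt from its repeated prefix.
PREFIX = "Wikipedia:Articles for deletion/"
TITLE = PREFIX * 7 + "Greatest Hits Volume One (The Byrds)"

# Assumed storage limit for a title, in bytes (verify against your schema).
ASSUMED_LIMIT = 255


def title_byte_lengths(title: str) -> tuple[int, int]:
    """Return (bytes in full title, bytes after the first 'Namespace:' prefix)."""
    full = len(title.encode("utf-8"))
    _namespace, _sep, rest = title.partition(":")
    return full, len(rest.encode("utf-8"))


if __name__ == "__main__":
    full, without_namespace = title_byte_lengths(TITLE)
    print("full title:", full, "bytes; over assumed limit:", full > ASSUMED_LIMIT)
    print("without namespace:", without_namespace, "bytes; over assumed limit:",
          without_namespace > ASSUMED_LIMIT)
```

Under that assumption the full string is 260 bytes, just over the limit, while the part after the first "Wikipedia:" prefix is 250 bytes, just under it. So whether length matters here depends on which form of the title the importer measures.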
Pretty broken title. Check the 20070206 XML dumps and guess what's in it -- this title.
:-)
Jeff
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l