Hi Brion,
I have completed my analysis of the latest breakage in the XML dumps, where importDump.php fails on the enwiki-20070206 XML dump. The failure appears to be related to improper rendering of titles that use wiki text markup (<sup></sup>) when the dumps are converted from SQL to XML on the Wikimedia sites.
Here is an example of the article title in question, which uses characters outside of $wgLegalTitleChars as defined by default in MediaWiki 1.9.3:
Article Link:
http://en.wikipedia.org/wiki/P%C2%B2-irreducible
Wiki Text:
In [[mathematics]], a [[3-manifold]] is '''P<sup>2</sup>-irreducible''' if it is [[irreducible (mathematics)|irreducible]] and contains no [[2-sided]] <math>\mathbb RP^2</math> ([[real projective plane]]).
Article Title in enwiki-20070206 XML Dump File:
2698249:Image:Tripitaka storage2.jpg
2698250:Natalie Golda
2698251:Image:Big Passage outside Ampleforth College Library.jpg
2698252:Canine infectious hepatitis
2698253:P²-irreducible <-- article title causes importDump.php to fail
2698254:Image:Zebra sideview.jpg
2698255:Amy Freed
2698256:El cubano libre
2698257:Aujezky's disease virus
2698258:Wikipedia:Administrators' noticeboard/IncidentArchive92
2698259:Aujezky's disease
I have worked around this by allowing any character at all to appear in titles, i.e.
$wgLegalTitleChars = "\x0-\xFF";
It may be a good idea to add a title filter that renders characters not included in the default MediaWiki setup as UTF-8 strings. That would sidestep the perpetual breakage of importDump.php with the standard dumps provided by the Foundation.
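To make the filter idea concrete, here is a minimal Python sketch of what such a check might look like. It is only an illustration: ASSUMED_LEGAL is a simplified, ASCII-only stand-in and not MediaWiki's actual default $wgLegalTitleChars, and the function names illegal_chars and escape_title are hypothetical. Characters outside the assumed set are rewritten as percent-encoded UTF-8 bytes.

```python
import string

# ASSUMPTION: a simplified ASCII-only whitelist for illustration,
# not the real default value of $wgLegalTitleChars.
ASSUMED_LEGAL = set(
    string.ascii_letters + string.digits + " %!\"$&'()*,-./:;=?@\\^_`~+"
)


def illegal_chars(title):
    """Return the characters in title that fall outside the assumed set."""
    return [ch for ch in title if ch not in ASSUMED_LEGAL]


def escape_title(title):
    """Percent-encode the UTF-8 bytes of every character outside the assumed set."""
    out = []
    for ch in title:
        if ch in ASSUMED_LEGAL:
            out.append(ch)
        else:
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
    return "".join(out)


if __name__ == "__main__":
    title = "P\u00b2-irreducible"  # the title at entry 2698253
    print(illegal_chars(title))
    print(escape_title(title))
```

For the title above, the escaped form matches the percent-encoded name in the article link, so a filter along these lines would at least agree with how the site already addresses the page.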
I will update the pages on meta with information on how to get around all of the problems with importDump.php, based on the current state of the XML dumps. One good thing came out of this: I now have some very comprehensive C-based tools that can analyze all of these errors and determine where in the XML dumps the problems originate.
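For anyone who wants to do a similar scan without those tools, the following is a rough Python sketch, not the C tools themselves. It streams through a dump, counts <title> elements, and reports the ones that fail a caller-supplied check. The function name suspect_titles and the ASCII-only check in the usage example are assumptions for illustration, and the count is simply the position of the title in the stream, not a page ID.

```python
import io
import xml.etree.ElementTree as ET


def suspect_titles(stream, is_ok):
    """Yield (position, title) for every <title> that fails is_ok.

    position is the 1-based count of titles seen so far in the stream.
    """
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Drop any XML namespace prefix such as "{http://...}title".
        local = elem.tag.rsplit("}", 1)[-1]
        if local == "title":
            count += 1
            title = elem.text or ""
            if not is_ok(title):
                yield count, title
        elif local == "page":
            # Free the finished page so memory stays bounded on large dumps.
            elem.clear()


if __name__ == "__main__":
    # Tiny hand-made sample, not real dump content.
    sample = (
        b"<mediawiki>"
        b"<page><title>Natalie Golda</title></page>"
        b"<page><title>P\xc2\xb2-irreducible</title></page>"
        b"</mediawiki>"
    )
    for position, title in suspect_titles(io.BytesIO(sample), str.isascii):
        print(position, title)
```

On a real dump the stream would be the opened (and, if needed, decompressed) XML file, and is_ok would be whatever title check matches the importing wiki's configuration.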
Jeff