Brion Vibber wrote:
Hypothetically, almost anyplace that crops strings or otherwise does internal string manipulation other than with Language::truncate() could end up spitting out bad UTF-8. :P
Web POST and GET input is sanitized (normalized and bad characters stripped) at the WebRequest level, but internal processing is not always pure, and in most cases output is not sanitized either. Incorrect cropping of long values in limited-length database fields is another possibility.
(Note that XML dump generation specifically runs a UTF-8 cleanup step on output; the XML dump output is thus guaranteed to be UTF-8-clean and NFC.)
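[The cropping hazard Brion describes comes down to cutting a byte string in the middle of a multibyte UTF-8 sequence. A minimal Python sketch of both the failure and the boundary-aware fix; `utf8_safe_crop` is an illustrative helper, not MediaWiki's `Language::truncate()`:]

```python
text = "naïve café"          # contains multibyte UTF-8 characters
raw = text.encode("utf-8")   # 'ï' encodes as two bytes: 0xC3 0xAF

# Naive byte-level crop can split a multibyte sequence,
# yielding invalid UTF-8 (here the cut lands inside 'ï').
try:
    raw[:3].decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 after naive crop")

def utf8_safe_crop(b: bytes, limit: int) -> bytes:
    """Crop to at most `limit` bytes, backing up to a character
    boundary so the result is always valid UTF-8."""
    b = b[:limit]
    while b:
        try:
            b.decode("utf-8")
            return b
        except UnicodeDecodeError:
            b = b[:-1]   # drop trailing bytes of a split character
    return b

print(utf8_safe_crop(raw, 3).decode("utf-8"))  # 'na' — the split 'ï' is dropped whole
```

A truncate helper for a limited-length database field has to do something like this backing-up step, or the stored value ends up as the mojibake Brian describes.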
This statement is true. I have never seen a Wikipedia XML dump with Unicode errors.
Jeff
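[The NFC half of the dump guarantee mentioned above is also easy to spot-check from the outside. A hedged Python sketch using only the standard library; this is not MediaWiki's actual cleanup code, just what an NFC check amounts to:]

```python
import unicodedata

# 'é' in decomposed (NFD) form: 'e' followed by a combining acute accent.
decomposed = "e\u0301"
print(unicodedata.is_normalized("NFC", decomposed))   # False — not NFC

# An NFC cleanup step rewrites it as the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u00e9")                           # True
```

Running such a check over dump text should find no un-normalized sequences if the cleanup step is doing its job.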