Reid Priedhorsky wrote:
Hi folks,
In our ongoing research here at UMN, we've discovered some reverts that introduce apparent character set problems; what seems to happen is that some Unicode characters are replaced by a character I don't recognize followed by a hexadecimal number. For example:
http://en.wikipedia.org/w/index.php?title=Dog&diff=58851026&oldid=58...
What I see is that a sequence of five characters for which I have no glyphs, displayed as five boxes containing the numbers "010337 01033F 01033D 010333 010343", is replaced with the sequence "?df37?df3f?df3d?df33?df43", where ? is not the question mark but a black diamond with a white question mark in it (a zero byte?).
Do any of you have pointers to information about what is going on?
We are trying to devise a workaround that would result in revisions like this comparing identical.
The problem seems to be due to a bug in an old version of a Python library. See e.g.
http://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Inciden...
There are a number of reports on the bot's talk page. Apparently it took a block and a WP:AN/I report before it was fixed. It looks like it's converting each surrogate pair to a replacement character (U+FFFD) followed by the hexadecimal value of the pair's low surrogate.
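As a rough illustration of that reading, here is a minimal Python 3 sketch. It is an assumption based only on the example diff above, not the library's actual code: the function names `low_surrogate`, `mangle` and `comparison_key` are hypothetical. It reproduces the observed corruption and then uses the same mapping as a comparison key, so that a clean revision and its mangled revert compare identical.

```python
import re

# Matches any character outside the Basic Multilingual Plane, i.e. any
# character that UTF-16 encodes as a surrogate pair.
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")

def low_surrogate(ch):
    """Return the low (trailing) UTF-16 surrogate of a supplementary character."""
    return 0xDC00 + ((ord(ch) - 0x10000) & 0x3FF)

def mangle(text):
    """Hypothetical model of the bug: each surrogate pair becomes U+FFFD
    followed by the low surrogate as four lowercase hex digits."""
    return ASTRAL.sub(lambda m: "\ufffd%04x" % low_surrogate(m.group()), text)

def comparison_key(text):
    """Map clean and mangled text to the same string for diffing purposes.
    Mangled text contains no supplementary characters, so it passes through
    unchanged, while clean text is mangled the same way the bot did it."""
    return mangle(text)

# The five characters from the Dog diff (U+10337, U+1033F, U+1033D, U+10333, U+10343).
original = "\U00010337\U0001033F\U0001033D\U00010333\U00010343"
mangled = mangle(original)
print(ascii(mangled))
print(comparison_key(original) == comparison_key(mangled))
```

The key is lossy: the high surrogate is discarded, so two different characters that share a low surrogate (for example U+10337 and U+10737) map to the same key, as does text that genuinely contains U+FFFD followed by four hex digits. That is probably acceptable for deciding whether a revert restored the same content, but the key should not be used to repair the damaged text.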
-- Tim Starling