Hi folks,
In our ongoing research here at UMN, we've discovered some reverts that introduce apparent character set problems; what seems to happen is that some Unicode characters are replaced by a character I don't recognize followed by a hexadecimal number. For example:
http://en.wikipedia.org/w/index.php?title=Dog&diff=58851026&oldid=58...
What I see is that a sequence of five characters that I don't have glyphs for, which show up as five boxes with the numbers "010337 01033F 01033D 010333 010343" in them, is replaced with the sequence "?df37?df3f?df3d?df33?df43", where ? is not the question mark but a black diamond with a white question mark in it (a zero byte?).
Do any of you have pointers on information as to what is going on?
We are trying to devise a workaround that would result in revisions like this comparing identical.
Many thanks,
Reid
I think this problem is in AntiVandalBot or on the machine running it. That is an old edit, btw; maybe it is fixed now.
Mohamed Magdy wrote:
I think this problem is in AntiVandalBot or on the machine running it. That is an old edit, btw; maybe it is fixed now.
OK, that's good info. Thanks.
Re-reading my text above, I should clarify: we are making comparisons of revision text offline using our own custom software, and we would like pairs of revisions like the above to compare identical (after all, AntiVandalBot was trying to make a revert), but they don't because of this bug. I'm trying to scope the issue and devise a workaround.
As we're wandering through the historical dumps, it doesn't solve our problem that the bug has been fixed. :)
Take care,
Reid
That would appear to be a bug in whatever bot tool was used to make the reversion last year.
The Gothic characters are outside Unicode's BMP (Basic Multilingual Plane), the first 65,536 code points (U+0000 through U+FFFF), which is the most widely supported part of Unicode. The tool appears to have had trouble either decoding or re-encoding them.
-- brion vibber (brion @ wikimedia.org)
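To make the point about the BMP concrete, here is a minimal Python sketch using the five code points Reid quoted. It reports which plane each one falls in and computes the UTF-16 surrogate pair needed to represent it. The helper name `surrogate_pair` is made up for this example.

```python
# Code points Reid reported seeing in the boxes (Gothic letters).
CODE_POINTS = [0x10337, 0x1033F, 0x1033D, 0x10333, 0x10343]


def surrogate_pair(cp):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = cp - 0x10000
    # The top 10 bits go into the high surrogate, the bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


for cp in CODE_POINTS:
    high, low = surrogate_pair(cp)
    print(f"U+{cp:06X} plane={cp >> 16} surrogates={high:04x} {low:04x}")
```

All five code points are in plane 1, and their low surrogates come out as df37, df3f, df3d, df33 and df43, which are exactly the hex digits that appear in the mangled text.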
The problem seems to be due to a bug in an old version of a Python library. See e.g.
http://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Inciden...
There are a number of reports on the bot's talk page. Apparently it took a block and a WP:AN/I report before it was fixed. It looks like it's converting each surrogate pair to a replacement character (U+FFFD) followed by the low surrogate written out in hexadecimal.
-- Tim Starling
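One possible workaround for the offline comparison, sketched in Python. Since the high surrogate appears to be lost, the mangled text cannot be repaired reliably, but the clean revision can be pushed through the same lossy mapping before comparing. This assumes the bug always emits U+FFFD followed by the four lowercase hex digits of the low surrogate, which is inferred only from the one diff in this thread, not from the bot's source. `mangle_like_bot` and `same_after_mangling` are hypothetical names.

```python
import re

# Matches any character outside the BMP (U+10000 and above).
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def mangle_like_bot(text):
    """Imitate the observed corruption: U+FFFD plus the low surrogate in hex.

    Assumption: this mirrors what the bot did; it is inferred from a single
    diff, not from the bot's source code.
    """
    def replace(match):
        low = 0xDC00 + ((ord(match.group()) - 0x10000) & 0x3FF)
        return f"\ufffd{low:04x}"

    return NON_BMP.sub(replace, text)


def same_after_mangling(a, b):
    """Compare two revision texts, ignoring the surrogate-pair corruption."""
    return mangle_like_bot(a) == mangle_like_bot(b)


clean = "Gothic: \U00010337\U0001033F\U0001033D\U00010333\U00010343"
broken = "Gothic: \ufffddf37\ufffddf3f\ufffddf3d\ufffddf33\ufffddf43"
print(same_after_mangling(clean, broken))
```

The comparison is deliberately lossy: two different non-BMP characters that share a low surrogate (U+10337 and U+10737, for instance) would be treated as equal, so it may be worth logging the revisions where the mapping actually changed something.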
wikitech-l@lists.wikimedia.org