Re: [Wikitech-l] character set problems in reverts

24 May 2007


Reid Priedhorsky wrote:
Hi folks,
In our ongoing research here at UMN, we've discovered some reverts that 
introduce apparent character set problems; what seems to happen is that 
some Unicode characters are replaced by a character I don't recognize 
followed by a hexadecimal number. For example:
http://en.wikipedia.org/w/index.php?title=Dog&diff=58851026&oldid=58...
What I see is that a sequence of five characters that I don't have 
glyphs for, which show up as five boxes with the numbers "010337 01033F 
01033D 010333 010343" in them, is replaced with the sequence 
"?df37?df3f?df3d?df33?df43", where ? is not the question mark but a 
black diamond with a white question mark in it (a zero byte?).
Does anyone have pointers to information about what is going on?
We are trying to devise a workaround that would result in revisions like 
this comparing identical.
The problem seems to be due to a bug in an old version of a Python 
library. See e.g.
http://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Inciden...
There are a number of reports in the bot's talk page. Apparently it took 
a block and a WP:AN/I report before it was fixed. It looks like it's 
converting a surrogate pair to a replacement character (U+FFFD) and a 
hexadecimal codepoint.
-- Tim Starling
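
For the workaround asked about above, here is a minimal sketch in Python 3. It assumes, based only on the single Dog diff quoted in this thread, that the corruption rewrites each character above U+FFFF as U+FFFD followed by the four lowercase hex digits of its UTF-16 low surrogate (U+10337 becomes U+FFFD plus "df37"). The high surrogate appears to be lost, so the damaged text cannot be repaired in general. Instead, the sketch pushes both revisions into the damaged form before comparing them. The function names are hypothetical and not taken from any bot or library.

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Matches any single character outside the Basic Multilingual Plane.
_SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")


def low_surrogate(ch: str) -> int:
    """Return the UTF-16 low (trail) surrogate of a character above U+FFFF."""
    return 0xDC00 + ((ord(ch) - 0x10000) & 0x3FF)


def to_damaged_form(text: str) -> str:
    """Rewrite supplementary characters the way the example diff shows them.

    Assumed pattern: U+FFFD followed by the low surrogate as 4 hex digits.
    """
    return _SUPPLEMENTARY.sub(
        lambda m: REPLACEMENT + format(low_surrogate(m.group()), "04x"),
        text,
    )


def same_modulo_damage(a: str, b: str) -> bool:
    """Compare two revisions, ignoring differences caused by the assumed bug."""
    return to_damaged_form(a) == to_damaged_form(b)


# The five Gothic letters from the example and their damaged counterpart.
original = "\U00010337\U0001033F\U0001033D\U00010333\U00010343"
damaged = "\ufffddf37\ufffddf3f\ufffddf3d\ufffddf33\ufffddf43"
print(same_modulo_damage(original, damaged))
```

Because the mapping is lossy, two different supplementary characters that share a low surrogate would also compare equal, and text that legitimately contains U+FFFD followed by hex digits is indistinguishable from damaged text. Whether that is acceptable depends on how the comparison is used.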
