Mohamed Magdy wrote:
Reid Priedhorsky wrote:
Hi folks,
In our ongoing research here at UMN, we've discovered some reverts that introduce apparent character set problems; what seems to happen is that some Unicode characters are replaced by a character I don't recognize followed by a hexadecimal number. For example:
http://en.wikipedia.org/w/index.php?title=Dog&diff=58851026&oldid=58...
What I see is that a sequence of five characters that I don't have glyphs for, which show up as five boxes with the numbers "010337 01033F 01033D 010333 010343" in them, is replaced with the sequence "?df37?df3f?df3d?df33?df43", where ? is not the question mark but a black diamond with a white question mark in it (a zero byte?).
Do any of you have pointers to information about what is going on?
We are trying to devise a workaround that would result in revisions like this comparing identical.
I think this problem is in AntiVandalBot or in the machine running it. That is an old edit, by the way; maybe it is fixed now.
OK, that's good info. Thanks.
Re-reading my text above, I should clarify: we are making comparisons of revision text offline using our own custom software, and we would like pairs of revisions like the above to compare identical (after all, AntiVandalBot was trying to make a revert), but they don't because of this bug. I'm trying to scope the issue and devise a workaround.
As we're wandering through the historical dumps, it doesn't solve our problem that the bug has been fixed. :)
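One observation that may help scope it: the hex digits in the example (df37, df3f, df3d, df33, df43) are exactly the UTF-16 low surrogates of U+10337, U+1033F, U+1033D, U+10333 and U+10343. That is consistent with the high surrogate of each pair being lost and the low surrogate being written out as hex text, though I have not confirmed that this is what the bot actually did. Since the high surrogate is gone, the damaged text cannot be reliably repaired, but both revisions can be pushed through the same lossy mapping before comparing. Below is a minimal Python sketch of that idea. The function names are made up for illustration, and it assumes the black diamond is stored as U+FFFD and the hex is always four lowercase digits, as in the example above.

```python
REPLACEMENT = "\ufffd"  # assumption: the "black diamond" is stored as U+FFFD


def mangle_astral(text: str) -> str:
    """Rewrite every character above U+FFFF into the damaged form
    seen in the example: U+FFFD followed by the low surrogate in hex.
    (Hypothetical helper; the mapping is inferred from one diff only.)"""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Low surrogate of the UTF-16 pair for this code point.
            low = 0xDC00 + ((cp - 0x10000) & 0x3FF)
            out.append(f"{REPLACEMENT}{low:04x}")
        else:
            out.append(ch)
    return "".join(out)


def same_revision_text(a: str, b: str) -> bool:
    """Compare two revision texts, ignoring this particular damage."""
    return mangle_astral(a) == mangle_astral(b)


original = "\U00010337\U0001033F\U0001033D\U00010333\U00010343"
damaged = "\ufffddf37\ufffddf3f\ufffddf3d\ufffddf33\ufffddf43"
print(same_revision_text(original, damaged))
```

The mapping is deliberately one-way: undamaged text is converted to the damaged form rather than the reverse, because the low surrogate alone does not say which plane the original character came from. The cost is that two different characters sharing a low surrogate would compare equal, which seems acceptable for revert detection but is worth keeping in mind.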
Take care,
Reid