On 6/8/06, Roman Nosov rnosov@gmail.com wrote:
Well it looks like my question about why some quotation marks do break words and others don't will remain unanswered ("rareness" of high numbered punctuation doesn't make it part of a word) … Anyway if such level of supporting UTF-8 is sufficient for Mediawiki then Unicode issue is "solved". Unicode über alles.
I think it was adequately explained: the character isn't treated as a word break because the algorithm doesn't know it's a separation character, so nothing gets separated there. If the algorithm did know, the words would be separated properly.
So perhaps someone, like you, should submit a quick patch to that part of the diff engine, as outlined by Tim, that makes it properly interpret that code point. If there's a general rule or table in the Unicode standard then implementing that might be an even better option.
The Unicode site, by the way, is www.unicode.org, and you can find a database of Unicode character properties here:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
with information on interpreting them here:
http://ftp.lanet.lv/ftp/mirror/unicode/3.2-Update/UnicodeData-3.2.0.html
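To give a rough idea of what a "general rule" could look like, here is a minimal sketch in Python (not the language of the MediaWiki diff engine, and not its actual code). It assumes that every code point whose general category (the third field of each UnicodeData.txt line) starts with P (punctuation) or Z (separator) should break a word. The names is_separator and split_words are made up for this illustration.

```python
import unicodedata


def is_separator(ch):
    """Return True if ch should break a word.

    Assumption for this sketch: any punctuation (P*) or separator (Z*)
    general category counts, plus anything Python treats as whitespace
    (tabs and newlines are category Cc, so they need the extra check).
    """
    category = unicodedata.category(ch)
    return category[0] in ("P", "Z") or ch.isspace()


def split_words(text):
    """Split text into words at every separator character."""
    words = []
    current = []
    for ch in text:
        if is_separator(ch):
            # Close the word in progress, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    # Curly quotes (U+201C, U+201D) are categories Pi and Pf.
    print(split_words("say \u201chello\u201d now"))
```

With a rule like this the "high numbered" quotation marks behave the same as ASCII ones, because the decision comes from the character property rather than from a hand-written list. One caveat: the apostrophe is also punctuation, so "don't" would be split in two; a real patch would have to decide how to treat cases like that.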
Enjoy!