I totally agree with Timwi: proper Unicode support is a requirement, not a feature. However, can someone tell me why PHP comes with no appropriate out-of-the-box support for such a vital feature in the 21st century? The root cause of my diff engine ignoring Unicode at the moment is that many PHP functions simply don't work with UTF-8 encoded strings. The PHP team promises proper Unicode support only in version 6. Yeah, I guess we are still in the nineties …
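To make the complaint concrete, here is a minimal sketch of what goes wrong when UTF-8 text is handled byte by byte, which is how PHP's strlen() and substr() treat strings. It is written in Python rather than PHP purely for illustration, and the sample string and variable names are made up for this example.

```python
# Byte-oriented vs. code-point-oriented handling of the same UTF-8 text.
# The sample string is an illustration only: naive with a diaeresis,
# followed by a word wrapped in U+2018 / U+2019 quotation marks.
text = "na\u00efve \u2018quote\u2019"
data = text.encode("utf-8")

char_count = len(text)   # counts code points
byte_count = len(data)   # counts bytes, as a byte-oriented function would

# Cutting at a byte offset can split a multi-byte sequence in half.
cut = data[:3]           # "n", "a", and only the first byte of U+00EF
try:
    cut.decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(char_count, byte_count, truncated_ok)
```

The two counts differ for the same text, and the truncated byte string is no longer valid UTF-8. Any function that counts or cuts by bytes has this problem as soon as a non-ASCII character appears.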
However, I think it's much better to say honestly upfront that Unicode isn't properly supported than to claim that it is. For example, look no further than Wikipedia's current diff engine. Self-appointed Unicode expert Tim Starling brags that it is extremely easy to build UTF-8 support from scratch. Well, let's check that.
For example, if you use an ordinary single quote (the one from damned latin-1, which you can easily find on your keyboard) to separate two words in Wikipedia, there are no problems: the diff engine sees two separate words. However, if you use a left single quotation mark (Unicode code point U+2018, the one MS Word likes to use) to separate the two words, oops, now they are treated as one word.
Test case for everyone to check:
Using an ordinary single quote:
  First edit: One'two
  Second edit: One'three
  Diff engine output: correctly highlights the words "two" and "three"
Using a left single quotation mark (Unicode code point U+2018; you might need to type it rather than copy and paste it, of course all due to the excellent Unicode support in each and every e-mail program):
  First edit: One‘two
  Second edit: One‘three
  Diff engine output: incorrectly highlights both strings in full
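The test case above can be reproduced outside the wiki with a small sketch. One plausible explanation, which is an assumption and not taken from the actual diff engine code, is that the word splitter works on bytes and counts every byte from 0x80 upwards as part of a word. U+2018 is encoded in UTF-8 as the three bytes E2 80 98, so it would glue the two words together. The Python below (used instead of PHP only for illustration) compares such a hypothetical byte-oriented splitter with a Unicode-aware one; the function names are invented for this example.

```python
import re

# Hypothetical byte-oriented rule: ASCII letters, digits, underscore and
# EVERY byte >= 0x80 count as word characters (assumed, not the real code).
BYTE_WORD = re.compile(rb"[0-9A-Za-z_\x80-\xff]+")

# Unicode-aware rule: \w on a str matches letters and digits by code point,
# so punctuation such as U+2018 is not a word character.
UNICODE_WORD = re.compile(r"\w+")


def byte_words(s):
    """Split the UTF-8 bytes of s with the byte-oriented rule."""
    raw = s.encode("utf-8")
    return [m.group().decode("utf-8") for m in BYTE_WORD.finditer(raw)]


def unicode_words(s):
    """Split s by code point with the Unicode-aware rule."""
    return UNICODE_WORD.findall(s)


ascii_case = "One'two"          # ordinary single quote (0x27)
curly_case = "One\u2018two"     # left single quotation mark (U+2018)

print(byte_words(ascii_case))     # two words
print(byte_words(curly_case))     # one word: the quote is swallowed
print(unicode_words(curly_case))  # two words again
```

If the engine behaves like byte_words(), a change from "two" to "three" after U+2018 changes the only token on the line, which would match the whole string being highlighted.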
So my question to all the Unicode Nazis here is: why is a quote from the latin-1 charset treated *differently* from a slightly different Unicode quote?
On 08/06/06, Rob Church robchur@gmail.com wrote:
On 08/06/06, Timwi timwi@gmx.net wrote:
It is already confrontational of a programmer to pretend the whole world could make do with Latin-1. It is one of the most devastating and accordingly infuriating assumptions that still prevails despite the fact that Unicode is decades old. We're in the 21st century; it is no longer appropriate to even start programming anything where any user-visible text is restricted to Latin-1 or any other 8-bit charset.
Of course, of course, I clean forgot. Because a quick proof of concept has to be PERFECT, doesn't it. Do excuse that little oversight.
It's not perfect yet. Get over it and give some feedback on the idea.
Rob Church
_______________________________________________
Wikitech-l mailing list
Wikitech-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l