On 6/8/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Roman Nosov wrote:
Regarding UTF-8 support, perhaps it would be better if I try to explain some of the problems I'm facing. For example, I'm not tracking the most frequently used English words (a, the, and, or …). In my opinion every language should be tweaked separately, and that's why I'm suggesting we first test it on the English Wikipedia. I also don't have a problem with finding spaces in UTF-8 encoded strings and splitting there. The problem is that some Unicode characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85), are used to write words, while others, such as ' (left single quotation mark, Unicode code 0x2018), are used to separate words. I also believe these characters could be encoded as HTML entities in wikitext. As I'm tracking words, I need to distinguish between these "character classes" as they are known in regular expressions (i.e. \w word character and \W non-word character). If Tim Starling has a silver bullet that can solve these problems, feel free to e-mail it to me. However, in my opinion, implementing that kind of UTF-8 support from scratch can be a somewhat tricky business. The bottom line is that the problems above *can* be solved, but what I suggest is to try it on the English Wikipedia first, to see how it's going to work in general and whether it's a useful feature. Support for other languages could and should be added later, one language at a time.
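To make the entity wrinkle concrete: in wikitext the same character can arrive either as literal UTF-8 or as a numeric HTML entity, so "&#x1E85;" and the raw ẅ bytes have to map to the same code point before any word/non-word decision is made. A minimal sketch (my illustration, not Roman's code; the function name is made up):

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Parse a numeric entity like "&#x1E85;" or "&#8216;" into a code point;
// returns 0 if the string is not a well-formed numeric entity.
unsigned long entity_to_codepoint(const char *s) {
    if (std::strncmp(s, "&#", 2) != 0) return 0;
    const char *p = s + 2;
    int base = 10;
    if (*p == 'x' || *p == 'X') { base = 16; ++p; }
    char *end = nullptr;
    unsigned long cp = std::strtoul(p, &end, base);
    return (end && *end == ';') ? cp : 0;
}

int main() {
    std::printf("%#lx\n", entity_to_codepoint("&#x1E85;")); // 0x1e85 (ẅ)
    std::printf("%#lx\n", entity_to_codepoint("&#8216;"));  // 0x2018 (left single quotation mark)
}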
High-numbered punctuation characters are rare, so the approach I took in wikidiff2 was to consider them part of the word. I treated all non-alphanumeric characters below 0xC0 as word-splitting punctuation characters.
The Unicode character database actually includes information on which characters are letters, which are punctuation, and so on. Some programming languages expose this through appropriate functions such as isletter(), ispunct() or the like. I believe Perl has them. I don't know whether PHP has them or not, but if it doesn't, that might be considered a bug.
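For what it's worth, the closest C/C++ equivalent is the wide-character ctype family, which consults the platform's locale data (derived from the Unicode character database). A minimal sketch, assuming a UTF-8 locale named "en_US.UTF-8" is installed; the exact results depend on the platform's tables:

#include <clocale>
#include <cstdio>
#include <cwctype>

int main() {
    // Assumes this locale exists on the system; otherwise the calls
    // fall back to the plain "C" locale and only see ASCII.
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::printf("U+1E85 letter?      %d\n", std::iswalpha(0x1E85) != 0); // expect 1 (ẅ)
    std::printf("U+2018 punctuation? %d\n", std::iswpunct(0x2018) != 0); // expect 1 (')
}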
There are three languages that I'm aware of that don't use spaces to separate words, and thus require special handling: Chinese, Japanese and Thai. They are the only ones that I was able to find while searching the web for word segmentation information, and nobody from any other language wiki has complained.
The other language I can think of that doesn't use spaces is Khmer, but it doesn't have many fonts yet, so there are very few web sites, if any, and surely no wikis. Some other Southeast Asian scripts may fall into the same category.
Chinese and Japanese are adequately handled by doing character-level diffs -- I received lots of praise from the Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation for search or machine translation is a much more difficult problem, but luckily solving it is unnecessary for diff formatting. Character-level diffs may well be superior anyway.
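A rough sketch of what character-level splitting means in practice (my illustration, not the actual wikidiff2 tokenizer): the text is cut at every UTF-8 code-point boundary, so each CJK character becomes its own diff unit.

#include <cstdio>
#include <string>
#include <vector>

// Split a UTF-8 string into one token per code point (assumes valid UTF-8).
std::vector<std::string> split_codepoints(const std::string &s) {
    std::vector<std::string> out;
    for (size_t i = 0; i < s.size(); ) {
        size_t len = 1;
        unsigned char c = s[i];
        if      ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}

int main() {
    for (const auto &t : split_codepoints("日本語")) // three tokens, one per character
        std::printf("[%s]\n", t.c_str());
}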
For Thai I am using character-level diffs, and although I haven't received any complaints from the Wikipedians, I believe this is less than ideal. Thai has lots of composing characters, so you often end up highlighting little dots on top of letters and the like. Really what is required here is dictionary-based word segmentation.
I believe there are free dictionary-based word segmentation algorithms available for Thai. The approach is known not to be perfect, but I'm not aware of any free Thai word segmenters that do better.
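In its simplest form, dictionary-based segmentation is a greedy longest-match scan. Here is a toy sketch; the two-word dictionary and the function names are mine, not from any real Thai segmenter, and real ones handle unknown words and ambiguity much better:

#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence starting at s[i] (assumes valid UTF-8).
size_t cp_len(const std::string &s, size_t i) {
    unsigned char c = s[i];
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

// Greedy longest-match: at each position take the longest dictionary word,
// otherwise emit a single code point and move on.
std::vector<std::string> segment(const std::string &text,
                                 const std::set<std::string> &dict,
                                 size_t max_word_bytes) {
    std::vector<std::string> out;
    for (size_t i = 0; i < text.size(); ) {
        size_t take = cp_len(text, i);
        for (size_t len = max_word_bytes; len > take; --len) {
            if (i + len <= text.size() && dict.count(text.substr(i, len))) {
                take = len;
                break;
            }
        }
        out.push_back(text.substr(i, take));
        i += take;
    }
    return out;
}

int main() {
    // Hypothetical two-word "dictionary" just to show the mechanics.
    std::set<std::string> dict = {"สวัสดี", "ครับ"};
    for (const auto &w : segment("สวัสดีครับ", dict, 18))
        std::printf("[%s]\n", w.c_str()); // [สวัสดี][ครับ]
}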
Andrew Dunbar (hippietrail)
Our search engine is also next to useless on the Thai Wikipedia due to the lack of word segmentation. But that's not a problem Roman has to solve.
Putting all that together, here's how I detect word characters in wikidiff2:
inline bool my_istext(int ch)
{
    // Standard alphanumeric
    if ((ch >= '0' && ch <= '9') ||
        (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') ||
        (ch >= 'a' && ch <= 'z'))
    {
        return true;
    }
    // Punctuation and control characters
    if (ch < 0xc0) return false;
    // Thai, return false so it gets split up
    if (ch >= 0xe00 && ch <= 0xee7) return false;
    // Chinese/Japanese, same
    if (ch >= 0x3000 && ch <= 0x9fff) return false;
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    // Otherwise assume it's from a language that uses spaces
    return true;
}
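A quick check (my addition, assuming my_istext() above is in scope) of how it treats the characters discussed earlier in the thread:

#include <cstdio>

int main() {
    std::printf("%d\n", my_istext(0x1E85)); // 1: ẅ counts as a word character
    std::printf("%d\n", my_istext(0x2018)); // 1: high punctuation is kept inside the word, as noted above
    std::printf("%d\n", my_istext(0x0E01)); // 0: Thai ก is in the 0xe00-0xee7 range, so it splits per character
    std::printf("%d\n", my_istext(0x65E5)); // 0: 日 is in the CJK range, so it splits per character
    std::printf("%d\n", my_istext(','));    // 0: ASCII punctuation splits words
}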
Now this might not sound "trivial" anymore. UTF-8 support is trivial, and I'll stand by that, but supporting all the languages of the world is not so trivial. Still, as you can see, language support isn't as hard as you might think, because a lot of research has already been done.
-- Tim Starling