Regarding UTF-8 support: perhaps it would be better if I try to explain some of the problems I'm facing. For example, I'm not tracking the most frequently used English words (a, the, and, or, …). In my opinion every language should be tweaked separately, which is why I'm suggesting we first test this on the English Wikipedia.

Finding spaces in UTF-8 encoded strings and splitting there is not a problem. The problem is that some Unicode characters, like ẅ (Latin small letter w with diaeresis, U+1E85), are used to write words, while others, such as ‘ (left single quotation mark, U+2018), are used to separate words. I also believe these characters can appear encoded as HTML entities in wikitext. As I'm tracking words, I need to distinguish between these "character classes", as they are known in regular expressions (i.e. \w, a word character, versus \W, a non-word character).

If Tim Starling has a silver bullet that solves these problems, feel free to e-mail it to me. In my opinion, however, implementing that kind of UTF-8 support from scratch can be a somewhat tricky business. The bottom line is that the problems above *can* be solved, but what I suggest is trying this on the English Wikipedia first, to see how it works in general and whether it's a useful feature. Support for other languages could and should be added later, one language at a time.
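Roughly what I mean, as a minimal PHP sketch (assuming PCRE is built with Unicode property support, i.e. the \p{...} escapes and the /u modifier; the function name and the entity-decoding step are illustrative only):

    <?php
    // Rough sketch of Unicode-aware word splitting.
    // Assumes PCRE supports \p{...} escapes and the /u modifier;
    // the function name here is illustrative, not an existing API.
    function splitWords( $text ) {
        // Wikitext may carry characters as HTML entities
        // (&#x2018; and friends), so decode those first.
        $text = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );

        // Letters, combining marks and digits count as word
        // characters; everything else (spaces, U+2018, ...)
        // separates words.
        return preg_split( '/[^\p{L}\p{M}\p{N}]+/u', $text,
            -1, PREG_SPLIT_NO_EMPTY );
    }

    // ẅ (U+1E85) stays inside its word; the quotation marks split:
    // prints Array ( [0] => daẅn [1] => word )
    print_r( splitWords( "da&#x1E85;n ‘word’" ) );
    ?>

The point is that the split class has to be defined in terms of Unicode properties rather than the ASCII-only \w.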
On 08/06/06, Rob Church robchur@gmail.com wrote:
On 08/06/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Gerard Meijssen wrote:
Hoi, this small Unicode issue is a show-stopper. When software that only works on Latin script is suggested, you do not appreciate the amount of work that is done in other scripts using the MediaWiki software.
"You do not appreciate" - rather a confrontational tone, there. Who are we to assume that someone else doesn't appreciate the amount of effort put in elsewhere? It might be correct, but then again, there might be no specific bias against it.
Apart from that, why would it be boring? This is a technical list. Personally, I am interested in two things as well: what other projects are you referring to, and how do you want to see this attribution done?
Apart from why what would be boring? The post was to get feedback; don't withhold it. I would imagine standard attribution for the code under the GNU GPL, blah blah blah. We won't be adding flashing banners: "Wikipedia now uses a feature from XYZ". Or are we to start crediting developers with individual features? "Thanks for clearing your watchlist, c/o Rob Church."
I discussed Unicode support with the original poster on IRC. I couldn't get through to him that adding UTF-8 support to a PHP application is trivial,
My impression of the poster was that he didn't completely understand the whole UTF-8/Unicode/blah thing or its implications, and seemed somewhat confused.
and requires no special UTF-8 support within PHP itself. MediaWiki's UTF-8 support is mostly implemented from scratch using PHP's binary-safe string handling. My wikidiff2 module in C++ also contains a simple UTF-8 decoder within the word splitting routine. It's not difficult.
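For illustration, the core of such a decoder is only a few lines. Here is a rough PHP version using nothing but binary-safe string operations (an illustrative sketch, not the actual MediaWiki or wikidiff2 code, and it skips validation of malformed input):

    <?php
    // Toy UTF-8 decoder built on binary-safe string handling.
    // Illustrative only -- not the MediaWiki or wikidiff2 code --
    // and real code must also reject malformed sequences.
    function nextCodePoint( $s, &$i ) {
        $c = ord( $s[$i] );
        // The lead byte tells us the sequence length and
        // contributes the high bits of the code point.
        if ( $c < 0x80 )                 { $len = 1; $cp = $c; }
        elseif ( ( $c & 0xE0 ) == 0xC0 ) { $len = 2; $cp = $c & 0x1F; }
        elseif ( ( $c & 0xF0 ) == 0xE0 ) { $len = 3; $cp = $c & 0x0F; }
        else                             { $len = 4; $cp = $c & 0x07; }
        // Fold in the 6 payload bits of each continuation byte.
        for ( $j = 1; $j < $len; $j++ ) {
            $cp = ( $cp << 6 ) | ( ord( $s[$i + $j] ) & 0x3F );
        }
        $i += $len;
        return $cp;
    }

    $s = 'ẅ';  // U+1E85, three bytes in UTF-8
    $i = 0;
    printf( "U+%04X\n", nextCodePoint( $s, $i ) );  // prints U+1E85
    ?>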
If the *idea* is found to be viable, adding the UTF-8 goodies will be trivial, and we'll put the damn effort in.
Rob Church