Well, it looks like my question about why some quotation marks break words and others don't will remain unanswered ("rareness" of high-numbered punctuation doesn't make it part of a word) … Anyway, if that level of UTF-8 support is sufficient for MediaWiki, then the Unicode issue is "solved". Unicode über alles.
On 08/06/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Roman Nosov wrote:
Regarding UTF-8 support: perhaps it would be better if I try to explain some of the problems I'm facing. For example, I'm not tracking the most frequently used English words (a, the, and, or …). In my opinion every language should be tweaked separately, and that's why I'm suggesting we first test it on the English Wikipedia. Also, I don't have a problem with finding spaces in UTF-8 encoded strings and splitting them there. The problem is that some Unicode characters, like ẅ (letter w with two dots on top, Unicode code 0x1E85), are used to write words, while others, such as ‘ (left single quotation mark, Unicode code 0x2018), are used to separate words. I also believe these characters could be encoded as HTML entities in Wikitext. As I'm tracking words, I need to distinguish between these "character classes", as they are known in regular expressions (i.e. \w, a word character, and \W, a non-word character). If Tim Starling has a silver bullet that can solve these problems, feel free to e-mail it to me. However, in my opinion, implementing that kind of UTF-8 support from scratch can be a somewhat tricky business. The bottom line is that the problems above *can* be solved, but what I suggest is to try it on the English Wikipedia first, to see how it works in general and whether it's a useful feature. Support for other languages could and should be added later, one language at a time.
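To make the character-class problem concrete, here is a minimal sketch (not from any MediaWiki code; the function name is invented for illustration) of why a naive "anything above ASCII is a word character" rule gets ẅ right but the quotation mark wrong:

#include <cstdio>

// Hypothetical naive classifier: treats every non-ASCII code point as
// part of a word.
bool naive_is_word_char(int cp) {
    if ((cp >= '0' && cp <= '9') || (cp >= 'A' && cp <= 'Z') ||
        (cp >= 'a' && cp <= 'z') || cp == '_')
        return true;
    return cp > 0x7f;   // right for U+1E85, wrong for U+2018
}

int main() {
    int examples[] = { 0x1E85 /* w with diaeresis */, 0x2018 /* left quote */ };
    for (int cp : examples)
        std::printf("U+%04X -> %s\n", cp,
                    naive_is_word_char(cp) ? "word" : "non-word");
    return 0;
}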
High-numbered punctuation characters are rare, so the approach I took in wikidiff2 was to consider them part of the word. I considered all non-alphanumeric characters less than 0xc0 as word-splitting punctuation characters. There are three languages that I'm aware of that don't use spaces to separate words, and thus require special handling: Chinese, Japanese and Thai. They are the only ones that I was able to find while searching the web for word segmentation information, and nobody from any other language wiki has complained. Chinese and Japanese are adequately handled by doing character-level diffs -- I received lots of praise from the Japanese Wikipedia for this scheme. Chinese and Japanese word segmentation for search or machine translation is a much more difficult problem, but luckily solving it is unnecessary for diff formatting. Character-level diffs may well be superior anyway.
For Thai I am using character-level diffs, and although I haven't received any complaints from the Wikipedians, I believe this is less than ideal. Thai has lots of composing characters, so you often end up highlighting little dots on top of letters and the like. Really what is required here is dictionary-based word segmentation. Our search engine is also next to useless on the Thai Wikipedia due to the lack of word segmentation. But that's not a problem Roman has to solve.
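For what it's worth, the greedy longest-match idea behind dictionary-based segmentation looks roughly like this sketch (the dictionary and the Latin-alphabet example are stand-ins; a real Thai segmenter needs a proper lexicon and a fallback for unknown substrings):

#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Greedy longest-match segmentation over a toy dictionary.
std::vector<std::string> segment(const std::string &text,
                                 const std::set<std::string> &dict,
                                 size_t max_word_bytes) {
    std::vector<std::string> words;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t len = std::min(max_word_bytes, text.size() - pos);
        // Try the longest dictionary match first, shrink until one fits;
        // if nothing matches, fall back to a single unit.
        while (len > 1 && dict.find(text.substr(pos, len)) == dict.end())
            --len;
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}

int main() {
    // Toy example so the idea is visible without a Thai lexicon.
    std::set<std::string> dict = {"in", "into", "to", "the", "there"};
    for (const std::string &w : segment("intothere", dict, 5))
        std::cout << w << '\n';   // prints "into", then "there"
    return 0;
}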
Putting all that together, here's how I detect word characters in wikidiff2:
inline bool my_istext(int ch) {
    // Standard alphanumeric
    if ((ch >= '0' && ch <= '9') || (ch == '_') ||
        (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
    {
        return true;
    }
    // Punctuation and control characters
    if (ch < 0xc0) return false;
    // Thai, return false so it gets split up
    if (ch >= 0xe00 && ch <= 0xee7) return false;
    // Chinese/Japanese, same
    if (ch >= 0x3000 && ch <= 0x9fff) return false;
    if (ch >= 0x20000 && ch <= 0x2a000) return false;
    // Otherwise assume it's from a language that uses spaces
    return true;
}
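For context, here is a rough sketch (not the actual wikidiff2 tokenizer) of how a classifier like this can drive the splitting: decode each UTF-8 sequence to a code point, keep runs of word characters together as one token, and emit everything else as single-character tokens, which is what produces character-level diffs for the CJK and Thai ranges:

#include <string>
#include <vector>

// Decode one UTF-8 sequence starting at text[pos]; advances pos past it.
// Assumes well-formed input for the sake of the sketch.
static int next_code_point(const std::string &text, size_t &pos) {
    unsigned char c = text[pos++];
    if (c < 0x80) return c;
    int extra = (c >= 0xf0) ? 3 : (c >= 0xe0) ? 2 : 1;
    int cp = c & (0x3f >> extra);
    while (extra-- > 0 && pos < text.size())
        cp = (cp << 6) | (static_cast<unsigned char>(text[pos++]) & 0x3f);
    return cp;
}

// Split UTF-8 text into diff tokens: runs of "text" code points become one
// token, everything else (punctuation, CJK, Thai) becomes a one-character
// token.
std::vector<std::string> split_tokens(const std::string &text,
                                      bool (*is_text)(int)) {
    std::vector<std::string> tokens;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t start = pos;
        if (is_text(next_code_point(text, pos))) {
            // Extend the token over the whole run of word characters.
            while (pos < text.size()) {
                size_t peek = pos;
                if (!is_text(next_code_point(text, peek))) break;
                pos = peek;
            }
        }
        tokens.push_back(text.substr(start, pos - start));
    }
    return tokens;
}

// Usage: split_tokens(line, my_istext);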
Now this might not sound "trivial" anymore. UTF-8 support is trivial, and I'll stand by that, but supporting all the languages of the world is not so trivial. Still, as you can see, language support isn't as hard as you might think, because lots of research has already been done.
-- Tim Starling