One of the fun new things in MediaWiki 1.4 is validation and normalization of UTF-8 text input. The wiki will strip out malformed and illegal UTF-8 sequences, and normalize combining character sequences to help avoid almost-but-not-quite-equal oddities. (See http://www.unicode.org/reports/tr15/ for background.)
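For anyone who hasn't bumped into this before, here's roughly what both halves look like in practice. This is just an illustration using PHP's intl Normalizer and mbstring rather than the wiki's own UtfNormal code, but it's the same idea:

  // "é" as one precomposed code point vs. "e" plus a combining acute accent:
  $precomposed = "\u{00E9}";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
  $decomposed  = "e\u{0301}";  // U+0065 + U+0301 COMBINING ACUTE ACCENT

  var_dump( $precomposed === $decomposed );                          // bool(false) -- different bytes
  var_dump( $precomposed === Normalizer::normalize( $decomposed ) ); // bool(true)  -- NFC makes them match

  // Malformed input: a bare 0xC3 lead byte with no continuation byte is not
  // valid UTF-8; that's the kind of thing the cleanup strips rather than stores.
  var_dump( mb_check_encoding( "caf\xC3", 'UTF-8' ) );               // bool(false)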
I figured it would be wise to do some spot-checks of the existing databases to see just how much trouble we're in already... I checked the October 30 'cur' table dumps for the Russian, Portuguese, and Korean Wikipedias.
The normalization routine UtfNormal::cleanUp() does a quick first pass to strip malformed UTF-8 byte sequences (this pass is extra-optimized for predominantly Latin or pure-ASCII text); if any characters turn up during that pass that might indicate a non-normalized string, a slower, full normalization pass is conducted.
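Roughly, the shape of it is something like this (hypothetical helper names, not the actual UtfNormal internals; the intl Normalizer stands in for the slow pass here):

  function cleanUpSketch( $s ) {
      // Fast path: pure ASCII can't be malformed or non-normalized.
      if ( preg_match( '/^[\x00-\x7F]*$/', $s ) ) {
          return $s;
      }

      // Quick pass: walk the bytes, dropping illegal sequences, and note whether
      // anything turned up (combining marks, Hangul jamo, compatibility
      // ideographs...) that could make the string non-normal.
      list( $s, $maybeNonNormal ) = quickStripMalformed( $s ); // hypothetical helper

      // Slow pass, only when needed: full NFC normalization.
      if ( $maybeNonNormal ) {
          $s = Normalizer::normalize( $s, Normalizer::FORM_C );
      }
      return $s;
  }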
Portuguese (40422 pages):
  text requiring slow check:       193 (0.5%)
  non-normal or invalid text:       13 (0.0%)
  non-normal or invalid title:       7 (0.0%)
  non-normal or invalid comment:     1 (0.0%)
(All of the broken titles and the comment are illegal 8-bit Latin-1 names on image pages, from an upload bot.)
Russian (14733 pages):
  text requiring slow check:       571 (3.9%)
  non-normal or invalid text:       18 (0.1%)
  non-normal or invalid comment:     3 (0.0%)
(A lot of these are Greek text fragments with non-normalized accent characters.)
Korean (7998 pages):
  text requiring slow check:       780 (9.6%)
  non-normal or invalid text:      745 (9.3%)
(Most of these are Han characters which appear in a special 'compatibility' duplicate encoding area and are normalized to the standard unified Han encoding of the same character. Many more are the Greek (!?) middle-dot character being replaced by the Latin one, which is the preferred encoding; this seems to get used in the formatting of lists.)
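For the curious, those two cases look like this under NFC -- shown here with the intl Normalizer purely for illustration:

  // CJK COMPATIBILITY IDEOGRAPH-F900 canonically maps to the unified ideograph
  // U+8C48, so NFC replaces it with the standard form:
  echo bin2hex( Normalizer::normalize( "\u{F900}" ) );  // "e8b188" == UTF-8 for U+8C48

  // GREEK ANO TELEIA (U+0387) canonically maps to the plain MIDDLE DOT (U+00B7):
  echo bin2hex( Normalizer::normalize( "\u{0387}" ) );  // "c2b7" == UTF-8 for U+00B7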
The full normalization check of Korean text is the worst case for my code -- every syllable gets decomposed into its constituent jamo and reassembled, and that can add about a second to the save/preview time for a 30k article on my (otherwise unloaded) 2GHz Athlon. Not too awful, all things considered (most articles are much shorter than 30k, and a hit on less than 10% of Korean-language edits shouldn't be a huge burden overall), but it should be possible to do much better by running the slow pass only on substrings around the 'maybe' points.
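To see why Hangul is the expensive case: precomposed syllables decompose algorithmically into two or three jamo per character, and NFC then has to recompose them all. A sketch of the standard decomposition arithmetic (straight from the Unicode standard, not the actual UtfNormal code):

  function decomposeHangul( $cp ) {
      $SBase = 0xAC00; $LBase = 0x1100; $VBase = 0x1161; $TBase = 0x11A7;
      $VCount = 21; $TCount = 28;
      $NCount = $VCount * $TCount; // 588
      $SCount = 11172;

      $sIndex = $cp - $SBase;
      if ( $sIndex < 0 || $sIndex >= $SCount ) {
          return array( $cp ); // not a precomposed Hangul syllable
      }
      $jamo = array(
          $LBase + intdiv( $sIndex, $NCount ),           // leading consonant
          $VBase + intdiv( $sIndex % $NCount, $TCount ), // vowel
      );
      if ( $sIndex % $TCount !== 0 ) {
          $jamo[] = $TBase + $sIndex % $TCount;          // trailing consonant, if any
      }
      return $jamo;
  }

  // e.g. U+D55C decomposes to array( 0x1112, 0x1161, 0x11AB ), which then has
  // to be recomposed back to U+D55C during the composition phase.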
-- brion vibber (brion @ pobox.com)
On Nov 11, 2004, at 12:49 AM, Brion Vibber wrote:
> The full normalization check of Korean text is the worst case for my code -- every syllable gets decomposed into its constituent jamo and reassembled, and that can add about a second to the save/preview time for a 30k article on my (otherwise unloaded) 2GHz Athlon.
While I would like to improve this further, I've now got a PHP extension wrapping the ICU library's normalization function working, passing all my test cases, and most importantly *not* leaking memory.
The extension is much, _much_ faster than the PHP-based looping on worst cases, and moderately faster on best cases (Roman text that's mostly ASCII and contains no dubious characters).
I've also fixed a memory leak in the wikidiff extension.
-- brion vibber (brion @ pobox.com)