One of the fun new things in MediaWiki 1.4 is validation and normalization of UTF-8 text input. The wiki will strip out malformed and illegal UTF-8 sequences, and normalize combining character sequences to help avoid almost-but-not-quite-equal oddities. (See http://www.unicode.org/reports/tr15/ for background.)
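For anyone who hasn't bumped into this before, here's roughly what both halves look like in practice. This is just an illustration using PHP's intl Normalizer and mbstring rather than the wiki's own UtfNormal code, but it's the same idea:

  // "é" as one precomposed code point vs. "e" plus a combining acute accent:
  $precomposed = "\u{00E9}";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
  $decomposed  = "e\u{0301}";  // U+0065 + U+0301 COMBINING ACUTE ACCENT

  var_dump( $precomposed === $decomposed );                          // bool(false) -- different bytes
  var_dump( $precomposed === Normalizer::normalize( $decomposed ) ); // bool(true)  -- NFC makes them match

  // Malformed input: a bare 0xC3 lead byte with no continuation byte is not
  // valid UTF-8; that's the kind of thing the cleanup strips rather than stores.
  var_dump( mb_check_encoding( "caf\xC3", 'UTF-8' ) );               // bool(false)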
I figured it would be wise to do some spot-checks of the existing databases to see just how much trouble we're in already... I checked the October 30 'cur' table dumps for the Russian, Portuguese, and Korean Wikipedias.
The normalization routine UtfNormal::cleanUp() does a quick first pass to strip malformed UTF-8 byte sequences (this pass is extra-optimized for predominantly Latin or pure-ASCII text); if any characters turn up during that pass that might indicate a non-normalized string, a slower, full normalization pass is conducted.
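Roughly, the shape of it is something like this (hypothetical helper names, not the actual UtfNormal internals; the intl Normalizer stands in for the slow pass here):

  function cleanUpSketch( $s ) {
      // Fast path: pure ASCII can't be malformed or non-normalized.
      if ( preg_match( '/^[\x00-\x7F]*$/', $s ) ) {
          return $s;
      }

      // Quick pass: walk the bytes, dropping illegal sequences, and note whether
      // anything turned up (combining marks, Hangul jamo, compatibility
      // ideographs...) that could make the string non-normal.
      list( $s, $maybeNonNormal ) = quickStripMalformed( $s ); // hypothetical helper

      // Slow pass, only when needed: full NFC normalization.
      if ( $maybeNonNormal ) {
          $s = Normalizer::normalize( $s, Normalizer::FORM_C );
      }
      return $s;
  }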
Portuguese (40422 pages):
  text requiring slow check:       193 (0.5%)
  non-normal or invalid text:       13 (0.0%)
  non-normal or invalid title:       7 (0.0%)
  non-normal or invalid comment:     1 (0.0%)
(All of the broken titles and the comment are illegal 8-bit Latin-1 names on image pages, from an upload bot.)
Russian (14733 pages):
  text requiring slow check:       571 (3.9%)
  non-normal or invalid text:       18 (0.1%)
  non-normal or invalid comment:     3 (0.0%)
(A lot of these are Greek text fragments with non-normalized accent characters.)
Korean (7998 pages):
  text requiring slow check:       780 (9.6%)
  non-normal or invalid text:      745 (9.3%)
(Most of these are Han characters which appear in a special 'compatibility' duplicate encoding area and are normalized to the standard unified Han encoding of the same character. Many more are the Greek (!?) middle-dot character being replaced by the Latin one, which is the preferred encoding; this seems to get used in the formatting of lists.)
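For the curious, those two cases look like this under NFC -- shown here with the intl Normalizer purely for illustration:

  // CJK COMPATIBILITY IDEOGRAPH-F900 canonically maps to the unified ideograph
  // U+8C48, so NFC replaces it with the standard form:
  echo bin2hex( Normalizer::normalize( "\u{F900}" ) );  // "e8b188" == UTF-8 for U+8C48

  // GREEK ANO TELEIA (U+0387) canonically maps to the plain MIDDLE DOT (U+00B7):
  echo bin2hex( Normalizer::normalize( "\u{0387}" ) );  // "c2b7" == UTF-8 for U+00B7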
The full normalization check of Korean text is the worst case for my code -- every syllable gets decomposed into its constituent jamo and reassembled, and that can add about a second to the save/preview time for a 30k article on my (otherwise unloaded) 2GHz Athlon. Not too awful, all things considered (most articles are much shorter than 30k, and a hit on less than 10% of Korean-language edits shouldn't be a huge burden overall), but it should be possible to do much better by running the slow pass only on substrings around the 'maybe' points.
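To see why Hangul is the expensive case: precomposed syllables decompose algorithmically into two or three jamo per character, and NFC then has to recompose them all. A sketch of the standard decomposition arithmetic (straight from the Unicode standard, not the actual UtfNormal code):

  function decomposeHangul( $cp ) {
      $SBase = 0xAC00; $LBase = 0x1100; $VBase = 0x1161; $TBase = 0x11A7;
      $VCount = 21; $TCount = 28;
      $NCount = $VCount * $TCount; // 588
      $SCount = 11172;

      $sIndex = $cp - $SBase;
      if ( $sIndex < 0 || $sIndex >= $SCount ) {
          return array( $cp ); // not a precomposed Hangul syllable
      }
      $jamo = array(
          $LBase + intdiv( $sIndex, $NCount ),           // leading consonant
          $VBase + intdiv( $sIndex % $NCount, $TCount ), // vowel
      );
      if ( $sIndex % $TCount !== 0 ) {
          $jamo[] = $TBase + $sIndex % $TCount;          // trailing consonant, if any
      }
      return $jamo;
  }

  // e.g. U+D55C decomposes to array( 0x1112, 0x1161, 0x11AB ), which then has
  // to be recomposed back to U+D55C during the composition phase.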
-- brion vibber (brion @ pobox.com)
On Nov 11, 2004, at 12:49 AM, Brion Vibber wrote:
> The full normalization check of Korean text is the worst case for my code -- every syllable gets decomposed into its constituent jamo and reassembled, and that can add about a second to the save/preview time for a 30k article on my (otherwise unloaded) 2GHz Athlon.
While I would like to improve this further, I've now got a PHP extension wrapping the ICU library's normalization function working, passing all my test cases, and most importantly *not* leaking memory.
The extension is much, _much_ faster than the PHP-based looping on worst cases, and moderately faster on best cases (Roman text that's mostly ASCII and contains no dubious characters).
I've also fixed a memory leak in the wikidiff extension.
-- brion vibber (brion @ pobox.com)