[Mediawiki-l] Web page source - "strange" characters

Raymond Wan r.wan at aist.go.jp
Fri Mar 26 01:45:21 UTC 2010


Hi all,


Platonides wrote:
> We have 7 codepoints, one per "letter". Note that this is independent of
> the encoding.
> If you are wondering why UTF-8 uses 8 bytes instead of 14 (as UTF-16
> would have used), that's the beauty of UTF-8. It uses only one byte
> (like ASCII) for basic letters; two for text with diacritics, Greek,
> Hebrew, ..., which are generally used less frequently; three bytes for
> much less frequent characters (like €); and four for really odd ones,
> like Egyptian hieroglyphs.
> So it is quite compact, while still allowing the full Unicode range.
> There are other representations, like UCS-4, that are easier to
> understand (four bytes per character) but terribly inefficient.


I'm not a UTF expert, but a minor point is that the East Asian scripts (Japanese and Chinese) fall 
into the "three byte" region -- as far as I know, essentially all of their characters need three 
bytes in UTF-8.  On the other hand, the older non-Unicode encodings (Shift-JIS, EUC-JP, GB*, 
ISO-2022) use two bytes per character.  So, by switching to UTF-8, the text grows by roughly 50%.
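
As a rough check of that 50% figure, here is a small Python 3 sketch comparing a short Japanese 
string (just an arbitrary example I typed, not from any real page) in a few encodings:

#!/usr/bin/env python3
# Size of the same short Japanese string under several encodings.
text = "こんにちは"   # 5 hiragana characters

for enc in ("shift_jis", "euc-jp", "utf-8", "utf-16-be"):
    size = len(text.encode(enc))
    print("{0:10s}: {1:2d} bytes ({2:.1f} bytes/char)".format(enc, size, size / float(len(text))))

For those five characters it comes out to 10 bytes in Shift-JIS or EUC-JP versus 15 bytes in UTF-8, 
i.e. the 3-versus-2 bytes-per-character difference mentioned above.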

I can't speak for both countries -- only for the very small part I'm aware of -- but many e-mail 
programs and web pages here still seem to use the two-byte encodings (which generally include ASCII 
as a single-byte subset).  My feeling is that UTF-8 isn't catching on very fast here, but (1) I 
don't know whether that's true in other countries, and (2) I don't know whether this 50% increase 
in size is really the show-stopper...

Ray
(Someone feel free to correct me if I'm wrong...)



