[Mediawiki-l] Web page source - "strange" characters
Raymond Wan
r.wan at aist.go.jp
Fri Mar 26 01:45:21 UTC 2010
Hi all,
Platonides wrote:
> We have 7 codepoints, one per "letter". Note that this is independent of
> the encoding.
> If you are wondering why utf-8 uses 8 bytes instead of 14 (as would have
> been used by utf-16), that's the beauty of utf-8. It will only use one
> byte (like ASCII) for basic letters, it will use two for a text with
> diacritics, Greek, Hebrew..., which are generally used less frequently,
> three bytes for characters much much less frequent (like €), and four
> for really odd ones, like Egyptian Hieroglyphics.
> So it is quite compact, while still allowing the full Unicode.
> There are other representations like UCS-4 easier to understand (four
> bytes per character) but terribly inefficient.
I'm not a UTF expert, but a minor point is that the East Asian scripts (Japanese and
Chinese) fall into that "three byte" range, as far as I know. On the other hand, the older
non-Unicode encodings (Shift-JIS, EUC-JP, GB*, ISO-2022) use exactly two bytes per character.
So, by using UTF-8, the text grows by 50%.
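That 50% figure can be verified with a short sketch, using a sample Japanese string (any CJK text behaves the same way):

```python
# Encoded sizes of a three-character Japanese string under UTF-8
# versus two legacy two-byte encodings.
text = "\u65e5\u672c\u8a9e"  # "Japanese (language)", three CJK characters

print(len(text.encode("utf-8")))      # 9 bytes: 3 bytes per character
print(len(text.encode("shift_jis")))  # 6 bytes: 2 bytes per character
print(len(text.encode("euc-jp")))     # 6 bytes: 2 bytes per character
```

9 bytes versus 6 is exactly the 50% increase mentioned above.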
I can't speak for both countries -- only for the very small part I'm aware of -- but many e-mail
programs and web pages here still seem to use the two-byte encodings (which generally include
ASCII as a subset). I feel that UTF-8 isn't catching on very fast here, but (1) I don't know
whether that's true in other countries, and (2) I don't know whether this 50% increase in size
is the show-stopper...
Ray
(Someone feel free to correct me if I'm wrong...)