Well, thanks for clearing *that* up. -- %)
-S-
"Unicode" is a _character set_, which maps abstract numerical code
points to characters. Unicode code points (and hence characters) may be
represented in a number of ways.
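
To make the code point / representation split concrete, here is a
minimal Python sketch. The sample character and the three encodings
shown are my own picks for illustration, not anything specific to the
wiki software.

```python
import unicodedata

ch = "\u00e9"                 # the character é
code_point = ord(ch)          # character -> code point (a plain integer)

# Prints the conventional U+XXXX notation and the official character name.
print(f"U+{code_point:04X}", unicodedata.name(ch))

# chr() goes the other way: code point -> character.
round_trip = chr(code_point)

# One code point, several byte-level representations.
representations = {
    enc: ch.encode(enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")
}
for enc, raw in representations.items():
    print(enc, raw.hex())
```

The code point stays the same throughout; only the bytes used to store
it change with the encoding.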

"UTF-8" is a _character encoding_, which maps Unicode code points to
variable-length sequences of bytes. UTF-8's primary feature is that it
is compatible with ASCII, which has made it popular in Unix and internet
contexts as a more or less backwards-compatible way of storing Unicode
text.
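
A rough illustration of the variable-length and ASCII-compatible parts,
again in Python. The sample characters are arbitrary choices of mine.

```python
# Four characters from increasingly high code point ranges:
# A (U+0041), é (U+00E9), € (U+20AC), and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each one occupies in UTF-8.
lengths = [len(s.encode("utf-8")) for s in samples]
print(lengths)

# Pure ASCII text produces exactly the same bytes in UTF-8 as in ASCII,
# which is the backwards compatibility mentioned above.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)
```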

"UTF-16" is another character encoding, which maps Unicode code points
to 16-bit integers. (Or, sometimes, to two 16-bit integers.) For
historical reasons and/or stupidity ;) UTF-16 (or its evil elder sister
UCS-2) may get called "Unicode" by some software. If you select
so-called "Unicode" encoding for a page that's encoded in UTF-8, you'll
probably corrupt the display.
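
Both points can be sketched in a few lines of Python. The helper name
utf16_units is hypothetical, and I'm assuming the software's "Unicode"
option means little-endian UTF-16; the real thing may differ.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (illustrative helper)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

# A character in the Basic Multilingual Plane needs one 16-bit unit...
print([hex(u) for u in utf16_units("A")])
# ...while one outside it needs two (a surrogate pair).
print([hex(u) for u in utf16_units("\U0001F600")])

# What goes wrong when UTF-8 bytes are read as if they were UTF-16:
original = "h\u00e9llo"
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("utf-16-le", errors="replace")
print(repr(garbled))
```

The last line prints unrelated characters rather than the original
text, which is the kind of corrupted display described above.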

There are also many domain-specific ways of encoding Unicode characters;
in HTML and XML (and SGML, if the document character set is defined as
Unicode) you can use sequences such as &#12345; (decimal) or &#x1234;
(hexadecimal). Because these only use ASCII characters to do their dirty
work, they're robust through other character encoding conversions and
can be typed in any text editor (if you know the numbers). However, they
are specific to that type of markup language, take up more space than
binary encodings, and don't necessarily survive forms well if let
through unencoded.
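
For the curious, numeric character references are easy to play with in
Python. This is only a sketch using the standard library; the sample
text is made up, and nothing here is claimed to be how the wiki handles
them.

```python
import html

# Two non-ASCII characters: U+3039 (decimal 12345) and U+1234.
text = "\u3039 \u1234"

# Encode to pure ASCII, replacing anything else with decimal references.
refs = text.encode("ascii", errors="xmlcharrefreplace")
print(refs)

# Decimal and hexadecimal references both decode back to characters.
from_decimal = html.unescape("&#12345;")
from_hex = html.unescape("&#x1234;")

# The ASCII form survives a trip through an unrelated 8-bit encoding...
survived = html.unescape(refs.decode("latin-1")) == text

# ...but it costs more bytes than a binary encoding such as UTF-8.
size_refs = len(refs)
size_utf8 = len(text.encode("utf-8"))
print(size_refs, size_utf8)
```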
-- brion vibber (brion @ pobox.com)
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)wikipedia.org