Well, thanks for clearing *that* up. -- %)
-S-
"Unicode" is a _character set_, which maps abstract numerical code points to characters. Unicode code points (and hence characters) may be represented in a number of ways.
"UTF-8" is a _character encoding_, which maps Unicode code points to variable-length sequences of bytes. UTF-8's primary feature is that it is compatible with ASCII, which has made it popular in Unix and internet contexts as a more or less backwards-compatible way of storing Unicode text.
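To make the "variable-length" and "compatible with ASCII" bits concrete, here's a rough sketch in Python 3. The sample characters and the helper name utf8_lengths are just picked for illustration; they aren't from any particular piece of software.

```python
# Sample characters chosen for illustration: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_lengths(chars):
    """Return (code point, number of UTF-8 bytes) for each character."""
    return [(ord(c), len(c.encode("utf-8"))) for c in chars]

for code_point, length in utf8_lengths(samples):
    encoded = chr(code_point).encode("utf-8")
    print(f"U+{code_point:04X} -> {length} byte(s): {encoded.hex()}")

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
assert "hello".encode("ascii") == "hello".encode("utf-8")
```

The higher the code point, the more bytes it takes, while anything in the ASCII range stays a single byte with the same value it always had.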
"UTF-16" is another character encoding, which maps each Unicode code point to one 16-bit integer. (Or, for code points above U+FFFF, to a pair of 16-bit integers called a surrogate pair.) For historical reasons and/or stupidity ;) UTF-16 (or its evil elder sister UCS-2) may get called "Unicode" by some software. If you select so-called "Unicode" encoding for a page that's encoded in UTF-8, you'll probably corrupt the display.
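A quick sketch of the one-or-two-integers business, again in Python 3. The helper utf16_units is a made-up name for this example, and big-endian byte order is assumed just to keep the output readable.

```python
def utf16_units(char):
    """Return the 16-bit code units of the UTF-16 (big-endian) encoding of char."""
    data = char.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+0041 fits in a single 16-bit unit.
print([hex(u) for u in utf16_units("A")])

# U+1F600 is above U+FFFF, so it needs two units.
print([hex(u) for u in utf16_units("\U0001F600")])

# Reading UTF-8 bytes as if they were UTF-16 does not give the original text back.
original = "h\u00e9llo!"
garbled = original.encode("utf-8").decode("utf-16-le", errors="replace")
assert garbled != original
```

The last few lines are the "select the wrong encoding" case: the bytes are fine, but pairing them up as 16-bit integers turns them into unrelated characters.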
There are also many domain-specific ways of encoding Unicode characters; in HTML and XML (and SGML, if the document character set is defined as Unicode) you can use sequences such as &#12345; (decimal) or &#x1234; (hexadecimal). Because these only use ASCII characters to do their dirty work, they're robust through other character encoding conversions and can be typed in any text editor (if you know the numbers). However, they are specific to that type of markup language, take up more space than binary encodings, and don't necessarily survive forms well if let through unencoded.
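For what it's worth, producing and reading those numeric references is easy to sketch in Python 3. The function to_ncr is a hypothetical helper written for this example; html.unescape and the "xmlcharrefreplace" error handler are from the standard library. Note that it does not escape any "&" already present in the text, so it's an illustration rather than a complete escaper.

```python
import html

def to_ncr(text, hexadecimal=False):
    """Replace each non-ASCII character with a numeric character reference."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                    # ASCII passes through untouched
        elif hexadecimal:
            out.append(f"&#x{ord(ch):X};")    # hexadecimal form
        else:
            out.append(f"&#{ord(ch)};")       # decimal form
    return "".join(out)

decimal_ref = to_ncr("\u3039")                # U+3039 is 12345 in decimal
hex_ref = to_ncr("\u1234", hexadecimal=True)
print(decimal_ref, hex_ref)

# The result is pure ASCII, and decoding it gives the original character back.
assert decimal_ref.isascii()
assert html.unescape(decimal_ref) == "\u3039"

# The standard library can produce the decimal form by itself.
assert "\u3039".encode("ascii", "xmlcharrefreplace") == decimal_ref.encode("ascii")
```

Seven or eight ASCII bytes per character is the "takes up more space" trade-off mentioned above.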
-- brion vibber (brion @ pobox.com)
Wikitech-l mailing list Wikitech-l@wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l