Re: [Wikitech-l] Arabikipedia

10 Jul 2003

Well, thanks for clearing *that* up. -- %)
-S-
...
"Unicode" is a _character set_, which maps abstract numerical code
points to characters. Unicode code points (and hence characters) may be
represented in a number of ways.

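As a rough illustration of the code point idea, here is a minimal Python sketch. The sample string and the variable names are made up for this example and are not from the original message; it only shows that each character corresponds to one abstract number, regardless of how it is later stored.

```python
# Sample text: Latin "A", "e" with acute accent, Arabic letter ain, euro sign.
# (Illustrative choice; any characters would do.)
text = "A\u00e9\u0639\u20ac"

# ord() gives the abstract code point of a character; no bytes are involved yet.
code_points = [ord(ch) for ch in text]

# Conventional "U+XXXX" notation for code points.
labels = [f"U+{cp:04X}" for cp in code_points]
print(labels)

# chr() maps a code point back to its character.
round_trip = "".join(chr(cp) for cp in code_points)
```

How those numbers become bytes on disk or on the wire is a separate question, which is where the encodings come in.
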
"UTF-8" is a _character encoding_, which maps Unicode code points to
variable-length sequences of bytes. UTF-8's primary feature is that it
is compatible with ASCII, which has made it popular in Unix and internet
contexts as a more or less backwards-compatible way of storing Unicode
text.

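A small Python sketch of the two UTF-8 properties mentioned above (variable length and ASCII compatibility). The sample characters and names are assumptions chosen for illustration.

```python
# One character from each UTF-8 length class (1 to 4 bytes).
samples = ["A", "\u00e9", "\u0639", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

# Number of bytes each sample needs in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "Wikitech-l"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
```

This is why an ASCII-only file is already valid UTF-8, while anything outside ASCII costs two or more bytes per character.
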
"UTF-16" is another character encoding, which maps Unicode code points
to 16-bit integers. (Or, for code points above U+FFFF, to a pair of
16-bit integers.) For historical reasons and/or stupidity ;) UTF-16 (or
its evil elder sister UCS-2) may get called "Unicode" by some software.
If you select so-called "Unicode" encoding for a page that's encoded in
UTF-8, you'll probably corrupt the display.

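To make the UTF-16 points concrete, here is a hedged Python sketch. The sample text is an assumption, and decoding as little-endian UTF-16 stands in for whatever a given program means by "Unicode"; real software may pick a different byte order or fail outright instead of showing garbage.

```python
# A character inside the 16-bit range and one above U+FFFF.
bmp_char = "\u0639"         # Arabic letter ain
astral_char = "\U0001F600"  # a code point above U+FFFF

# Count 16-bit units: each unit is 2 bytes in UTF-16.
bmp_units = len(bmp_char.encode("utf-16-be")) // 2
astral_units = len(astral_char.encode("utf-16-be")) // 2

# A "page" stored as UTF-8 (four Arabic letters, two bytes each).
original = "\u0639\u0631\u0628\u064a"
page_bytes = original.encode("utf-8")

# Misreading those bytes as UTF-16 pairs them up into unrelated characters.
garbled = page_bytes.decode("utf-16-le")
print(repr(garbled))
```

The decode succeeds here without raising an error, which is the nasty part: the reader just sees the wrong characters.
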
There are also many domain-specific ways of encoding Unicode characters;
in HTML and XML (and SGML, if the document character set is defined as
Unicode) you can use sequences such as &#12345; (decimal) or &#x1234;
(hexadecimal). Because these only use ASCII characters to do their dirty
work, they're robust through other character encoding conversions and
can be typed in any text editor (if you know the numbers). However, they
are specific to that type of markup language, take up more space than
binary encodings, and don't necessarily survive forms well if let
through unencoded.

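A short Python sketch of numeric character references, using only the standard library. The variable names are hypothetical; `html.unescape` and the `xmlcharrefreplace` error handler are standard Python features, used here just to show the round trip and the size cost.

```python
import html

# Decimal and hexadecimal references name a code point using ASCII only.
decimal_ref = "&#12345;"
hex_ref = "&#x1234;"

decoded_decimal = html.unescape(decimal_ref)  # character with code point 12345
decoded_hex = html.unescape(hex_ref)          # character with code point 0x1234

# Going the other way: replace anything non-ASCII with a decimal reference.
ain = "\u0639"
as_reference = ain.encode("ascii", errors="xmlcharrefreplace")
print(as_reference)

# Size comparison: the reference is pure ASCII but longer than the UTF-8 form.
reference_size = len(as_reference)
utf8_size = len(ain.encode("utf-8"))
```

The reference survives any ASCII-compatible conversion, at the price of several bytes per character and of only meaning something to an HTML or XML parser.
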
-- brion vibber (brion @ pobox.com)

Wikitech-l mailing list
Wikitech-l@wikipedia.org
http://mail.wikipedia.org/mailman/listinfo/wikitech-l
