Re: [Wikitech-l] Arabikipedia

9 Jul 2003

steve vertigo wrote:

...
 correction -- the utf-8 seems to be only one that
works... I thought the bottom text on the frontpage
was unicode... 
Terminology note:

"Unicode" is a _character set_, which maps abstract numerical code 
points to characters. Unicode code points (and hence characters) may be 
represented in a number of ways.
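
To make the code point idea concrete, here is a small Python sketch. The sample characters are my own picks for illustration, not anything from the wiki in question; ord(), chr() and unicodedata.name() are standard Python.

```python
import unicodedata

# A code point is just a number assigned to an abstract character.
latin_a = "A"
arabic_alef = "\u0627"  # illustrative choice: ARABIC LETTER ALEF

code_points = [ord(latin_a), ord(arabic_alef)]
names = [unicodedata.name(latin_a), unicodedata.name(arabic_alef)]

# Going the other way: from number back to character.
round_trip = chr(0x0627)

for cp, name in zip(code_points, names):
    print(f"U+{cp:04X} {name}")
```

Nothing here says anything about bytes yet; that is the job of an encoding.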

"UTF-8" is a _character encoding_, which maps Unicode code points to 
variable-length sequences of bytes. UTF-8's primary feature is that it 
is compatible with ASCII, which has made it popular in Unix and internet 
contexts as a more or less backwards-compatible way of storing Unicode text.
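
A rough Python illustration of the variable-length and ASCII-compatible parts, using sample characters I picked myself:

```python
# Each sample is one code point; UTF-8 spends 1 to 4 bytes on it.
samples = {
    "A": "A",                 # U+0041, plain ASCII
    "alef": "\u0627",         # U+0627, Arabic
    "euro": "\u20ac",         # U+20AC
    "emoji": "\U0001F600",    # U+1F600, outside the 16-bit range
}

utf8_bytes = {name: ch.encode("utf-8") for name, ch in samples.items()}
utf8_lengths = {name: len(b) for name, b in utf8_bytes.items()}

# ASCII compatibility: pure-ASCII text has the same bytes either way.
ascii_text = "hello"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

for name, b in utf8_bytes.items():
    print(name, b.hex(" "), f"({len(b)} bytes)")
```

So an old ASCII-only tool sees ASCII text unchanged, and anything outside ASCII shows up as bytes with the high bit set.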

"UTF-16" is another character encoding, which maps Unicode code points 
to 16-bit integers. (Or, for code points above U+FFFF, to a pair of 
16-bit integers.) For 
historical reasons and/or stupidity ;) UTF-16 (or its evil elder sister 
UCS-2) may get called "Unicode" by some software. If you select 
so-called "Unicode" encoding for a page that's encoded in UTF-8, you'll 
probably corrupt the display.
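
As a sketch of both points (one or two 16-bit units, and what happens when UTF-8 bytes are read as UTF-16), here is some Python. The sample word and the choice of little-endian byte order for the wrong decode are my assumptions; a real browser may guess differently.

```python
# One 16-bit unit for a character in the 16-bit range...
alef_utf16 = "\u0627".encode("utf-16-be")

# ...and two units (a surrogate pair) for one above U+FFFF.
emoji_utf16 = "\U0001F600".encode("utf-16-be")

# A page stored as UTF-8 (sample Arabic word, chosen for illustration).
text = "\u0645\u0631\u062d\u0628\u0627"
page_bytes = text.encode("utf-8")

# Reading those bytes as "Unicode" (here: UTF-16, little-endian) pairs
# them up into unrelated 16-bit values, so the result is garbage.
garbled = page_bytes.decode("utf-16-le", errors="replace")

# Reading them as what they actually are gives the text back.
correct = page_bytes.decode("utf-8")

print(alef_utf16.hex(" "), "|", emoji_utf16.hex(" "))
print(repr(garbled))
```

The bytes on disk never changed; only the label used to interpret them did.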

There are also many domain-specific ways of encoding Unicode characters; 
in HTML and XML (and SGML, if the document character set is defined as 
Unicode) you can use sequences such as &#12345; (decimal) or &#x1234; 
(hexadecimal). Because these only use ASCII characters to do their dirty 
work, they're robust through other character encoding conversions and 
can be typed in any text editor (if you know the numbers). However, they 
are specific to that type of markup language, take up more space than 
binary encodings, and don't necessarily survive forms well if let 
through unencoded.
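
For illustration, Python's standard library can go both ways with these numeric references. This is only a sketch using html.unescape() and the "xmlcharrefreplace" error handler; it says nothing about how the wiki software itself handles them.

```python
import html

# Decoding: decimal and hexadecimal references name the same kind of
# thing, a Unicode code point.
decimal_ref = html.unescape("&#12345;")
hex_ref = html.unescape("&#x1234;")

# Encoding: squeeze non-ASCII text through an ASCII-only channel by
# replacing each such character with a decimal reference.
alef = "\u0627"
as_reference = alef.encode("ascii", "xmlcharrefreplace")

# The space cost mentioned above: 7 ASCII bytes against 2 in UTF-8.
reference_size = len(as_reference)
utf8_size = len(alef.encode("utf-8"))

print(as_reference, reference_size, utf8_size)
```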

-- brion vibber (brion @ pobox.com)
