[Mediawiki-l] Re: Re: accents not appearing correctly

Brion Vibber brion at pobox.com
Tue Feb 7 08:42:18 UTC 2006

muyuubyou wrote:
> A couple of mistakes there.
> There is a difference between 'é' and '木' for many editors, including
> non-windows editors that default to ASCII. The character 'é' is indeed
> ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII" .

False. ASCII is 7 bits only. Anything that uses 8 bits is *not* ASCII, but some
other encoding.

Many/most 8-bit character encodings other than EBCDIC are *supersets* of ASCII,
which incorporate the 7-bit ASCII character set in the lower 128 code points and
various other characters in the high 128 code points.
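
To make that concrete, here is a minimal Python sketch. The codec names are
Python's own labels, and the particular selection of encodings is only an
illustrative assumption; it checks that bytes 0-127 decode the same way under
ASCII and under a few 8-bit supersets, while an EBCDIC code page does not:

```python
# Bytes 0-127 should decode identically under ASCII and its 8-bit supersets.
ASCII_BYTES = bytes(range(128))
ascii_text = ASCII_BYTES.decode("ascii")

# Illustrative selection of 8-bit encodings (Python codec names).
SUPERSETS = ["latin-1", "cp1252", "iso8859_7", "koi8_r"]

def lower_half_matches_ascii(codec: str) -> bool:
    """True if the codec maps bytes 0-127 to the same characters as ASCII."""
    return ASCII_BYTES.decode(codec) == ascii_text

for codec in SUPERSETS:
    print(codec, lower_half_matches_ascii(codec))

# EBCDIC (code page 037 here) is the odd one out: 'A' is not byte 0x41.
print("cp037", lower_half_matches_ascii("cp037"), "A".encode("cp037"))

# A byte above 127 is simply not ASCII, so a strict ASCII decoder rejects it.
try:
    b"\xe9".decode("ascii")
except UnicodeDecodeError as err:
    print("not ASCII:", err.reason)
```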

Many people erroneously call any mapping from a number to a character that can
fit in 8 bits an "ASCII code"; however, this is incorrect.

> Many popular editors default to 8-bit ASCII,

There's no such thing.

> and others default to 8859-1, also known as "latin 1" ;

That part is reasonably true for Windows operating systems and some older
Unix/Linux systems in North America and western Europe.

Mac OS X and most modern Linux systems default to UTF-8.

> ASCII values from 128 on,

No such thing; there are no ASCII values from 128 on. However, many 8-bit
character encodings which are supersets of ASCII contain *non*-ASCII characters
in the 128-255 range. Since these represent wildly different characters for each
such encoding (perhaps an accented Latin letter, perhaps a Greek letter, perhaps
a Thai letter, perhaps an Arabic letter...) it's unwise to simply assume that
such a byte will have any meaning in a program that doesn't know about your
favorite encoding selection.
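
As a rough illustration, the Python sketch below decodes one and the same high
byte under several 8-bit encodings. The byte 0xE9 and the list of codecs are
my own picks for the example, nothing more:

```python
import unicodedata

# One byte from the high half, chosen arbitrarily for illustration.
HIGH_BYTE = b"\xe9"

# Python codec names for Latin-1, Greek, Cyrillic, Thai and Arabic encodings.
CODECS = ["latin-1", "iso8859_7", "koi8_r", "iso8859_11", "iso8859_6"]

def interpretations(raw: bytes, codecs: list[str]) -> dict[str, str]:
    """Decode the same raw bytes under each codec."""
    return {codec: raw.decode(codec) for codec in codecs}

for codec, char in interpretations(HIGH_BYTE, CODECS).items():
    # Print the code point and its Unicode name for each interpretation.
    print(f"{codec:12} U+{ord(char):04X} {unicodedata.name(char, '?')}")
```

Every codec gives a different character for the identical byte, which is why a
program has to be told the encoding rather than guess it.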

> UTF8 is, by the way, not the best encoding for Asian text.

That depends on what you mean by "best". If by "best" you mean only "as compact
as possible for the particular data I want to use at the moment" then yes, there
are other encodings which are more compact.

If, however, compatibility is an issue, UTF-8 is extremely functional and works
very well with UNIX/C-style string handling, pathnames, and byte-oriented
communications protocols at a minor 50% increase in uncompressed size for such
text.

If space were an issue, though, you'd be using data compression.
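
The 50% figure is easy to check for a single character from the example
earlier in the thread. A minimal Python sketch (the helper name
`encoded_sizes` is made up for this message, and UTF-16 is measured without a
byte-order mark):

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of the text in UTF-8 and in UTF-16 (no byte-order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

cjk = encoded_sizes("木")      # one CJK character from the Basic Multilingual Plane
latin = encoded_sizes("tree")  # plain ASCII text for comparison

# Relative growth of UTF-8 over UTF-16 for the CJK sample.
overhead = cjk["utf-8"] / cjk["utf-16"] - 1
print(cjk, f"UTF-8 overhead: {overhead:.0%}")
print(latin)
```

For the ASCII sample the relationship flips: UTF-16 takes twice the space of
UTF-8.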

> UTF8 is meant to
> display English text effectively (1 byte)

False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of
specially meaningful bytes such as 0 and the '/' path separator in string
handling in Unix-like environments. (It was created for Bell Labs' Plan 9
operating system, an experimental successor to Unix.)
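
A small Python sketch of that property, with sample characters of my own
choosing: in UTF-8 every byte of a non-ASCII character has the high bit set,
so a stray NUL or '/' can never turn up inside one, whereas UTF-16 offers no
such guarantee.

```python
def has_ascii_range_bytes(text: str, codec: str) -> bool:
    """True if encoding the text yields any byte below 0x80."""
    return any(b < 0x80 for b in text.encode(codec))

# Non-ASCII sample characters, picked only for illustration.
sample = "木é☯"

print("utf-8    :", has_ascii_range_bytes(sample, "utf-8"))
print("utf-16-be:", has_ascii_range_bytes(sample, "utf-16-be"))

# U+262F encodes in UTF-16 as the bytes 0x26 0x2F, and 0x2F is '/'.
print("☯".encode("utf-16-be"))

# Plain ASCII text picks up NUL bytes in UTF-16, which C strings treat as end-of-string.
print("A".encode("utf-16-le"), "A".encode("utf-8"))
```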

That it happens to also be compact for English is nice, too.

> It would be very nice
> to have an UTF16 version, which would only take 2-bytes for each character
> most of the time, 33%+- better space-wise.

Much of the time, the raw amount of space taken up by text files is fairly
insignificant. Text is small compared to image and multimedia data, and it
compresses very well.
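
To get a feel for the compression point, here is a hedged Python sketch using
zlib. The sample sentence is my own, and repeating it 200 times makes it far
more compressible than real prose, so treat the output as an illustration of
the method rather than a measurement:

```python
import zlib

def compressed_sizes(text: str) -> dict[str, tuple[int, int]]:
    """Map each codec to (raw byte length, zlib-compressed byte length)."""
    result = {}
    for codec in ("utf-8", "utf-16-le"):
        raw = text.encode(codec)
        result[codec] = (len(raw), len(zlib.compress(raw)))
    return result

# Artificial sample: one Japanese sentence repeated many times (an assumption
# made for this example; real text is less repetitive).
sample = "ウィキペディアは誰でも編集できるフリー百科事典です。" * 200

for codec, (raw, packed) in compressed_sizes(sample).items():
    print(f"{codec:10} raw={raw:6d} compressed={packed:6d}")
```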

Modern memory and hard disk prices strongly favor accessibility and
compatibility in most cases over squeezing a few percentage points out of
uncompressed text size.

-- brion vibber (brion @ pobox.com)
