muyuubyou wrote:
A couple of mistakes there.
There is a difference between 'é' and '木' for many editors, including non-windows editors that default to ASCII. The character 'é' is indeed ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII".
False. ASCII is 7 bits only. Anything that's 8 bits is *not* ASCII, but some other encoding.
Many/most 8-bit character encodings other than EBCDIC are *supersets* of ASCII, which incorporate the 7-bit ASCII character set in the lower 128 code points and various other characters in the high 128 code points.
Many people erroneously call any mapping from a number to a character that can fit in 8 bits an "ASCII code"; however, this is incorrect.
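For anyone who wants to check the superset point themselves, here is a minimal sketch using Python's built-in codecs. The helper names are mine, and the three encodings listed are just examples I picked; "cp037" is one of Python's EBCDIC codecs.

```python
# Sketch: bytes 0x00..0x7F mean the same thing in ASCII-superset encodings.
ASCII_BYTES = bytes(range(128))            # every 7-bit value
SUPERSETS = ["latin-1", "cp1252", "utf-8"]  # illustrative choices only


def decodes_like_ascii(encoding: str) -> bool:
    """True if bytes 0..127 decode to the same text as under ASCII."""
    return ASCII_BYTES.decode(encoding) == ASCII_BYTES.decode("ascii")


def is_ascii_encodable(text: str) -> bool:
    """True if every character in text exists in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name in SUPERSETS:
        print(name, decodes_like_ascii(name))
    # EBCDIC is the odd one out: its low 128 values are not ASCII.
    print("cp037 (EBCDIC)", decodes_like_ascii("cp037"))
    # 'é' is simply not in ASCII, whatever an editor's menu calls it.
    print("é encodable as ASCII?", is_ascii_encodable("é"))
```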
Many popular editors default to 8-bit ASCII,
There's no such thing.
and others default to 8859-1, also known as "latin 1";
That part is reasonably true for Windows operating systems and some older Unix/Linux systems in North America and western Europe.
Mac OS X and most modern Linux systems default to UTF-8.
ASCII values from 128 on,
No such thing; there are no ASCII values from 128 on. However, many 8-bit character encodings which are supersets of ASCII contain *non*-ASCII characters in the 128-255 range. Since these byte values represent wildly different characters in each such encoding (perhaps an accented Latin letter, perhaps a Greek letter, perhaps a Thai letter, perhaps an Arabic letter...), it's unwise to simply assume that they will have any meaning in a program that doesn't know about your favorite encoding selection.
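To make that concrete, here is a small sketch that decodes one and the same high byte under a few ASCII-superset encodings. The byte 0xE9 and the three codecs are arbitrary picks of mine, and the function names are hypothetical.

```python
# Sketch: one byte above 0x7F, several different meanings.
HIGH_BYTE = bytes([0xE9])

# Python codec names for Latin-1, ISO 8859-5 (Cyrillic), ISO 8859-7 (Greek).
ENCODINGS = ["latin-1", "iso8859_5", "iso8859_7"]


def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under each encoding in ENCODINGS."""
    return {name: raw.decode(name) for name in ENCODINGS}


def decodes_as_ascii(raw: bytes) -> bool:
    """True only if every byte is a valid 7-bit ASCII value."""
    try:
        raw.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for name, char in interpretations(HIGH_BYTE).items():
        print(f"{name}: {char!r}")
    print("valid ASCII?", decodes_as_ascii(HIGH_BYTE))
```

Each codec gives a different character for the same byte, and a strict ASCII decoder rejects it outright.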
UTF8 is, by the way, not the best encoding for Asian text.
That depends on what you mean by "best". If by "best" you mean only "as compact as possible for the particular data I want to use at the moment" then yes, there are other encodings which are more compact.
If, however, compatibility is an issue, UTF-8 is extremely functional and works very well with UNIX/C-style string handling, pathnames, and byte-oriented communications protocols, at the cost of a minor 50% increase in uncompressed size for such languages (typically 3 bytes per character rather than 2).
If space were an issue, though, you'd be using data compression.
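The size difference is easy to measure. A minimal sketch follows; the sample strings are my own, and I use "utf-16-le" so that no byte-order mark is counted.

```python
# Sketch: compare encoded sizes of the same text in UTF-8 and UTF-16.
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    for sample in ("日本語", "hello"):
        utf8_len, utf16_len = encoded_sizes(sample)
        print(f"{sample!r}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

For the three-character CJK sample this gives 9 bytes against 6, which is where the 50% figure comes from; for the English sample the ratio flips, 5 bytes against 10.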
UTF8 is meant to display English text effectively (1 byte)
False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of specially meaningful bytes such as 0 and the '/' path separator in string handling in Unix-like environments. (It was created for Bell Labs' Plan 9 operating system, an experimental successor to Unix.)
That it happens to also be compact for English is nice, too.
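Here is a rough sketch of why that matters for byte-oriented code. It checks whether an encoded string contains a 0x00 or '/' byte that did not come from an actual NUL or '/' character. The function name and the sample strings are hypothetical, chosen only for illustration.

```python
# Sketch: UTF-8 never produces 0x00 or '/' bytes inside multi-byte characters;
# UTF-16 does, which confuses C-style string and pathname handling.
def has_stray_special_bytes(text: str, encoding: str) -> bool:
    """True if encoding text yields more 0x00 or '/' bytes than the text
    has NUL or '/' characters."""
    encoded = text.encode(encoding)
    expected = text.count("\x00") + text.count("/")
    actual = encoded.count(b"\x00") + encoded.count(b"/")
    return actual > expected


if __name__ == "__main__":
    # "木.txt" is a plausible filename; U+2F00 is a character whose
    # UTF-16 code unit happens to contain both a 0x00 and a 0x2F byte.
    for sample in ("木.txt", "\u2f00"):
        for encoding in ("utf-8", "utf-16-le"):
            print(repr(sample), encoding, has_stray_special_bytes(sample, encoding))
```

In UTF-8 every byte of a multi-byte character has its high bit set, so the check comes back False for both samples; in UTF-16 it comes back True for both.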
It would be very nice to have an UTF16 version, which would only take 2-bytes for each character most of the time, 33%+- better space-wise.
Much of the time, the raw amount of space taken up by text files is fairly insignificant. Text is small compared to image and multimedia data, and it compresses very well.
Modern memory and hard disk prices strongly favor accessibility and compatibility in most cases over squeezing a few percentage points out of uncompressed text size.
-- brion vibber (brion @ pobox.com)