"Extended ASCII" is "accepted" and thus exists, regardless it came from the ASCII board or not. The fact that bytes are 8-bits and almost everything in computers is in bytes or multiples thereof has created this nightmare of 8-bit encodings we're still suffering today. IBM's first extension is what many people call "extended ASCII" we like it or not, and that is what I was talking about. Namely the DOS representation of the higher 128 codes. It came with "IBM PC".
"There is no such thing" can only be true if you ignore the gazillion lines of legacy code that think otherwise.
I agree with you that it's unwise to assume programs will map your non-ASCII right, but since many do, it's a common thing. 99% (in amount) of the things stored in 1 byte per character are Latin-1. The other "important" languages are impossible to represent in 1 byte anyway, except for Arabic and Hebrew, but those are usually isolated from our "computer isle" in the West. For instance, 90% of the "interweb" that isn't Chinese or Japanese belongs to a language Latin-1 covers.
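Just to illustrate the mess (a quick Python sketch of my own, nothing to do with MediaWiki itself): the same high byte is a different character depending on which 8-bit "extended ASCII" you assume, while the low 128 codes agree everywhere.

    # The high half (128-255) means something different in each 8-bit encoding.
    b = bytes([0xE9])
    print(b.decode("cp437"))    # 'Θ' in the old IBM PC / DOS code page 437
    print(b.decode("latin-1"))  # 'é' in ISO-8859-1 ("latin 1")
    print(b.decode("cp1251"))   # 'й' in the Cyrillic Windows code page
    # The low half (0-127) is plain 7-bit ASCII in all of them:
    print(b"hello".decode("cp437") == b"hello".decode("latin-1") == "hello")  # True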
False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of specially meaningful bytes such as 0 and the '/' path separator in string handling in Unix-like environments. (It was created for Bell Labs' Plan 9 operating system, an experimental successor to Unix.)
That it happens to also be compact for English is nice, too.
In the real world, the fact that it "happens to be compact for English" is crucial. Unix was developed mainly "in English", and therefore encoding the English language plus some extra codes was all that fell into consideration; they simply didn't need to comment code in Japanese. What you say is factually true, but I was just pointing out the most important reason with regard to the topic at hand. By saying "UTF8 is meant to display English correctly" I didn't imply it isn't meant to do anything else, or that that was the basis of it. I could have said "UTF8 is meant to encode English correctly and effectively, among other things", but I just didn't want to shift the focus.
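A small Python sketch (mine, just to show the point) of that ASCII compatibility: ASCII text is byte-for-byte the same in UTF-8, and every byte of a multi-byte character is >= 0x80, so NUL and '/' never show up inside one.

    # ASCII stays ASCII, one byte per character:
    assert "plain ASCII /path".encode("utf-8") == b"plain ASCII /path"
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so 0x00 and 0x2F ('/')
    # can never appear inside a character and confuse C string/path handling:
    assert all(byte >= 0x80 for byte in "é木".encode("utf-8"))
    print("é木".encode("utf-8"))  # b'\xc3\xa9\xe6\x9c\xa8' -- 2 + 3 bytes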
and others default to 8859-1, also known as "latin 1" ;
That part is reasonably true for Windows operating systems and some older Unix/Linux systems in North America and western Europe.
Mac OS X and most modern Linux systems default to UTF-8.
Yeah, but his editor of choice most probably isn't, or there wouldn't be a problem in the first place. Let's keep the focus.
It's a very common scenario, for minor changes, that people connect via telnet or SSH and quickly edit something directly on their test server, instead of editing locally and then uploading via FTP (or using some FTP-capable editor like gvim with the FTP plugin, for instance). It's also very common that consoles are set to ISO-8859-1, and thus vi, pico or nano will use that. It can also happen that it's a shared environment and the user just can't install stuff... also, many telnet/SSH clients are not UTF8 compatible, or he may have any sort of configuration problem I can't even imagine now. Shit happens.
UTF8 is, by the way, not the best encoding for Asian text.
That depends on what you mean by "best". If by "best" you mean only "as compact as possible for the particular data I want to use at the moment" then yes, there are other encodings which are more compact.
If, however, compatibility is an issue, UTF-8 is extremely functional and works very well with UNIX/C-style string handling, pathnames, and byte-oriented communications protocols at a minor 50% increase in uncompressed size for such languages.
If space were an issue, though, you'd be using data compression.
Compatibility is always an issue, I'm afraid, and for this project UTF-8 is IMO the best choice if we have to stick to just one encoding. For Wikipedia this is undoubtedly true. Other wikis, I'm sure, would use something different. But it still "Just Works", so I'm not complaining. It also makes things nice for the developers, because many IDEs and editors support UTF-8 out of the box.
But, of course, space is always an issue. Using data compression has an impact on processor performance. Having a better encoding for your text is "compression without a processing penalty", to put it in layman's terms, and having to retrieve more data slows down your wiki for several reasons: more data to retrieve from the database, and more bandwidth needed / longer transmission time. For instance, for the average Japanese wiki it would save 30% space on the server, 15-20% in bandwidth even with mod-gz, and give 30% better memory usage in database caching (caching is good for MediaWiki, as you surely know better than me) - equivalent to having 30%+ more memory for caching. Those are rough figures. I'm not asking you to change this, as it would involve a lot of time I'm sure you can put to better use, just to keep it in consideration if at some point you had time to support more than one encoding in MediaWiki. Many wikis hardly use any images at all, and when they do, they keep them somewhere outside the database (I haven't looked at this in MediaWiki - are you storing them in BLOBs?).
So, "UTF8 is not the best for Asian text" as in, "by using exclusively UTF8, you're bogging your performance down 20%+ for many people" . And extra tweaks are not realistic for the joe-wiki-admin who most probably won't have caching at all.
This is not a critique. For me the wiki works well, it's fast enough, and UTF8 happens to suit me fine. This direction just keeps MediaWiki from being more popular in Asia. Stability and functionality rank above performance in my list of considerations.
On 2/7/06, Brion Vibber brion@pobox.com wrote:
muyuubyou wrote:
A couple of mistakes there.
There is a difference between 'é' and '木' for many editors, including non-windows editors that default to ASCII. The character 'é' is indeed ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII" .
False. ASCII is 7-bits only. Anything that's 8 bits is *not* ASCII, but some other encoding.
Many/most 8-bit character encodings other than EBCDIC are *supersets* of ASCII, which incorporate the 7-bit ASCII character set in the lower 128 code points and various other characters in the high 128 code points.
Many people erroneously call any mapping from a number to a character that can fit in 8 bits an "ASCII code", however this is incorrect.
Many popular editors default to 8-bit ASCII,
There's no such thing.
and others default to 8859-1, also known as "latin 1" ;
That part is reasonably true for Windows operating systems and some older Unix/Linux systems in North America and western Europe.
Mac OS X and most modern Linux systems default to UTF-8.
ASCII values from 128 on,
No such thing; there are no ASCII values from 128 on. However many 8-bit character encodings which are supersets of ASCII contain *non*-ASCII characters in the 128-256 range. Since these represent wildly different characters for each such encoding (perhaps an accented Latin letter, perhaps a Greek letter, perhaps a Thai letter, perhaps an Arabic letter...) it's unwise to simply assume that it will have any meaning in a program that doesn't know about your favorite encoding selection.
UTF8 is, by the way, not the best encoding for Asian text.
That depends on what you mean by "best". If by "best" you mean only "as compact as possible for the particular data I want to use at the moment" then yes, there are other encodings which are more compact.
If, however, compatibility is an issue, UTF-8 is extremely functional and works very well with UNIX/C-style string handling, pathnames, and byte-oriented communications protocols at a minor 50% increase in uncompressed size for such languages.
If space were an issue, though, you'd be using data compression.
UTF8 is meant to display English text effectively (1 byte)
False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of specially meaningful bytes such as 0 and the '/' path separator in string handling in Unix-like environments. (It was created for Bell Labs' Plan 9 operating system, an experimental successor to Unix.)
That it happens to also be compact for English is nice, too.
It would be very nice to have an UTF16 version, which would only take 2-bytes for each character most of the time, 33%+- better space-wise.
Much of the time, the raw amount of space taken up by text files is fairly insignificant. Text is small compared to image and multimedia data, and it compresses very well.
Modern memory and hard disk prices strongly favor accessibility and compatibility in most cases over squeezing a few percentage points out of uncompressed text size.
-- brion vibber (brion @ pobox.com)
MediaWiki-l mailing list MediaWiki-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/mediawiki-l