"Extended ASCII" is "accepted" and thus exists, regardless it came from the ASCII board or not. The fact that bytes are 8-bits and almost everything in computers is in bytes or multiples thereof has created this nightmare of 8-bit encodings we're still suffering today. IBM's first extension is what many people call "extended ASCII" we like it or not, and that is what I was talking about. Namely the DOS representation of the higher 128 codes. It came with "IBM PC".
"There is no such thing" can only be true if you ignore the gazillion lines of legacy code that think otherwise.
I agree with you that it's unwise to assume programs will map your non-ASCII right, but since many do, it's a common thing. 99% (in amount) of the things stored in 1 byte per character are Latin-1. The other "important" languages are impossible to represent in 1 byte anyway, except for Arabic and Hebrew, but those are usually isolated from our "computer isle" in the West. For instance, 90% of the "interweb" that isn't Chinese or Japanese belongs to a language Latin-1 covers.
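Just to illustrate the mess (a quick Python sketch of my own, nothing to do with MediaWiki itself): the same high byte is a different character depending on which 8-bit "extended ASCII" you assume, while the low 128 codes agree everywhere.

    # The high half (128-255) means something different in each 8-bit encoding.
    b = bytes([0xE9])
    print(b.decode("cp437"))    # 'Θ' in the old IBM PC / DOS code page 437
    print(b.decode("latin-1"))  # 'é' in ISO-8859-1 ("latin 1")
    print(b.decode("cp1251"))   # 'й' in the Cyrillic Windows code page
    # The low half (0-127) is plain 7-bit ASCII in all of them:
    print(b"hello".decode("cp437") == b"hello".decode("latin-1") == "hello")  # True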
False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of specially meaningful bytes such as 0 and the '/' path separator in string handling in Unix-like environments. (It was created for Bell Labs' Plan 9 operating system, an experimental successor to Unix.)
That it happens to also be compact for English is nice, too.
In the real world, the fact that it "happens to be compact for English" is crucial. Unix was developed mainly "in English", and therefore encoding the English language plus some extra codes was all that fell into consideration; they simply didn't need to comment code in Japanese. What you say is factually true, but I was just pointing out the most important reason with regard to the topic at hand. By saying "UTF8 is meant to display English correctly" I didn't imply it isn't meant to do anything else, or that that was the basis of it. I could have said "UTF8 is meant to encode English correctly and effectively, among other things", but I just didn't want to shift the focus.
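A small Python sketch (mine, just to show the point) of that ASCII compatibility: ASCII text is byte-for-byte the same in UTF-8, and every byte of a multi-byte character is >= 0x80, so NUL and '/' never show up inside one.

    # ASCII stays ASCII, one byte per character:
    assert "plain ASCII /path".encode("utf-8") == b"plain ASCII /path"
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so 0x00 and 0x2F ('/')
    # can never appear inside a character and confuse C string/path handling:
    assert all(byte >= 0x80 for byte in "é木".encode("utf-8"))
    print("é木".encode("utf-8"))  # b'\xc3\xa9\xe6\x9c\xa8' -- 2 + 3 bytes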
and others default to 8859-1, also known as "latin 1" ;
That part is reasonably true for Windows operating systems and some older Unix/Linux systems in North America and western Europe.
Mac OS X and most modern Linux systems default to UTF-8.
Yeah, but his editor of choice most probably isn't, or there wouldn't be a problem in the first place. Let's keep the focus.
It's a very common scenario, for minor changes, that people connect via telnet or SSH and quickly edit something directly on their test server, instead of editing locally and then uploading via FTP (or using some FTP-capable editor like gvim with the FTP plugin, for instance). It's also very common that consoles are set to ISO-8859-1, and thus vi, pico or nano will use that. It can also happen that it's a shared environment and the user just can't install stuff... also, many telnet/SSH clients are not UTF8 compatible, or he may have any sort of configuration problem I can't even imagine now. Shit happens.
UTF8 is, by the way, not the best encoding for Asian text.
That depends on what you mean by "best". If by "best" you mean only "as compact as possible for the particular data I want to use at the moment" then yes, there are other encodings which are more compact.
If, however, compatibility is an issue, UTF-8 is extremely functional and works very well with UNIX/C-style string handling, pathnames, and byte-oriented communications protocols at a minor 50% increase in uncompressed size for such languages.
If space were an issue, though, you'd be using data compression.
Compatibility is always an issue, I'm afraid, and for this project UTF-8 is IMO the best choice if we have to stick to just one encoding. For Wikipedia this is undoubtedly true. Other wikis, I'm sure, would use something different. But it still "Just Works", so I'm not complaining. It also makes things nice for the developers, because many IDEs and editors support UTF-8 out of the box.
But, of course, space is always an issue. Using data compression has an impact on processor performance. Having a better encoding for your text is "compression without a processing penalty", to put it in layman's terms, and having to retrieve more data slows down your wiki for several reasons: more data to retrieve from the database, and more bandwidth needed / longer transmission time. For instance, for the average Japanese wiki it would save 30% space on the server, 15-20% in bandwidth even with mod-gz, and give 30% better memory usage in database caching (caching is good for MediaWiki, as you surely know better than me) - equivalent to having 30%+ more memory for caching. Those are rough figures. I'm not asking you to change this, as it would involve a lot of time I'm sure you can put to better use, just to keep it in consideration if at some point you had time to support more than one encoding in MediaWiki. Many wikis hardly use any images at all, and when they do, they keep them somewhere outside the database (I haven't looked at this in MediaWiki - are you storing them in BLOBs?).
So, "UTF8 is not the best for Asian text" as in, "by using exclusively UTF8, you're bogging your performance down 20%+ for many people" . And extra tweaks are not realistic for the joe-wiki-admin who most probably won't have caching at all.
This is not a critique. For me the wiki works well, it's fast enough, and UTF8 happens to suit me fine. This direction just keeps MediaWiki from being more popular in Asia. Stability and functionality rank above performance in my list of considerations.
On 2/7/06, Brion Vibber brion@pobox.com wrote:
muyuubyou wrote:
A couple of mistakes there.
There is a difference between 'é' and '木' for many editors, including non-windows editors that default to ASCII. The character 'é' is indeed ASCII, it's just not 7-bit ASCII but "extended ASCII" or "8-bit ASCII" .
False. ASCII is 7-bits only. Anything that's 8 bits is *not* ASCII, but some other encoding.
Many/most 8-bit character encodings other than EBCDIC are *supersets* of ASCII, which incorporate the 7-bit ASCII character set in the lower 128 code points and various other characters in the high 128 code points.
Many people erroneously call any mapping from a number to a character that can fit in 8 bits an "ASCII code", however this is incorrect.
Many popular editors default to 8-bit ASCII,
There's no such thing.
and others default to 8859-1, also known as "latin 1" ;
That part is reasonably true for Windows operating systems and some older Unix/Linux systems in North America and western Europe.
Mac OS X and most modern Linux systems default to UTF-8.
ASCII values from 128 on,
No such thing; there are no ASCII values from 128 on. However many 8-bit character encodings which are supersets of ASCII contain *non*-ASCII characters in the 128-256 range. Since these represent wildly different characters for each such encoding (perhaps an accented Latin letter, perhaps a Greek letter, perhaps a Thai letter, perhaps an Arabic letter...) it's unwise to simply assume that it will have any meaning in a program that doesn't know about your favorite encoding selection.
UTF8 is, by the way, not the best encoding for Asian text.
That depends on what you mean by "best". If by "best" you mean only "as compact as possible for the particular data I want to use at the moment" then yes, there are other encodings which are more compact.
If, however, compatibility is an issue, UTF-8 is extremely functional and works very well with UNIX/C-style string handling, pathnames, and byte-oriented communications protocols at a minor 50% increase in uncompressed size for such languages.
If space were an issue, though, you'd be using data compression.
UTF8 is meant to display English text effectively (1 byte)
False; UTF-8 is meant to be compatible with 7-bit ASCII and the treatment of specially meaningful bytes such as 0 and the '/' path separator in string handling in Unix-like environments. (It was created for Bell Labs' Plan 9 operating system, an experimental successor to Unix.)
That it happens to also be compact for English is nice, too.
It would be very nice to have an UTF16 version, which would only take 2-bytes for each character most of the time, 33%+- better space-wise.
Much of the time, the raw amount of space taken up by text files is fairly insignificant. Text is small compared to image and multimedia data, and it compresses very well.
Modern memory and hard disk prices strongly favor accessibility and compatibility in most cases over squeezing a few percentage points out of uncompressed text size.
-- brion vibber (brion @ pobox.com)
MediaWiki-l mailing list MediaWiki-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/mediawiki-l