[Mediawiki-l] Re: Re: accents not appearing correctly

muyuubyou muyuubyou at gmail.com
Tue Feb 7 08:03:29 UTC 2006

A couple of mistakes there.

There is a difference between 'é' and '木' for many editors, including
non-Windows editors that default to ASCII. The character 'é' is in a sense
"ASCII": it is not in 7-bit ASCII, but it does appear in what is loosely
called "extended ASCII" or "8-bit ASCII". Many popular editors default to
some 8-bit ASCII set, and others default to ISO 8859-1, also known as
"Latin-1"; some use the Windows encoding, which is not exactly the same
thing, but it's close. There is also the Mac encoding, which again is
close but different. And those are just the "popular" ones.

Byte values from 128 up, the ones with the high bit set, are the
problematic ones. UTF8 reserves those values to indicate that more bytes
are needed to encode a character. UTF8 is variable-length, while all the
others I've mentioned here are "1-byte-1-char", so to speak.
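To make this concrete, here is a small Python sketch (Python is used
purely for illustration; the encodings behave the same in any language)
showing how the same character produces different bytes under Latin-1 and
UTF8:

```python
# 'é' is U+00E9: one byte in Latin-1, two bytes in UTF-8.
latin1 = "é".encode("latin-1")
utf8 = "é".encode("utf-8")
print(latin1)  # b'\xe9'      -> one byte with the high bit set
print(utf8)    # b'\xc3\xa9'  -> lead byte plus continuation byte

# Every byte of a multi-byte UTF-8 sequence has the high bit set.
print(all(b >= 0x80 for b in utf8))  # True
```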

Brion is right that 'é' on its own is not "UTF8 friendly": only the lower
ASCII range (0-127 or, in hexadecimal, 0x00-0x7F) is encoded the same in
all the popular 8-bit representations and also in UTF8; anything above
that differs from encoding to encoding.
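A quick illustration of that compatibility, again as a Python sketch:

```python
# Pure 7-bit ASCII text yields identical bytes in every encoding mentioned.
plain = "Federation"  # ASCII only
print(plain.encode("ascii") == plain.encode("latin-1") == plain.encode("utf-8"))  # True

# Add accented characters and the encodings diverge.
accented = "Fédération"
print(len(accented.encode("latin-1")))  # 10 bytes: one per character
print(len(accented.encode("utf-8")))    # 12 bytes: each 'é' takes two
```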

In other words, Hugh will have to check the encoding of the file, and
Brion is right that this is not a browser problem at all.
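The symptom itself is easy to reproduce. This Python sketch stands in for
whatever the editor and PHP end up doing with the bytes:

```python
# A file saved as Latin-1 but decoded as UTF-8: the lone 0xE9 byte is not
# a valid UTF-8 sequence, so it decodes to the replacement character.
latin1_bytes = "Fédération".encode("latin-1")
print(latin1_bytes.decode("utf-8", errors="replace"))  # F�d�ration

# The reverse mistake, a UTF-8 file decoded as Latin-1, gives the classic mojibake:
utf8_bytes = "é".encode("utf-8")
print(utf8_bytes.decode("latin-1"))  # Ã©
```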

Hope that helped, Hugh. Also read my email from yesterday where I tried to
give you a solution instead of scolding you ;)

UTF8 is, by the way, not the best encoding for Asian text. UTF8 is
designed to encode English text efficiently (1 byte per character) while
still being able to map all of Unicode. This is nice, but since all
Japanese and Chinese characters (at least all I tried; I'd have to check
the tables to make sure) take 3 BYTES OR MORE (sorry for shouting), that
alone is reason enough to use another wiki, like the popular Japanese
PukiWiki (which uses EUC-JP), or others that typically use Shift-JIS or
EUC-JP, EUC-CN, Big5, etc. It would be very nice to have a UTF16 version,
which would take only 2 bytes per character most of the time, roughly 33%
better space-wise. I'm aware it's bad to have one more thing to care
about (different encodings), so I really understand why this is not being
done. For me UTF8 is more or less okay, since my wiki will be mixed
Latin-1 + Asian text.
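Those byte counts are easy to verify (a Python sketch; 'utf-16-be' is
used so the byte-order mark doesn't inflate the count):

```python
# Typical CJK characters: 3 bytes each in UTF-8, 2 in UTF-16.
for ch in ("木", "本"):
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
# 木 3 2
# 本 3 2

# The trade-off: plain ASCII doubles in size under UTF-16.
print(len("wiki".encode("utf-8")), len("wiki".encode("utf-16-be")))  # 4 8
```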

For those who made it to the end of this message, thanks for your patience
:-) now back to my busy-ass life as a game developer... I'm late to my

On 2/6/06, Brion Vibber <brion at pobox.com> wrote:
> Hugh Prior wrote:
> > "Brion Vibber" <brion at pobox.com> wrote:
> >> Your text editor will have some sort of encoding setting. Use it.
> >
> > Thanks for trying Brion.
> >
> > However, in view of the actual problem though, this which you suggest
> > is, sorry to say, complete nonsense.
> That's only, sorry to say, because you have no idea what you're talking
> about.
> >  $pageText = "Fédération";
> >
> > It is not complex text.
> >
> >  It is not as if I am trying to input Chinese via a
> > program into a wiki.
> Actually, it's exactly like that. Your string contains two non-ASCII
> characters,
> which will need to be properly encoded or you'll get some data corruption.
> Specifically, they must be UTF-8 encoded.
> There's *no* qualitative difference between "é" and something like "本";
> both
> are non-ASCII characters and therefore must be properly encoded in the
> UTF-8
> source file.
> The symptoms you described are *exactly* the symptoms of a miscoded 8-bit
> 8859-1 (or Windows "ANSI" or whatever they call it) character in what
> should be
> a UTF-8 text stream.
> > If you think that the code, being PHP, still has to be run by a browser,
> I'm talking about the text editor you used to save the PHP source file
> containing literal strings. There's no "browser" involved in your problem.
> -- brion vibber (brion @ pobox.com)
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l at Wikimedia.org
> http://mail.wikipedia.org/mailman/listinfo/mediawiki-l
