A couple of mistakes there.
There is a difference between 'é' and '木' for many editors, including non-Windows editors that default to ASCII. The character 'é' is not in 7-bit ASCII, but it is in the "extended ASCII" (8-bit) character sets. Many popular editors default to one of those; others default to ISO 8859-1, also known as "Latin-1"; some use the Windows encoding (Windows-1252), which is not exactly the same thing, but it's close. There is also the Mac encoding (MacRoman), which is also close but again different. And those are just the "popular" ones.
Byte values from 128 up (those with the high bit set) are the problematic ones. UTF-8 reserves those to indicate that more bytes are needed to represent a single character. UTF-8 is variable-length, while all the others I've mentioned here are "1-byte-1-char", so to speak.
Brion is right about a Latin-1 'é' not being "UTF-8 friendly": only the lower ASCII range (0-127 or, in hexadecimal, 0x00-0x7F) is encoded the same in all the popular 8-bit representations and also in UTF-8, and 'é' sits above that range.
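To make that concrete, here's a small sketch in Python (just for illustration; the problem file itself is PHP) showing the two encodings of 'é' and why a Latin-1 byte breaks a UTF-8 decoder:

```python
# 'é' is one byte in Latin-1 but two bytes in UTF-8.
e = "é"
print(e.encode("latin-1"))   # b'\xe9'      -> single byte, high bit set
print(e.encode("utf-8"))     # b'\xc3\xa9'  -> two bytes

# A pure-ASCII string is byte-for-byte identical in both encodings.
s = "Federation"
print(s.encode("latin-1") == s.encode("utf-8"))  # True

# Feeding the raw Latin-1 byte to a UTF-8 decoder fails -- this is
# exactly the "miscoded character" symptom being discussed.
try:
    e.encode("latin-1").decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err)
```

The same bytes saved by a Latin-1 editor into a file that MediaWiki reads as UTF-8 produce exactly this decode failure.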
In other words, Hugh will have to check the encoding of the file, and Brion is right that this is not a browser problem at all.
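On a GNU/Linux box, checking and fixing the file could look something like this (a sketch, not a recipe: `SomePage.php` is a hypothetical filename, and the exact `file -i` output wording varies by version):

```shell
# Guess the current encoding of the PHP source file.
file -i SomePage.php    # e.g. "SomePage.php: text/x-php; charset=iso-8859-1"

# Convert it from Latin-1 to UTF-8 and swap it into place.
iconv -f ISO-8859-1 -t UTF-8 SomePage.php > SomePage.utf8.php
mv SomePage.utf8.php SomePage.php
```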
Hope that helped, Hugh. Also read my email from yesterday where I tried to give you a solution instead of scolding you ;)
UTF-8 is, by the way, not the best encoding for Asian text. UTF-8 is designed to encode English text efficiently (1 byte per character) while still being able to map all of Unicode. This is nice, but since all Japanese and Chinese characters (at least all I tried; I'd have to check the tables to make sure) take 3 BYTES OR MORE (sorry for shouting), that alone is reason enough to use another wiki, like the popular Japanese PukiWiki (which uses EUC-JP), or others typically using Shift-JIS, EUC-JP, EUC-CN, Big5, etc. It would be very nice to have a UTF-16 version, which would take only 2 bytes per character most of the time, roughly 33% better space-wise. I'm aware it's bad to have just one more thing to care about (different encodings), so I really understand why this is not being done. For me UTF-8 is more or less okay, since my wiki will be mixed Latin-1 + Asian text.
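The size argument is easy to verify; a quick Python sketch (sample strings are my own, just for illustration):

```python
# Byte cost of the same text in UTF-8 vs. UTF-16.
# The -le variant is used so no 2-byte BOM is counted.
for text in ["Federation", "Fédération", "日本語のテキスト"]:
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{text!r}: {u8} bytes in UTF-8, {u16} bytes in UTF-16")

# Each of these CJK characters costs 3 bytes in UTF-8 but only 2 in UTF-16.
assert all(len(c.encode("utf-8")) == 3 for c in "日本語")
```

For CJK-heavy text that is the 3-bytes-vs-2-bytes gap mentioned above; for mostly-ASCII text the comparison flips, which is why mixed wikis are a wash.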
For those who made it to the end of this message, thanks for your patience :-) Now back to my busy-ass life as a game developer... I'm late for my commute.
On 2/6/06, Brion Vibber <brion@pobox.com> wrote:
Hugh Prior wrote:
"Brion Vibber" brion@pobox.com wrote:
Your text editor will have some sort of encoding setting. Use it.
Thanks for trying Brion.
However, in view of the actual problem though, this which you suggest is, sorry to say, complete nonsense.
That's only, sorry to say, because you have no idea what you're talking about.
$pageText = "Fédération";
It is not complex text.
It is not as if I am trying to input Chinese via a program into a wiki.
Actually, it's exactly like that. Your string contains two non-ASCII characters, which will need to be properly encoded or you'll get some data corruption. Specifically, they must be UTF-8 encoded.
There's *no* qualitative difference between "é" and something like "本"; both are non-ASCII characters and therefore must be properly encoded in the UTF-8 source file.
The symptoms you described are *exactly* the symptoms of a miscoded 8-bit ISO 8859-1 (or Windows "ANSI" or whatever they call it) character in what should be a UTF-8 text stream.
If you think that the code, being PHP, still has to be run by a browser,
I'm talking about the text editor you used to save the PHP source file containing literal strings. There's no "browser" involved in your problem.
-- brion vibber (brion @ pobox.com)
MediaWiki-l mailing list
MediaWiki-l@Wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/mediawiki-l