I am so sorry I bothered you with the previous two posts, but problems occurred: proxy blocking, email error... Sniff! lmhelp2 is the same as lmhelp... Here is the post I wanted to submit:
-----------------------------------------------------------------------------------------------
Hi nakohdo,
Thank you for your precious help :) .
I am using "Firefox": in "View -> Character Encoding", "Unicode (UTF-8)" was and is checked.
Try opening the robots.txt in your browser and change the encoding to UTF-8.
With both "Firefox" and "Internet Explorer", Chinese characters aren't displayed properly (the character encoding being Unicode (UTF-8) in both cases).
right click, "Save target as..."
I have "Save Page As...", so I saved the page as "robots.txt" and opened it with "MS Word". It prompted me to choose a specific encoding: I chose Unicode (UTF-8). And it worked! I could see the Chinese characters and other foreign characters too!
The robots.txt file you mentioned in your first posting doesn't provide a mechanism for telling its encoding so the browser has to guess or take the default settings.
So, I added, at the beginning of the file "robots.txt", the following code (without the starting dashes):
--<html>
--<head>
--<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
--</head>
--<body>
and at the end:
--</body>
--</html>
And I changed the file extension: robots.txt -> robots.html
Then, I opened it with Firefox and IE and I got the Chinese characters and the other characters properly rendered too!
Thanks for your brains :) . Sincerely,
Hi again,
I am coming back... :) I am sorry :/ I haven't yet understood correctly how things work... Can you help me?
Here is what I have been experimenting.
------ First try --------------------------------------
Let's consider the following XML file:
=======================================================
<?xml version="1.0" encoding="UTF-8"?>
<root>
<title> Etoilé </title>
</root>
=======================================================
Nota: "Etoilé" is the French word for "starry". I used it because I needed a small letter with an acute accent (é)...
If I open that file with "Firefox" I get the following message:
=======================================================
XML Parsing Error: not well-formed
Location: essai.xml
Line Number 5, Column 9:
Etoil?
---------------^
=======================================================
I don't understand why there is an error... How am I supposed to write it?
------ Second try -------------------------------------
Now, if I replace the "é" character with the code %E9:
=======================================================
<?xml version="1.0" encoding="UTF-8"?>
<root>
<title> Etoil%E9 </title>
</root>
=======================================================
and open the file with "Firefox", I have no error anymore but here is what is displayed:
=======================================================
<root>
<title> Etoil%E9 </title>
</root>
=======================================================
------ Third try: --------------------------------------
=======================================================
<?xml version="1.0" encoding="UTF-8"?>
<root>
<title> EtoilÃ© </title>
</root>
=======================================================
In that case, when I open the file with "Firefox", I get the small letter "e" with an acute accent:
=======================================================
<root>
<title> Etoilé </title>
</root>
=======================================================
I don't understand:
- I expected the first two tries to work
- and the last one to fail.
What happened is the exact opposite! I am totally confused. I don't even know how to ask my question properly.
I think that I understand what "é" and %E9 are... but I do not understand what "Ã©" is... moreover it is two characters, "Ã" and "©", instead of one...
Thank you for your help :) . Best regards, -- Lmhelp
lmhelp2 wrote :
<?xml version="1.0" encoding="UTF-8"?>
Using this alone is not enough. You also have to save the XML file with UTF-8 encoding before opening it in Firefox.
Saving it with ISO-8859 and opening it in Firefox as UTF-8 would cause exactly the errors you're describing.
Hope it helps
Alexis
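The mismatch Alexis describes can be reproduced with a minimal Python 3 sketch. The variable names are only illustrative and not part of anyone's setup in this thread; the same text is simply encoded with one charset and decoded with the other.

```python
# Illustrative sketch: what happens when the bytes on disk and the
# encoding used to read them disagree. Variable names are hypothetical.
text = "Etoilé"

# Saved as UTF-8 but read as ISO-8859-1: "é" turns into two characters.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Saved as ISO-8859-1 but read as UTF-8: the lone 0xE9 byte is not valid UTF-8.
latin1_bytes = text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)
```

The first case matches the "third try" (two characters where one was meant), the second matches the "first try" (a hard error).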
I don't understand:
- I expected the first two tries to work
- and the last one to fail.
What happened is the exact opposite! I am totally confused. I don't even know how to ask my question properly.
I think that I understand what "é" and %E9 are... but I do not understand what "Ã©" is... moreover it is two characters, "Ã" and "©", instead of one...
Thank you for your help :) . Best regards, -- Lmhelp
The letter é has the codepoint 0xE9 in Unicode.
If the file is written in iso-8859-1, it is represented by just one byte: 0xE9 (é).
If the file is written in utf-8, it is represented by two bytes: 0xC3 0xA9 (which show up as "Ã©" when those bytes are read as iso-8859-1).
If the file is written in utf-16, it is represented by two bytes: 0x00 0xE9 in utf-16 BE and 0xE9 0x00 in utf-16 LE.
The line <?xml version="1.0" encoding="UTF-8"?> says "this file will be in utf-8". If you then write "Etoilé " as 0x45 0x74 0x6f 0x69 0x6c 0xe9 0x20, that makes invalid XML, since it should have been 0x45 0x74 0x6f 0x69 0x6c 0xc3 0xa9 0x20 (alternatively, you could have specified a different encoding in the prolog).
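As a rough check of the byte sequences above, here is a small Python 3 sketch using the standard library's xml.etree.ElementTree. The helper names build and title_or_error are made up for this example, and the document is reduced to a single title element.

```python
import xml.etree.ElementTree as ET

# The two byte sequences quoted above for "Etoilé ".
body_latin1 = bytes([0x45, 0x74, 0x6F, 0x69, 0x6C, 0xE9, 0x20])
body_utf8 = bytes([0x45, 0x74, 0x6F, 0x69, 0x6C, 0xC3, 0xA9, 0x20])


def build(encoding_name, body):
    """Hypothetical helper: wrap raw bytes in a tiny XML document."""
    prolog = f'<?xml version="1.0" encoding="{encoding_name}"?>'.encode("ascii")
    return prolog + b"<root><title>" + body + b"</title></root>"


def title_or_error(document):
    """Return the title text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(document).findtext("title")
    except ET.ParseError:
        return None


print(title_or_error(build("UTF-8", body_latin1)))       # prolog and bytes disagree
print(title_or_error(build("UTF-8", body_utf8)))         # prolog and bytes agree
print(title_or_error(build("ISO-8859-1", body_latin1)))  # different prolog, same bytes
```

Only the first document is rejected; the other two parse to the same title, which is the "alternatively" mentioned above.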
The use of %E9 is just a trick for URLs, since they may not allow a literal "é" there (in this URL the é was encoded in iso-8859-1, hence the single byte E9). It only appears in robots.txt because it talks about URLs.
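For illustration, Python's urllib.parse can produce and undo this kind of percent-encoding. This is only a sketch; the string "Etoil%E9" is borrowed from the second try above rather than from the actual robots.txt.

```python
from urllib.parse import quote, unquote

# Percent-encode "é" from its iso-8859-1 byte and from its utf-8 bytes.
latin1_form = quote("é", encoding="iso-8859-1")
utf8_form = quote("é")  # utf-8 is the default
print(latin1_form, utf8_form)

# Undo the encoding: the charset has to be known to get the letter back.
decoded_latin1 = unquote("Etoil%E9", encoding="iso-8859-1")
decoded_as_utf8 = unquote("Etoil%E9")  # invalid as utf-8, replaced with U+FFFD
print(decoded_latin1, decoded_as_utf8)
```

An XML parser does not apply this decoding, which is why the second try displayed %E9 literally.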
Here is what I have been experimenting.
If I open that file with "Firefox" I get the following message:
=======================================================
XML Parsing Error: not well-formed
Location: essai.xml
Line Number 5, Column 9:
Etoil?
---------------^
=======================================================
I don't understand why there is an error... How am I supposed to write it?
For converting text files to UTF-8 encoding Microsoft's own Notepad is very reliable in my experience (even if other editors claim to do the same).
hth Frank
---------------------------------------------------------------------- Hi Alexis,
Thank you, I hadn't realized... and "Platonides"'s post explains why...!
---------------------------------------------------------------------- Hi Platonides,
Thanks a lot for your explanations and examples!
Line 1: "E t o i l é <space>"
Line 2: 0x45 0x74 0x6f 0x69 0x6c 0xe9 0x20
Line 3: 0x45 0x74 0x6f 0x69 0x6c 0xc3 0xa9 0x20
Do we say:
----- "Line 2" is the "iso-8859-1" representation of "Line 1"?
----- "Line 3" is the "utf-8" representation of "Line 1"? Question: shouldn't we have 7 * 2 "codepoints" instead of 8? Maybe you omitted them, didn't you?
----- "Line 1" is made of characters?
----- "Line 2" and "Line 3" are made of codepoints?
Let's consider:
Line 1: "E t o i l é <space>"
Line 4: 0x00 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20
Line 5: 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20 0x00
----- Is "Line 4" the "utf-16 BE" representation of "Line 1"?
----- Is "Line 5" the "utf-16 LE" representation of "Line 1"?
Can you tell me where to find the various tables which allow one to find a given representation ("iso-8859-1", "utf-8", "utf-16 BE", "utf-16 LE") for a given "character"?
I mean, how did you know that:
- 0xe9 is the "iso-8859-1" representation of é?
- 0xc3 0xa9 is the "utf-8" representation of é?
- 0x00 0xe9 is the "utf-16 BE" representation of é?
- 0xe9 0x00 is the "utf-16 LE" representation of é?
(Apart from the fact that you are a super-pro :) of course).
Please tell me if I misunderstood something and correct me if I didn't use the proper terminology :) .
---------------------------------------------------------------------- Hi Nakohdo,
I was trying to do it with Vim...
Thanks a lot for your help :) :) :) . All the best, -- Lmhelp
lmhelp2 wrote:
Hi Alexis,
Thank you, I hadn't realized... and "Platonides"'s post explains why...!
Hi Platonides,
Thanks a lot for your explanations and examples!
Line 1: "E t o i l é <space>"
Line 2: 0x45 0x74 0x6f 0x69 0x6c 0xe9 0x20
Line 3: 0x45 0x74 0x6f 0x69 0x6c 0xc3 0xa9 0x20
Do we say:
----- "Line 2" is the "iso-8859-1" representation of "Line 1"?
Yes.
----- "Line 3" is the "utf-8" representation of "Line 1"?
Yes.
----- "Line 2" and "Line 3" are made of codepoints?
Lines 2 and 3 are textual representations of the hex codes of how Line 1 would be written in their respective encodings.
A codepoint is a number which corresponds to a character. The character 'capital A' has the codepoint 65 by convention. We could all have agreed instead to give it the codepoint 1, or 25.
Question: shouldn't we have 7 * 2 "codepoints" instead of 8? Maybe you omitted them, didn't you?
We have 7 codepoints, one per "letter". Note that this is independent of the encoding. If you are wondering why utf-8 uses 8 bytes instead of 14 (as would have been used by utf-16), that's the beauty of utf-8. It will only use one byte (like ASCII) for basic letters, it will use two for a text with diacritics, Greek, Hebrew..., which are generally used less frequently, three bytes for characters much much less frequent (like €), and four for really odd ones, like Egyptian Hieroglyphics. So it is quite compact, while still allowing the full Unicode. There are other representations like UCS-4 easier to understand (four bytes per character) but terribly inefficient.
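The byte counts mentioned here can be checked with a short Python 3 sketch. The function name byte_lengths is invented for this example, "utf-32-be" stands in for UCS-4, and U+13000 is used as a sample Egyptian hieroglyph.

```python
def byte_lengths(text):
    """Hypothetical helper: encoded size of text in three Unicode encodings."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


# One character from each utf-8 length class, plus the 7-character sample text.
samples = ["E", "é", "€", chr(0x13000), "Etoilé "]
for sample in samples:
    print(repr(sample), len(sample), byte_lengths(sample))
```

The last sample has 7 codepoints in every case; only the number of bytes changes with the encoding (8, 14 and 28).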
----- "Line 1" is made of characters?
Yes. But character is often taken as a synonym of byte, which in this thread it is not.
Let's consider:
Line 1: "E t o i l é <space>"
Line 4: 0x00 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20
Line 5: 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20 0x00
----- Is "Line 4" the "utf-16 BE" representation of "Line 1"?
----- Is "Line 5" the "utf-16 LE" representation of "Line 1"?
Yes and yes.
Can you tell me where to find the various tables which allow one to find a given representation ("iso-8859-1", "utf-8", "utf-16 BE", "utf-16 LE") for a given "character"?
You may find this app useful http://www.ltg.ed.ac.uk/~richard/utf-8.cgi
I mean, how did you know that:
- 0xe9 is the "iso-8859-1" representation of é?
You indirectly told me when mentioning the %E9 :)
- 0xc3 0xa9 is the "utf-8" representation of é?
I did echo é | hd in a utf-8 terminal.
- 0x00 0xe9 is the "utf-16 BE" representation of é?
- 0xe9 0x00 is the "utf-16 LE" representation of é?
For low values, utf-16 is the same as the codepoint number, stored in two bytes. So almost always you end up placing the hex code of the codepoint plus a null byte (high order byte 0). If you store the number in Big Endian, the high part will appear first, else it will appear later.
UCS-2 keeps the codepoint in two bytes and simply stores it (in big endian or little endian). Since that restricts the characters you could use (what, I can't store Phoenician in ucs-2??), utf-16 uses some special values (the surrogate pairs) to take four bytes instead of two and provide the full unicode.
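A hand-rolled Python 3 sketch of the two cases described here, low codepoints and surrogate pairs. The function names are made up for illustration, and the sketch assumes it is given a valid codepoint (it does not reject the reserved surrogate range itself). U+10900 is the first Phoenician letter.

```python
def utf16_code_units(code_point):
    """Split a codepoint into one or two 16-bit units (illustrative only)."""
    if code_point < 0x10000:
        return [code_point]  # stored as-is, like UCS-2
    offset = code_point - 0x10000  # 20 bits left to spread over two units
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


def utf16_bytes(code_point, byteorder):
    """Serialize the units with the given byte order: 'big' or 'little'."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in utf16_code_units(code_point))


print(utf16_bytes(0xE9, "big").hex(), utf16_bytes(0xE9, "little").hex())
print(utf16_bytes(0x10900, "big").hex())
```

The results can be compared with Python's own "utf-16-be" and "utf-16-le" codecs.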
(Apart from the fact that you are a super-pro :) of course).
Hehe, thanks :)
Please tell me if I misunderstood something and correct me if I didn't use the proper terminology :) .
Hi Platonides (hehe),
Thank you for your answers.
In your "cgi" program (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=0301&mode=hex) I think I understand "everything" apart from the way you compute the "Hex UTF-8 bytes" section of the table...
Where does "CC 81" come from in the example above (cf. link supra) (Interpret as: Hex code point Type in a representation of a Unicode character: 0301).
Thank you for your help :) . Best regards, -- Lmhelp
Hi everyone, Hi Platonides,
Ok. Finding the "Hex UTF-8 bytes" representation of an "Hex code point" is not intuitive.
In the link "http://www.cl.cam.ac.uk/~mgk25/unicode.html", faq "What is UTF-8?", I found part of the answer to my question.
Let's consider the "Hex code point" 0xC3. What is the sequence of bits used to represent that character as "Hex UTF-8 bytes"?
The binary representation of 0xC3 is 1100 0011. The first bit of this byte being 1 (and not 0) we will use the following "pattern" with two bytes to represent that code: 110xxxxx 10xxxxxx and replace the "x" with the proper bits. To do it, we read the binary representation of 0xC3 from right to left:
- 8th bit of 0xC3 binary representation: 1 Replace the 16th x in 110xxxxx 10xxxxxx with 1: 110xxxxx 10xxxxx1
- 7th bit of 0xC3 binary representation: 1 Replace the 15th x in 110xxxxx 10xxxxx1 with 1: 110xxxxx 10xxxx11
- 6th bit of 0xC3 binary representation: 0 Replace the 14th x in 110xxxxx 10xxxx11 with 0: 110xxxxx 10xxx011
- 0 110xxxxx 10xx0011
- 0 110xxxxx 10x00011
- 0 110xxxxx 10000011
- 1 110xxxx1 10000011
- 1 110xxx11 10000011
And replace the remaining "x" with zeros: 11000011 10000011
The hexadecimal representation of 11000011 is 0xC3. The hexadecimal representation of 10000011 is 0x83.
Hence the "Hex UTF-8 bytes" representation of 0xC3 is 0xC3 0x83.
Is that it?
Thanks and all the best, -- Lmhelp
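The bit-filling steps above can be written as a small Python 3 sketch. The function name utf8_two_byte is invented here, and the sketch only covers the two-byte pattern 110xxxxx 10xxxxxx, which applies to codepoints from 0x80 to 0x7FF (so the choice of pattern depends on the size of the codepoint).

```python
def utf8_two_byte(code_point):
    """Fill the pattern 110xxxxx 10xxxxxx with the bits of code_point."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    first = 0b11000000 | (code_point >> 6)          # upper 5 bits after '110'
    second = 0b10000000 | (code_point & 0b00111111)  # lower 6 bits after '10'
    return bytes([first, second])


for cp in (0xC3, 0xE9, 0x0301):
    print(hex(cp), utf8_two_byte(cp).hex(" ").upper())
```

The three sample codepoints are the ones discussed in this thread: 0xC3 from the post above, 0xE9 for é, and 0x0301 from the cgi link.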
Is that it?
You might find the most thorough (and reliable) explanation for your questions at http://www.unicode.org
Cheers Frank
Thanks. All the best, -- Lmhelp
Hi all,
Platonides wrote:
We have 7 codepoints, one per "letter". Note that this is independent of the encoding. If you are wondering why utf-8 uses 8 bytes instead of 14 (as would have been used by utf-16), that's the beauty of utf-8. It will only use one byte (like ASCII) for basic letters, it will use two for a text with diacritics, Greek, Hebrew..., which are generally used less frequently, three bytes for characters much much less frequent (like €), and four for really odd ones, like Egyptian Hieroglyphics. So it is quite compact, while still allowing the full Unicode. There are other representations like UCS-4 easier to understand (four bytes per character) but terribly inefficient.
I'm not a UTF expert, but a minor point is that East Asian languages (Japanese and Chinese) fit into the "three byte" region (I think). I think their entire alphabet is in the 3-byte region. On the other hand, the other non-Unicode encodings (Shift-JIS, EUC-JP, GB*, ISO-2022) use exactly two bytes. So, by using UTF-8, the text increases by 50%.
I can't speak for both countries -- only the very small part I'm aware of -- but many e-mail programs and web pages still seem to use two-byte encodings (which probably include ASCII as a subset). I feel that UTF-8 isn't catching on very fast here, but (1) I don't know if that's true in other countries and (2) I don't know if this 50% increase in size is the show-stopper...
Ray (Someone feel free to correct me if I'm wrong...)
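A quick Python 3 sketch of the size comparison Ray describes, using two short sample strings chosen only for illustration and the codec names shipped with Python (shift_jis, euc_jp, gb2312).

```python
def encoded_sizes(text, legacy_codec):
    """Hypothetical helper: (bytes in the legacy codec, bytes in utf-8)."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


japanese = "日本語"  # 3 characters
chinese = "中文"     # 2 characters

print(encoded_sizes(japanese, "shift_jis"))
print(encoded_sizes(japanese, "euc_jp"))
print(encoded_sizes(chinese, "gb2312"))
print(encoded_sizes("abc", "shift_jis"))  # ASCII stays at one byte in both
```

For these samples the utf-8 form is 50% larger, while plain ASCII text is the same size in both.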
On Wed, Mar 24, 2010 at 10:25 AM, Frank Ralf Frank.Ralf@gmx.net wrote:
Here is what I have been experimenting.
If I open that file with "Firefox" I get the following message:
=======================================================
XML Parsing Error: not well-formed
Location: essai.xml
Line Number 5, Column 9:
Etoil?
---------------^
=======================================================
I don't understand why there is an error... How am I supposed to write it?
For converting text files to UTF-8 encoding Microsoft's own Notepad is very reliable in my experience (even if other editors claim to do the same).
hth Frank
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Except that it silently adds the BOM, which breaks a lot of things.
-Chad
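To see what the BOM looks like at the byte level, here is a minimal Python 3 sketch. It uses Python's "utf-8-sig" codec to imitate a file saved with a BOM; it does not involve Notepad itself.

```python
import codecs

# "utf-8-sig" writes the three BOM bytes in front of the text.
with_bom = "Etoilé".encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)
print(with_bom[:3].hex(" "), has_bom)

# A plain utf-8 decode keeps the BOM as an invisible first character (U+FEFF);
# decoding with "utf-8-sig" drops it.
plain_decode = with_bom.decode("utf-8")
sig_decode = with_bom.decode("utf-8-sig")
print(len(plain_decode), len(sig_decode))
```

Software that does not expect those three leading bytes sees them as content, which is one way a BOM can break things.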
Except that it silently adds the BOM, which breaks a lot of things.
Well, yes ... but others like Notepad++ even break my umlauts in the first place ;-)
So for real Unicode work use a special Unicode text editor like SC UniPad (http://www.unipad.org). BTW I have written some more on the hassles of UTF-8 and BOM at http://docs.moodle.org/en/UTF-8_and_BOM
Frank