Re: [Mediawiki-l] Web page source - "strange" characters - MediaWiki-l


Léa Massiot

23 Mar 2010

11:50 a.m.

I am so sorry I bothered you with the previous two posts, but problems occurred: proxy blocking, email error... Sniff! lmhelp2 is the same as lmhelp...

Here is the post I wanted to submit:
-----------------------------------------------------------------------------------------------
Hi nakohdo,

Thank you for your precious help :) .

I am using "Firefox": in "View -> Character Encoding", "Unicode (UTF-8)" was and is checked.

...

Try opening the robots.txt in your browser and change the encoding to UTF-8.

With both "Firefox" and "Internet Explorer", Chinese characters aren't displayed properly (the character encoding being Unicode (UTF-8) in both cases).

...

right click, "Save target as..."

I have "Save Page As...". So I saved the page as "robots.txt" and opened it with "MS Word". It prompted me to choose a specific encoding: I chose Unicode (UTF-8). And it worked! I could see the Chinese characters and other foreign characters too!

...

The robots.txt file you mentioned in your first posting doesn't

provide a mechanism

...

for telling its encoding so the browser has to guess or take the default settings.

So, I added, at the beginning of the file "robots.txt", the following code (without the starting dashes):
--<html>
--<head>
--<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
--</head>
--<body>
and at the end:
--</body>
--</html>
And I changed the file extension: robots.txt -> robots.html

Then, I opened it with Firefox and IE and I got the Chinese characters and the other characters properly rendered too!

Thanks for your brains :) .

Sincerely,
-- Lmhelp


lmhelp2

24 Mar 2010

9:41 a.m.

New subject: [Mediawiki-l] Web page source - "strange" characters

Hi again,

I am coming back... :) I am sorry :/ I haven't yet understood correctly how things work... Can you help me? Here is what I have been experimenting with.

------ First try --------------------------------------
Let's consider the following XML file:
=======================================================
<?xml version="1.0" encoding="UTF-8"?>
<root> <title> Etoilé </title> </root>
=======================================================
Nota: "Etoilé" is the French word for "starry". I used it because I needed a small letter with an acute accent (é)...

If I open that file with "Firefox" I get the following message:
=======================================================
XML Parsing Error: not well-formed
Location: essai.xml
Line Number 5, Column 9:
Etoil?
---------------^
=======================================================
I don't understand why there is an error... How am I supposed to write it?

------ Second try -------------------------------------
Now, if I replace the "é" character with the code %E9:
=======================================================
<?xml version="1.0" encoding="UTF-8"?>
<root> <title> Etoil%E9 </title> </root>
=======================================================
and open the file with "Firefox", I have no error anymore but here is what is displayed:
=======================================================
<root> <title> Etoil%E9 </title> </root>
=======================================================

------ Third try --------------------------------------
=======================================================
<?xml version="1.0" encoding="UTF-8"?>
<root> <title> EtoilÃ© </title> </root>
=======================================================
In that case, when I open the file with "Firefox", I get the small letter "e" with an acute accent:
=======================================================
<root> <title> Etoilé </title> </root>
=======================================================

I don't understand:
- I expected the first two tries to work
- and the last one to fail.
What happened is the exact opposite! I am totally confused. I don't even know how to ask my question properly.

I think that I understand what "é" and %E9 are... but I do not understand what Ã© is... moreover it is two characters, "Ã" and "©", instead of one...

Thank you for your help :) .

Best regards,
-- Lmhelp

-- View this message in context: http://old.nabble.com/Web-page-source---%22strange%22-characters-tp27999218…
Sent from the WikiMedia General mailing list archive at Nabble.com.

Alexis Moinet

10:09 a.m.

lmhelp2 wrote :

...

<?xml version="1.0" encoding="UTF-8"?>

Using this alone is not enough: you also have to save the XML file with UTF-8 encoding before opening it in Firefox. Saving it as ISO-8859 and opening it in Firefox as UTF-8 would cause exactly the errors you're describing.

Hope it helps,
Alexis
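The mismatch between the encoding a file is saved in and the encoding it is read as can be reproduced in a few lines of Python. This is only a sketch: the variable names are illustrative, and the assumption that the editor in question saved the files as ISO-8859-1 comes from the explanation above, not from anything verified in the thread.

```python
# The same text stored as bytes in two different encodings.
text = "Etoilé"
latin1_bytes = text.encode("iso-8859-1")  # é becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")         # é becomes the two bytes 0xC3 0xA9

# First try: bytes saved as ISO-8859-1, but read as UTF-8 (as the prolog claims).
try:
    latin1_bytes.decode("utf-8")
    first_try = "ok"
except UnicodeDecodeError as exc:
    first_try = f"error: {exc.reason}"

# Third try: "Ã©" typed in an ISO-8859-1 editor is saved as 0xC3 0xA9,
# which a UTF-8 reader then interprets as a single "é".
third_try = "EtoilÃ©".encode("iso-8859-1").decode("utf-8")

# The reverse direction: genuine UTF-8 bytes viewed as ISO-8859-1.
mojibake = utf8_bytes.decode("iso-8859-1")

print(first_try)
print(third_try)
print(mojibake)
```

The first case fails because a lone 0xE9 byte is not a complete UTF-8 sequence; the other two cases show why "Ã©" and "é" can stand in for each other depending on which side made the wrong assumption.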

Platonides

10:38 a.m.

...

I don't understand: - I expected the two first tries to work - and the last one to fail. What happened is the exact opposite! I am totally confused. I don't even know how to ask my question properly. I think that I understand what "é" and %E9 are... but I do not understand what Ã© is... moreover it is two characters "Ã" and "©" instead of one... Thank you for your help :) . Best regards, -- Lmhelp

The letter é has the codepoint 0xE9 in Unicode.
If the file is written in iso-8859-1, it is represented by just one byte: 0xE9 (é)
If the file is written in utf-8, it is represented by two bytes: 0xC3 0xA9 (Ã©)
If the file is written in utf-16, it is represented by two bytes: 0x00 0xE9 in utf-16 BE and 0xE9 0x00 in utf-16 LE.

The line <?xml version="1.0" encoding="UTF-8"?> says "this file will be in utf-8". If you then write "Etoilé " as
0x45 0x74 0x6f 0x69 0x6c 0xe9 0x20,
that makes invalid XML, since it should have been
0x45 0x74 0x6f 0x69 0x6c 0xc3 0xa9 0x20
(alternatively, you could have specified a different encoding in the prolog).

The use of %E9 is just a trick for URLs, since they may not allow a literal "é" there (the é in this URL would be encoded in iso-8859). It only appears in robots.txt because it talks about URLs.
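For anyone who wants to check these byte sequences themselves, here is a small Python sketch. The helper `hex_bytes` and the variable names are made up for this illustration; `str.encode` and `urllib.parse.quote` are standard-library calls.

```python
from urllib.parse import quote


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex pairs, e.g. 'c3 a9'."""
    return " ".join(f"{b:02x}" for b in data)


char = "é"
encodings = ("iso-8859-1", "utf-8", "utf-16-be", "utf-16-le")

# One code point, several byte representations.
encoded = {name: hex_bytes(char.encode(name)) for name in encodings}

print(f"code point: U+{ord(char):04X}")
for name, hexed in encoded.items():
    print(f"{name:>10}: {hexed}")

# Percent-encoding in a URL depends on which encoding produced the bytes.
url_latin1 = quote(char, encoding="iso-8859-1")
url_utf8 = quote(char, encoding="utf-8")
print(url_latin1, url_utf8)
```

The code point stays the same in every case; only the bytes (and therefore the percent-escapes) change with the encoding.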

Frank Ralf

11:25 a.m.

...

Here is what I have been experimenting.

...

If I open that file with "Firefox" I get the following message: ======================================================= XML Parsing Error: not well-formed Location: essai.xml Line Number 5, Column 9: Etoil? ---------------^ ======================================================= I don't understand why there is an error... How am I supposed to write it?

For converting text files to UTF-8 encoding Microsoft's own Notepad is very reliable in my experience (even if other editors claim to do the same). hth Frank

lmhelp2

11:40 a.m.

----------------------------------------------------------------------
Hi Alexis,

Thank you, I hadn't realized... and "Platonides"'s post explains why...!
----------------------------------------------------------------------
Hi Platonides,

Thanks a lot for your explanations and examples!

Line 1: "E t o i l é <space>"
Line 2: 0x45 0x74 0x6f 0x69 0x6c 0xe9 0x20
Line 3: 0x45 0x74 0x6f 0x69 0x6c 0xc3 0xa9 0x20

Do we say:
----- "Line 2" is the "iso-8859-1" representation of "Line 1"?
----- "Line 3" is the "utf-8" representation of "Line 1"?
Question: shouldn't we have 7 * 2 "codepoints" instead of 8? Maybe you omitted them, didn't you?
----- "Line 1" is made of characters?
----- "Line 2" and "Line 3" are made of codepoints?

Let's consider:
Line 1: "E t o i l é <space>"
Line 4: 0x00 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20
Line 5: 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20 0x00
----- Is "Line 4" the "utf-16 BE" representation of "Line 1"?
----- Is "Line 5" the "utf-16 LE" representation of "Line 1"?

Can you tell me where to find the various tables which allow one to find a given representation ("iso-8859-1", "utf-8", "utf-16 BE", "utf-16 LE") for a given "character"?

I mean, how did you know that:
- 0xe9 is the "iso-8859-1" representation of é?
- 0xc3 0xa9 is the "utf-8" representation of é?
- 0x00 0xe9 is the "utf-16 BE" representation of é?
- 0xe9 0x00 is the "utf-16 LE" representation of é?
(Apart from the fact that you are a super-pro :) of course).

Please tell me if I misunderstood something and correct me if I didn't use the proper terminology :) .
----------------------------------------------------------------------
Hi Nakohdo,

I was trying to do it with Vim... Thanks a lot for your help :) :) :) .

All the best,
-- Lmhelp

Platonides

9:27 p.m.

lmhelp2 wrote:

...

Yes.

...

----- "Line 3" is the "utf-8" representation of "Line 1"?

Yes.

...

----- "Line 2" and "Line 3" are made of codepoints?

Lines 2 and 3 are textual representations of the hex codes of how Line 1 would be written in their respective encodings. A codepoint is a number which corresponds to a character. The character 'capital A' has the codepoint 65 by convention. We could all have agreed instead to give it the codepoint 1, or 25.

...

Question: shouldn't we have 7 * 2 "codepoints" instead of 8? Maybe you omitted them, didn't you?

We have 7 codepoints, one per "letter". Note that this is independent of the encoding. If you are wondering why utf-8 uses 8 bytes instead of 14 (as would have been used by utf-16), that's the beauty of utf-8. It will only use one byte (like ASCII) for basic letters, it will use two for a text with diacritics, Greek, Hebrew..., which are generally used less frequently, three bytes for characters much much less frequent (like €), and four for really odd ones, like Egyptian Hieroglyphics. So it is quite compact, while still allowing the full Unicode. There are other representations like UCS-4 easier to understand (four bytes per character) but terribly inefficient.
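The byte counts mentioned here are easy to confirm. Below is a short Python sketch; the sample characters and variable names are chosen for this illustration only (U+13000 is one of the Egyptian hieroglyphs), and UTF-32 is used as a stand-in for the four-bytes-per-character representation.

```python
# One sample character for each UTF-8 length class.
samples = {
    "A": "basic Latin letter",
    "é": "letter with a diacritic",
    "€": "euro sign",
    "\U00013000": "Egyptian hieroglyph",
}

# Bytes needed per character in UTF-8, UTF-16 and UTF-32.
sizes = {
    ch: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )
    for ch in samples
}

for ch, label in samples.items():
    utf8_len, utf16_len, utf32_len = sizes[ch]
    print(f"U+{ord(ch):05X} {label:<24} utf-8={utf8_len} utf-16={utf16_len} utf-32={utf32_len}")

# The word from the thread: 7 code points, whatever the encoding.
word = "Etoilé "
print(len(word), len(word.encode("utf-8")), len(word.encode("utf-16-be")))
```

The last line shows the 7 code points taking 8 bytes in UTF-8 and 14 bytes in UTF-16.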

...

----- "Line 1" is made of characters?

Yes. But "character" is often taken as a synonym of "byte", which in this thread it is not.

...

Let's consider: Line 1: "E t o i l é <space>" Line 4: 0x00 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20 Line 5: 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00 0x20 0x00 ----- Is "Line 4" the "utf-16 BE" representation of "Line 1"? ----- Is "Line 5" the "utf-16 LE" representation of "Line 1"?

Yes and yes.

...

Can you tell me where to find the various tables which allow one to find a given representation ("iso-8859-1", "utf-8", "utf-16 BE", "utf-16 LE") for a given "character"?

You may find this app useful http://www.ltg.ed.ac.uk/~richard/utf-8.cgi

...

I mean, how did you know that: - 0xe9 is the "iso-8859-1" representation of é?

You indirectly told me when mentioning the %E9 :)

...

- 0xc3 0xa9 is the "utf-8" representation of é?

I did echo é | hd in a utf-8 terminal.

...

- 0x00 0xe9 is the "utf-16 BE" representation of é? - 0xe9 0x00 is the "utf-16 LE" representation of é?

For low values, utf-16 is the same as the codepoint number, stored in two bytes. So almost always you end up placing the hex code of the codepoint plus a null byte (high order byte 0). If you store the number in Big Endian, the high part will appear first, else it will appear later. UCS-2 keeps the codepoint in two bytes and simply stores it (in big endian or little endian). Since that restricts the characters you could use (what, I can't store Phoenician in ucs-2??), utf-16 uses some special values (the surrogate pairs) to take four bytes instead of two and provide the full unicode.
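The surrogate-pair arithmetic can be sketched in Python as follows. The function name `utf16_units` is invented for this example, and the sketch does not reject invalid input such as code points that fall inside the surrogate range itself; U+10900 is the first Phoenician letter.

```python
def utf16_units(code_point: int) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        # Low values: the code unit is simply the code point number.
        return [code_point]
    # Higher values: split the 20-bit offset into two 10-bit halves
    # and add them to the reserved surrogate ranges.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


def to_bytes_be(units: list[int]) -> bytes:
    """Store each 16-bit unit big-endian: high-order byte first."""
    return b"".join(unit.to_bytes(2, "big") for unit in units)


for cp in (0xE9, 0x20AC, 0x10900):
    units = utf16_units(cp)
    print(f"U+{cp:04X}:", " ".join(f"{u:04X}" for u in units))
    # Cross-check against Python's own UTF-16 encoder.
    assert to_bytes_be(units) == chr(cp).encode("utf-16-be")
```

For é the result is the single unit 00E9, matching the "codepoint plus a null byte" description, while the Phoenician letter needs two units (four bytes).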

...

(Apart from the fact that you are a super-pro :) of course).

Hehe, thanks :)

...

Please tell me if I misunderstood something and correct me if I didn't use the proper terminology :) .

lmhelp2

25 Mar 2010

8:04 a.m.

Hi Platonides (hehe),

Thank you for your answers.

In your "cgi" program (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=0301&mode=hex) I think I understand "everything" apart from the way you compute the "Hex UTF-8 bytes" section of the table... Where does "CC 81" come from in the example above (cf. link supra)? (Interpret as: Hex code point; Type in a representation of a Unicode character: 0301.)

Thank you for your help :) .

Best regards,
-- Lmhelp

lmhelp2

10:38 a.m.

Hi everyone, Hi Platonides,

Ok. Finding the "Hex UTF-8 bytes" representation of a "Hex code point" is not intuitive. In the link "http://www.cl.cam.ac.uk/~mgk25/unicode.html", faq "What is UTF-8?", I found part of the answer to my question.

Let's consider the "Hex code point" 0xC3. What is the sequence of bits used to represent that character as "Hex UTF-8 bytes"?

The binary representation of 0xC3 is 1100 0011.
The first bit of this byte being 1 (and not 0), the code point is above 0x7F and does not fit in a single UTF-8 byte, so we will use the following "pattern" with two bytes to represent that code:
110xxxxx 10xxxxxx
and replace the "x" with the proper bits.

To do it, we read the binary representation of 0xC3 from right to left:
- 8th bit of the binary representation of 0xC3: 1
  Replace the 16th position in 110xxxxx 10xxxxxx with 1: 110xxxxx 10xxxxx1
- 7th bit: 1
  Replace the 15th position in 110xxxxx 10xxxxx1 with 1: 110xxxxx 10xxxx11
- 6th bit: 0
  Replace the 14th position in 110xxxxx 10xxxx11 with 0: 110xxxxx 10xxx011
- 5th bit: 0 -> 110xxxxx 10xx0011
- 4th bit: 0 -> 110xxxxx 10x00011
- 3rd bit: 0 -> 110xxxxx 10000011
- 2nd bit: 1 -> 110xxxx1 10000011
- 1st bit: 1 -> 110xxx11 10000011
And replace the remaining "x" with zeros:
11000011 10000011

The hexadecimal representation of 11000011 is 0xC3.
The hexadecimal representation of 10000011 is 0x83.
Hence the "Hex UTF-8 bytes" representation of 0xC3 is 0xC3 0x83.

Is that it?

Thanks and all the best,
-- Lmhelp
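The bit-filling procedure above can be written out in Python. This is a sketch rather than a reference implementation: the function name `utf8_encode` is invented here, it does no validation (for example, it does not reject surrogates or values above U+10FFFF), and it is cross-checked against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the UTF-8 bit patterns by hand."""
    if code_point < 0x80:
        # 0xxxxxxx: up to 7 bits, one byte (same as ASCII).
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: up to 11 bits, two bytes.
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: up to 16 bits, three bytes.
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: up to 21 bits, four bytes.
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


for cp in (0x41, 0xC3, 0xE9, 0x0301, 0x20AC, 0x13000):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{b:02X}" for b in encoded))
    # Cross-check against Python's own UTF-8 encoder.
    assert encoded == chr(cp).encode("utf-8")
```

For 0xC3 this gives C3 83, and for 0301 it gives the CC 81 shown by the CGI tool: the low six bits go into the second byte and the remaining bits into the first.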

Frank Ralf

11:04 a.m.

...

Is that it?

You might find the most thorough (and reliable) explanation for your questions at http://www.unicode.org Cheers Frank

Platonides

1:20 p.m.

...

Is that it?

Yes.

lmhelp2

1:32 p.m.

Thanks.

All the best,
-- Lmhelp

Raymond Wan

10:45 p.m.

Hi all, Platonides wrote:

...

I'm not a UTF expert, but a minor point is that East Asian languages (Japanese and Chinese) fit into the "three byte" region (I think). I think their entire alphabet is in the 3-byte region. On the other hand, the other non-Unicode encodings (Shift-JIS, EUC-JP, GB*, ISO-2022) use exactly two bytes. So, by using UTF-8, the text increases by 50%.

I can't speak for both countries -- only the very small part I'm aware of -- but many e-mail programs and web pages still seem to use two-byte encodings (which probably include ASCII as a subset). I feel that UTF-8 isn't catching on very fast here, but (1) I don't know if that's true in other countries and (2) I don't know if this 50% increase in size is the show-stopper...

Ray

(Someone feel free to correct me if I'm wrong...)
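The size difference can be measured for a couple of short strings with Python's standard codecs. This is just an illustration: the sample strings are chosen for the example, and they contain only common characters, so they say nothing about rarer characters or about mixed ASCII/CJK text.

```python
# Short sample strings: "Japanese language" and "Chinese language".
japanese = "日本語"
chinese = "中文"

# Byte counts for each string under UTF-8 and under legacy encodings.
ja_sizes = {
    name: len(japanese.encode(name))
    for name in ("utf-8", "shift_jis", "euc_jp")
}
zh_sizes = {
    name: len(chinese.encode(name))
    for name in ("utf-8", "gb2312")
}

print(ja_sizes)
print(zh_sizes)

# Relative growth when moving from a two-byte encoding to UTF-8.
growth = ja_sizes["utf-8"] / ja_sizes["shift_jis"] - 1
print(f"{growth:.0%}")
```

For these samples each character takes three bytes in UTF-8 and two in the legacy encodings, which is where the 50% figure comes from.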

Chad

24 Mar 2010

12:02 p.m.

On Wed, Mar 24, 2010 at 10:25 AM, Frank Ralf <Frank.Ralf(a)gmx.net> wrote:

...

Here is what I have been experimenting.

For converting text files to UTF-8 encoding Microsoft's own Notepad is very reliable in my experience (even if other editors claim to do the same). hth Frank

Except that it silently adds the BOM, which breaks a lot of things. -Chad
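To see what the BOM looks like at the byte level, here is a small Python sketch. It uses Python's `utf-8-sig` codec to imitate a file saved with a BOM; that this matches what Notepad writes is an assumption of the example, not something checked here.

```python
import codecs

# "utf-8-sig" prepends the three-byte UTF-8 BOM (EF BB BF) when encoding.
with_bom = "é".encode("utf-8-sig")
print(" ".join(f"{b:02X}" for b in with_bom))

# A reader that expects plain UTF-8 keeps the BOM as an invisible
# U+FEFF character at the start of the text.
naive = with_bom.decode("utf-8")

# A BOM-aware reader strips it.
clean = with_bom.decode("utf-8-sig")

print(repr(naive))
print(repr(clean))
print(with_bom.startswith(codecs.BOM_UTF8))
```

The stray U+FEFF at the start of the naively decoded text is the kind of thing that trips up tools expecting a file to begin with a specific character.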

Frank Ralf

12:42 p.m.

...

Except that it silently adds the BOM, which breaks a lot of things.

Well, yes ... but others like Notepad++ even break my umlauts in the first place ;-) So for real Unicode work use a special Unicode text editor like SC UniPad (http://www.unipad.org). BTW I have written some more on the hassles of UTF-8 and BOM at http://docs.moodle.org/en/UTF-8_and_BOM Frank
