lmhelp2 wrote:
----------------------------------------------------------------------
Hi Alexis,
Thank you, I hadn't realized...
and "Platonides"'s post explains why...!
----------------------------------------------------------------------
Hi Platonides,
Thanks a lot for your explanations and examples!
Line 1: "E t o i l é <space>"
Line 2: 0x45 0x74 0x6f 0x69 0x6c 0xe9 0x20
Line 3: 0x45 0x74 0x6f 0x69 0x6c 0xc3 0xa9 0x20
Do we say:
----- "Line 2" is the "iso-8859-1" representation of "Line
1"?
Yes.
----- "Line 3" is the "utf-8"
representation of "Line 1"?
Yes.
----- "Line 2" and "Line 3" are
made of codepoints?
Lines 2 and 3 are textual representations, in hex, of the bytes that Line
1 would be stored as in each encoding.
A codepoint is a number which corresponds to a character. The character
'capital A' has the codepoint 65 by convention. We could all have agreed
instead to give it the codepoint 1, or 25.
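If you want to check those byte sequences yourself, here is a minimal Python 3 sketch. The helper name hex_bytes is just something I made up for the illustration, and the string is Line 1 without the separating spaces.

```python
# Sketch: encode the same text two ways and show the resulting bytes in hex.
# "hex_bytes" is a hypothetical helper, not part of any library.

def hex_bytes(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

text = "Etoil\u00e9 "  # "Etoilé " with the é written as an escape

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

print(hex_bytes(latin1_bytes))  # 45 74 6f 69 6c e9 20
print(hex_bytes(utf8_bytes))    # 45 74 6f 69 6c c3 a9 20

# The codepoint of a character is just its assigned number.
print(ord("A"))  # 65
```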
Question: shouldn't we have 7 * 2
"codepoints" instead of 8?
Maybe you omitted them, didn't you?
We have 7 codepoints, one per "letter". Note that this is independent of
the encoding.
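A small Python 3 sketch of that point: the codepoint count stays at 7 while the byte count changes with the encoding. It assumes the é is the single precomposed character U+00E9 (an "e" followed by a combining accent would count as two codepoints).

```python
# Sketch: codepoints are a property of the text, bytes are a property of
# the encoding. Assumes a precomposed é (U+00E9).
text = "Etoil\u00e9 "

codepoints = [ord(ch) for ch in text]
print(len(codepoints))                        # 7
print([f"U+{cp:04X}" for cp in codepoints])   # U+0045 ... U+00E9, U+0020

# Same 7 codepoints, different number of bytes per encoding.
byte_counts = {
    enc: len(text.encode(enc))
    for enc in ("iso-8859-1", "utf-8", "utf-16-be")
}
print(byte_counts)  # {'iso-8859-1': 7, 'utf-8': 8, 'utf-16-be': 14}
```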
If you are wondering why utf-8 uses 8 bytes instead of 14 (as would have
been used by utf-16), that's the beauty of utf-8. It uses only one byte
(like ASCII) for basic letters; two for letters with diacritics, Greek,
Hebrew..., which are generally used less frequently; three for much less
frequent characters (like €); and four for really odd ones, like Egyptian
hieroglyphs.
So it is quite compact, while still covering the full Unicode range.
There are other representations, like UCS-4, that are easier to understand
(always four bytes per character) but waste a lot of space.
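To see the variable width in practice, here is a rough Python 3 sketch. The four sample characters are my own picks for illustration, and I use Python's "utf-32-be" codec as the four-bytes-per-character form.

```python
# Sketch: how many bytes utf-8 needs for characters from different ranges,
# compared with a fixed four-byte encoding. Sample characters are
# illustrative choices only.
samples = {
    "A": "\u0041",                        # basic Latin letter
    "e-acute": "\u00e9",                  # letter with a diacritic
    "euro sign": "\u20ac",                # €
    "Egyptian hieroglyph": "\U00013000",  # outside the first 65536 codepoints
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
utf32_lengths = {name: len(ch.encode("utf-32-be")) for name, ch in samples.items()}

for name in samples:
    print(f"{name}: utf-8 = {utf8_lengths[name]} bytes, "
          f"utf-32 = {utf32_lengths[name]} bytes")
```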
----- "Line 1" is made of characters?
Yes. But "character" is often taken as a synonym of "byte", which in this
thread it is not.
Let's consider:
Line 1: "E t o i l
é <space>"
Line 4: 0x00 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9
0x00 0x20
Line 5: 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00
0x20 0x00
----- Is "Line 4" the "utf-16 BE" representation of "Line
1"?
----- Is "Line 5" the "utf-16 LE" representation of "Line
1"?
Yes and yes.
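You can reproduce Lines 4 and 5 with a few lines of Python 3. This is only a sketch; note that I use the explicit "utf-16-be" and "utf-16-le" codecs, since plain "utf-16" in Python also prepends a byte order mark.

```python
# Sketch: the two byte orders of utf-16 for the same text.
text = "Etoil\u00e9 "  # "Etoilé "

utf16_be = " ".join(f"{b:02x}" for b in text.encode("utf-16-be"))
utf16_le = " ".join(f"{b:02x}" for b in text.encode("utf-16-le"))

print(utf16_be)  # 00 45 00 74 00 6f 00 69 00 6c 00 e9 00 20
print(utf16_le)  # 45 00 74 00 6f 00 69 00 6c 00 e9 00 20 00
```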
Can you tell me where to find the various tables which
allow one to find a given representation ("iso-8859-1",
"utf-8", "utf-16 BE", "utf-16 LE") for a given
"character"?
You may find this app useful:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi
I mean, how did you know that:
- 0xe9 is the "iso-8859-1" representation of é?
You indirectly told me
when mentioning the %E9 :)
- 0xc3 0xa9 is the "utf-8" representation of
é?
I ran echo é | hd in a utf-8 terminal (the trailing 0a in the output is
just the newline that echo adds).
- 0x00 0xe9 is the "utf-16 BE"
representation of é?
- 0xe9 0x00 is the "utf-16 LE" representation of é?
For codepoints below U+10000, utf-16 is simply the codepoint number stored
in two bytes. For characters like these, whose codepoint fits in one byte,
you end up with the hex code of the codepoint plus a null byte (high order
byte 0).
If you store the number in Big Endian, the high byte appears first;
in Little Endian it appears last.
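The same thing done by hand in Python 3, as a sketch: take the codepoint as a plain integer and write it out as two bytes in each byte order.

```python
# Sketch: utf-16 for a low codepoint is just the number written in two bytes.
codepoint = ord("\u00e9")  # é, codepoint 0xE9

big = codepoint.to_bytes(2, "big")        # high byte first
little = codepoint.to_bytes(2, "little")  # high byte last

print(big.hex())     # 00e9
print(little.hex())  # e900

# These match what the utf-16 codecs produce for the same character.
print(big == "\u00e9".encode("utf-16-be"))     # True
print(little == "\u00e9".encode("utf-16-le"))  # True
```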
UCS-2 keeps the codepoint in two bytes and simply stores it (in big
endian or little endian). Since that restricts the characters you can
use (what, I can't store Phoenician in ucs-2??), utf-16 reserves some
special values (the surrogates) and combines two of them, a surrogate
pair, taking four bytes instead of two, to cover the full Unicode range.
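For the curious, here is a sketch in Python 3 of how a surrogate pair is computed, using the standard utf-16 arithmetic. The function name to_surrogates is hypothetical, and U+10900 (a Phoenician letter) is just my example codepoint.

```python
# Sketch: split a codepoint above U+FFFF into a utf-16 surrogate pair.
# "to_surrogates" is a made-up name for this illustration.

def to_surrogates(codepoint: int) -> tuple[int, int]:
    """Return the (high, low) surrogates for a codepoint >= 0x10000."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

phoenician = 0x10900
high, low = to_surrogates(phoenician)
print(f"{high:04x} {low:04x}")  # d802 dd00

# Four bytes in utf-16, matching the pair computed above.
encoded = chr(phoenician).encode("utf-16-be")
print(encoded.hex())  # d802dd00
```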
(Apart from the fact that you are a super-pro :) of
course).
Hehe, thanks :)
Please tell me if I misunderstood something and
correct me if I
didn't use the proper terminology :) .