Paul Ebermann wrote:
"Brion VIBBER" skribis:
Note that _theoretically_ a legal UTF-8 sequence
could also be legal ISO 8859-1.
[eo] Cxu vere?
Mi pensis ke en la komenco de la dua duono de ISO-8859-1
estas kelkaj numeroj rezervitaj (kontrolaj kodoj) -
128 gxis 159, se mi memoras gxuste. Tiuj estas la bitokoj
de la formo 100xxxxx, kiuj ja povas aperi en UTF-8 (en
la dua aux sekvaj bitokoj de UTF-8-kodita signo).
Jes ja, sed ne cxiuj UTF-8-kodoj trovigxas en la gamo rezervita; se la
sekva(j) bitoko(j) formas laux 101xxxxx ili trovigxas en la gamo
160-191, kiu konsistigxas el diversaj punkciiloj kaj simboloj. Ekzemple:
Ã¡ -> á
0xC3 0xA1 -> 0x00E1
110(00011) 10(100001) -> 0000000011100001
Malofta bitokaro en latino-1, certe, sed lauxnorma.
[en] Really?
I thought that at the start of the second half of
ISO-8859-1 some numbers are reserved (control codes) -
128 to 159, if I remember correctly. Those are the octets
of the form 100xxxxx, which can occur in UTF-8 (as the
second or a later octet of a UTF-8 encoded character).
Sure, but not every UTF-8 sequence lands in the reserved range; tail
bytes of the form 101xxxxx fall in the 160-191 range, which is
populated by various punctuation marks and symbols. For instance:
Ã¡ -> á
0xC3 0xA1 -> 0x00E1
110(00011) 10(100001) -> 0000000011100001
Not a terribly likely sequence of bytes in Latin-1, but it's legal.
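
For anyone who wants to check this by hand, here is a small Python sketch of the example above. It is only an illustration: the helper name tail_byte_range is made up for this post, and it assumes Python's "latin-1" codec, which maps every byte value to the code point of the same number.

```python
# The two bytes from the example: the UTF-8 encoding of U+00E1.
data = bytes([0xC3, 0xA1])

as_utf8 = data.decode("utf-8")      # one character, U+00E1
as_latin1 = data.decode("latin-1")  # two characters, U+00C3 and U+00A1


def tail_byte_range(b):
    """Say which Latin-1 range a UTF-8 tail byte (10xxxxxx) falls in.

    Hypothetical helper for illustration only.
    """
    if not 0x80 <= b <= 0xBF:
        raise ValueError("not a UTF-8 tail byte")
    # 100xxxxx = 128-159 (control codes), 101xxxxx = 160-191 (symbols)
    return "control" if b < 0xA0 else "printable"


print(repr(as_utf8), repr(as_latin1), tail_byte_range(data[1]))
```

Both decodes succeed, which is the point: the same two bytes are legal in either encoding, so validity alone cannot tell them apart.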
-- brion vibber (brion @ pobox.com)