Paul Ebermann wrote:
"Brion VIBBER" wrote:
Note that _theoretically_ a legal UTF-8 sequence could also be legal ISO 8859-1.
[eo] Cxu vere? Mi pensis ke en la komenco de la dua duono de ISO-8859-1 estas kelkaj numeroj reservita (kontrola kodoj) - 128 gxis 159, se mi memoras gxuste. Tiuj estas la bitokoj de la formo 100xxxxx, kiuj ja povas aperi en UTF-8 (en la dua aux sekvaj bitokoj de UTF-8-kodita signo).
Jes ja, sed ne cxiuj UTF-8-kodoj trovigxas en la gamo rezervita; se la sekva(j) bitoko(j) formas laux 101xxxxx ili trovigxas en la gamo 160-191, kiu konsistigxas el diversaj punkciiloj kaj simboloj. Ekzemple:
Ã¡ -> á
0xC3 0xA1 -> 0x00E1
110(00011) 10(100001) -> 0000000011100001
Malofta bitokaro en latino-1, certe, sed lauxnorma.
[en] Really? I thought that the start of the second half of ISO-8859-1 has some reserved values (control codes): 128 to 159, if I remember correctly. Those are the octets of the form 100xxxxx, which can occur in UTF-8 (as the second or a following octet of a UTF-8 encoded character).
Sure, but not every UTF-8 sequence lands in the reserved range; if the tail byte(s) have the form 101xxxxx they fall in the 160-191 range, which is populated by various punctuation marks and symbols. For instance:
Ã¡ -> á
0xC3 0xA1 -> 0x00E1
110(00011) 10(100001) -> 0000000011100001
Not a terribly likely sequence of bytes in Latin-1, but it's legal.
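For anyone who wants to check this by hand, here is a rough Python sketch of the example above. It is only an illustration: the helper name has_c1_controls is made up for this post, and it assumes Python's "latin-1" codec is an acceptable stand-in for ISO 8859-1.

```python
# The two bytes from the example above.
data = bytes([0xC3, 0xA1])

# The same bytes read under each encoding.
as_utf8 = data.decode("utf-8")      # one character, U+00E1
as_latin1 = data.decode("latin-1")  # two characters, U+00C3 U+00A1

def has_c1_controls(raw: bytes) -> bool:
    """Hypothetical helper: True if any byte is in the reserved
    128-159 range (bit pattern 100xxxxx)."""
    return any(0x80 <= b <= 0x9F for b in raw)

# Decode the 2-byte sequence by hand: 110xxxxx 10yyyyyy -> xxxxxyyyyyy
lead, tail = data
code_point = ((lead & 0b00011111) << 6) | (tail & 0b00111111)

print(repr(as_utf8), repr(as_latin1), hex(code_point))
print("contains reserved control bytes:", has_c1_controls(data))
```

By contrast, a character whose tail byte has the form 100xxxxx (for example U+00C0, which encodes as 0xC3 0x80) does land in the reserved range, so has_c1_controls would flag it.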
-- brion vibber (brion @ pobox.com)