On Sun, 2003-01-12 at 07:07, Tomasz Wegrzanowski wrote:
On Sun, Jan 12, 2003 at 06:48:08AM -0800, Toby Bartels wrote:
As for the forbidden numerical character entities from &#128; to &#154;, we can interpret them as if they came from Micro$oft (most likely) and convert them to whatever they should be (by table). (If any other forbidden numerical entities have common nonstandard uses, then we can adopt those as well, as long as they translate to good Unicode.)
They translate to Unicode 128-154. Unicode 0-255 is identical with ISO-8859-1.
There are many other Unicode (and ISO-8859) characters that mean nothing, so this is not a problem.
Well, it's a problem when people trying to write the euro symbol, the French oe-ligature, or Slovene/Czech accented letters get mysterious high control characters instead of the characters that they legally typed in CP1252. (Their browser shouldn't have given them the option, since it was told to use ISO 8859-1, but that's not the users' fault.)
I'd rather silently do input conversion from CP1252 to UTF-8 (thus preserving those nasty Microsoft extensions as good Unicode characters), and output conversion to ISO 8859-1 to keep with standards.
There's no legitimate use of ISO 8859-1's or Unicode's 128-154 range that I know of except conceivably in terminal control. In plaintext on the web, they're 100% useless, so if they show up it's safe to assume they're really CP1252.
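To make the idea concrete, here is a rough sketch in Python of what the conversion might look like. This is not the wiki's actual code; the function names (repair_entities, decode_input, encode_output) are made up for illustration. It assumes only decimal references need handling, and it covers the whole CP1252 extension range 128-159 rather than stopping at 154. It leans on Python's built-in cp1252 codec as the lookup table.

```python
import re

# Matches decimal numeric character references such as &#128; (decimal only, by assumption).
_NCR = re.compile(r"&#(\d+);")


def _fix_ncr(match):
    code = int(match.group(1))
    if 128 <= code <= 159:
        try:
            # Reinterpret the number as a CP1252 byte, e.g. 128 -> U+20AC (euro sign).
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # CP1252 leaves a few bytes (e.g. 0x81) unassigned; keep the reference as typed.
            return match.group(0)
    return match.group(0)


def repair_entities(text):
    """Replace &#128;..&#159; with the characters CP1252 assigns to those numbers."""
    return _NCR.sub(_fix_ncr, text)


def decode_input(raw):
    """Decode submitted bytes as CP1252 rather than ISO 8859-1."""
    # errors="replace" turns the unassigned CP1252 bytes into U+FFFD.
    return raw.decode("cp1252", errors="replace")


def encode_output(text):
    """Encode for an ISO 8859-1 page; anything outside it becomes a &#NNNN; reference."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")
```

The string returned by decode_input can then be stored with .encode("utf-8"). On output, characters such as the euro sign, which ISO 8859-1 cannot represent, go out as legal numeric references (&#8364;) instead of raw high bytes.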
-- brion vibber (brion @ pobox.com)