[Breaking this thread off...]
On 12/28/08 1:32 AM, Niklas Laxström wrote:
> The anchors of non-latin headers are already (latin) gibberish: #.D0.A4.D0.B8.D0.BB.D1.8C.D0.BC.D0.BE.D0.B3.D1.80.D0.B0.D1.84.D0.B8.D1.8F
> It doesn't seem reasonable to think that people could create anchors in their head from text, except in special cases.
If we're going to stick with strict ASCII-limited anchors, it might be worth considering making them more legible, say with transliteration to ASCII Latin chars. :P
On the other hand, XHTML *doesn't* actually limit us this way!
The XHTML 1.0 recommendation of restriction to [A-Za-z][A-Za-z0-9:_.-]* is for compatibility with HTML 4.0, which defines:
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
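As a rough illustration of how narrow that HTML 4 rule is, here is a minimal Python sketch that checks a candidate anchor against the pattern quoted above. The names HTML4_ID_RE and is_html4_id are made up for this example and are not anything in MediaWiki.

```python
import re

# Pattern taken from the HTML 4 wording quoted above:
# a leading ASCII letter, then letters, digits, ':', '_', '.', '-'.
# HTML4_ID_RE and is_html4_id are hypothetical names for illustration.
HTML4_ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")

def is_html4_id(candidate: str) -> bool:
    """Return True if the whole string fits the HTML 4 ID/NAME token rule."""
    return HTML4_ID_RE.fullmatch(candidate) is not None

# A plain ASCII heading passes; a Cyrillic heading does not.
ascii_ok = is_html4_id("External_links")
cyrillic_ok = is_html4_id("Фильмография")
```

Under this rule any heading written in a non-Latin script is rejected outright, which is why some ASCII-only escaping is needed for HTML 4.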
XHTML specifies ID and NMTOKEN types here, which are *not* restricted to ASCII, but rather a large number of scripts:
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-NameChar
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Letter
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Digit
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Extender
If there are no major browser compatibility problems, I would probably recommend we roll back the nasty old .XX encoding, which is only there for HTML 4 compatibility, in which case we could quite legally produce something direct, such as:
http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...
which URL-encodes out to:
http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...
(which can be nicely displayed as pretty Unicode in the URL bar of modern browsers)
as opposed to the current:
http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...
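To make the three forms concrete, here is a small Python sketch of the difference between a direct Unicode anchor, its percent-encoded form, and the .XX form. It only approximates the behaviour described in this mail (UTF-8 bytes percent-encoded, then '%' swapped for '.'); it is not MediaWiki's actual escaping code, and the function names are hypothetical.

```python
from urllib.parse import quote

def percent_encoded_anchor(heading: str) -> str:
    """Percent-encode the UTF-8 bytes of a heading (spaces become '_')."""
    return quote(heading.replace(" ", "_"), safe=":")

def dot_encoded_anchor(heading: str) -> str:
    """Approximate the legacy .XX form: same bytes, '%' replaced by '.'."""
    return percent_encoded_anchor(heading).replace("%", ".")

heading = "Фильмография"
direct = "#" + heading                           # what a modern URL bar can show
encoded = "#" + percent_encoded_anchor(heading)  # what goes over the wire
legacy = "#" + dot_encoded_anchor(heading)
# legacy == "#.D0.A4.D0.B8.D0.BB.D1.8C.D0.BC.D0.BE.D0.B3.D1.80.D0.B0.D1.84.D0.B8.D1.8F"
```

The legacy value is the same gibberish anchor quoted at the top of this mail; a browser can turn the percent-encoded form back into readable text, but it has no reason to do that for the dotted form.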
-- brion