[Breaking this thread off...]
On 12/28/08 1:32 AM, Niklas Laxström wrote:
> The anchors of non-latin headers are already (latin)
> gibberish:
> #.D0.A4.D0.B8.D0.BB.D1.8C.D0.BC.D0.BE.D0.B3.D1.80.D0.B0.D1.84.D0.B8.D1.8F
> It doesn't seem reasonable to think that people could create anchors
> in their head from text, except in special cases.
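
For reference, the anchor quoted above looks like the heading's UTF-8
bytes percent-encoded, with "%" swapped for ".". A minimal sketch of
that mapping follows; it is an illustration, not MediaWiki's actual
code, and the function names are made up for this example:

```python
import re
from urllib.parse import quote, unquote


def legacy_anchor(heading: str) -> str:
    """Percent-encode the heading's UTF-8 bytes, then swap '%' for '.'."""
    # Assumption: spaces become underscores before encoding, and ':' is
    # left alone; everything else outside unreserved ASCII is encoded.
    encoded = quote(heading.replace(" ", "_"), safe=":")
    return encoded.replace("%", ".")


def decode_legacy_anchor(anchor: str) -> str:
    """Best-effort inverse of legacy_anchor().

    Ambiguous if the original heading itself contained a literal '.'
    followed by two uppercase hex digits.
    """
    return unquote(re.sub(r"\.([0-9A-F]{2})", r"%\1", anchor))


print(legacy_anchor("Фильмография"))
print(decode_legacy_anchor(legacy_anchor("Фильмография")))
```

Nobody is going to do the first function in their head, which is the
point being made.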
If we're going to stick with strict ASCII-limited anchors, it might be
worth considering making them more legible, say with transliteration to
ASCII Latin chars. :P
On the other hand, XHTML *doesn't* actually limit us this way!
XHTML 1.0's recommendation to restrict IDs to [A-Za-z][A-Za-z0-9:_.-]*
is there for compatibility with HTML 4.0, which defines:
  ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
  followed by any number of letters, digits ([0-9]), hyphens ("-"),
  underscores ("_"), colons (":"), and periods (".").
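
The HTML 4 rule is simple enough to check mechanically. A small sketch,
using the pattern given above (the helper name is just for
illustration):

```python
import re

# The HTML 4 ID/NAME token rule: an ASCII letter, then any number of
# ASCII letters, digits, colons, underscores, periods, or hyphens.
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_html4_id(token: str) -> bool:
    """Return True if the whole token matches the HTML 4 ID/NAME rule."""
    return HTML4_ID.fullmatch(token) is not None


for candidate in ("Filmography", "See_also", "Уплисцихе", ".D0.A4"):
    print(candidate, is_html4_id(candidate))
```

Under this rule any non-ASCII heading fails outright, and so does a
token that begins with a period rather than a letter.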
XHTML specifies the XML ID and NMTOKEN types here, which are *not*
restricted to ASCII but allow characters from a large number of scripts:
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-NameChar
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Letter
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Digit
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Extender
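
To get a rough feel for what those productions admit, here is a sketch
that approximates the XML Name rule using Python's Unicode character
classes. This is an assumption-laden stand-in: the real productions are
explicit code point ranges, so this check is somewhat more permissive
than the spec, and the function name is invented for the example:

```python
import unicodedata


def is_xml_name_approx(token: str) -> bool:
    """Approximate check against the XML 1.0 Name production.

    First character: a letter, '_' or ':'.
    Later characters: letters, digits, '.', '-', '_', ':' or
    combining marks. Not an exact implementation of the spec tables.
    """
    if not token:
        return False
    first, rest = token[0], token[1:]
    if not (first.isalpha() or first in "_:"):
        return False
    for ch in rest:
        if ch.isalnum() or ch in "._:-":
            continue
        # Combining marks (Unicode categories Mn, Mc, Me) are allowed
        # after the first character.
        if unicodedata.category(ch).startswith("M"):
            continue
        return False
    return True


for candidate in ("Уплисцихе_в_средневековье", "Filmography", "1abc", "a b"):
    print(candidate, is_xml_name_approx(candidate))
```

The takeaway is only that a Cyrillic heading with underscores passes
where the ASCII-only rule rejects it.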
If there are no major browser compatibility problems, I would probably
recommend we roll back the nasty old .XX encoding that we do for HTML 4
compatibility, in which case we could quite legally produce something
direct, such as:
http://ru.wikipedia.org/wiki/Уплисцихе#Уплисцихе_в_средневековье
which URL-encodes out to:
http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%…
(which can be nicely displayed as pretty Unicode in the URL bar of
modern browsers)
as opposed to the current:
http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%…
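
As a sanity check on the round trip, here is a minimal sketch of the
percent-encoding step using Python's standard library. The helper name
is made up for the example, and it assumes the page title and section
name are encoded as plain UTF-8:

```python
from urllib.parse import quote, unquote

BASE = "http://ru.wikipedia.org/wiki/"


def encode_section_url(page: str, section: str) -> str:
    """Percent-encode the UTF-8 bytes of the title and the fragment."""
    return BASE + quote(page) + "#" + quote(section)


url = encode_section_url("Уплисцихе", "Уплисцихе_в_средневековье")
print(url)
# Decoding gives back the readable form a browser can show in the URL bar.
print(unquote(url))
```

With a direct Unicode anchor the fragment survives the round trip
unchanged, which is what lets the browser display it legibly.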
-- brion