On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber <brion@wikimedia.org> wrote:
If we're going to stick with strict ASCII-limited anchors, it might be worth considering making them more legible, say with transliteration to ASCII Latin chars. :P
On the other hand, XHTML *doesn't* actually limit us this way!
The XHTML 1.0 recommendation of restriction to [A-Za-z][A-Za-z0-9:_.-]* is for compatibility with HTML 4.0, which defines:
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
XHTML specifies ID and NMTOKEN types here, which are *not* restricted to ASCII, but rather cover a large number of scripts:
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-NameChar
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Letter
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Digit
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Extender
This sounds like an excellent idea. I tried in IE5 (on ies4linux), Firefox 3, and Opera 9.something and all had no problem with this trivial test page:
http://www.twcenter.net/~simetrical/tests/unicode_anchor.html
The W3C validator is happy with it too.
Of course, we still *do* have to ensure that id's don't start with any of the following:
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
The ones specified as Unicode code points are all either combining characters or -- strangely -- the character · MIDDLE DOT (U+00B7).
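A rough sketch of that start-character check, in Python. The name has_valid_start is made up for illustration, and it tests only the characters listed above, not the full XML name grammar, so passing it does not by itself make a string a valid id:

```python
import re

# Characters listed above as not allowed at the start of an id:
# "-", ".", ASCII digits, U+00B7, U+0300-U+036F, U+203F-U+2040.
FORBIDDEN_START = re.compile(r"[-.0-9\u00B7\u0300-\u036F\u203F-\u2040]")


def has_valid_start(anchor: str) -> bool:
    """Return True if anchor is non-empty and its first character
    is not in the forbidden-start set above (hypothetical helper)."""
    return bool(anchor) and FORBIDDEN_START.match(anchor) is None
```

For example, has_valid_start("История") is true, while has_valid_start("2008_events") and has_valid_start("-draft") are both false.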
There are also still a bunch of characters that aren't allowed in id's, period -- I'd assume stuff like whitespace, some punctuation, and reserved characters, although I didn't look closely at the classes in question. And, of course, most ASCII punctuation is still not allowed. I guess we can keep up our dot-encoding for this -- although if so, we should encode dots as well, because currently the encoding is lossy, which is unnecessary. (Actually, you'd have to fix the "prepend x" solution too, since that adds more lossiness.)
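To illustrate why encoding the dot itself makes the scheme reversible, here is a minimal sketch in Python. It is not the existing MediaWiki code: encode_anchor, decode_anchor, and the SAFE_ASCII set are assumptions made for the example, non-ASCII characters are simply passed through without checking them against the XML name classes, and the "prepend x" case is not handled.

```python
import re
import string

# Assumption: ASCII characters allowed through unchanged. Note that "."
# is deliberately NOT in this set, so a literal dot becomes ".2E".
SAFE_ASCII = set(string.ascii_letters + string.digits + "_-:")


def encode_anchor(text: str) -> str:
    """Dot-encode every ASCII character outside SAFE_ASCII as .XX (hex)."""
    out = []
    for ch in text:
        if ch in SAFE_ASCII or ord(ch) > 127:
            out.append(ch)  # safe ASCII, or non-ASCII passed through as-is
        else:
            out.append(".%02X" % ord(ch))
    return "".join(out)


def decode_anchor(encoded: str) -> str:
    """Invert encode_anchor: every "." in the output starts a .XX escape."""
    return re.sub(r"\.([0-9A-F]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  encoded)
```

With the dot encoded, the heading "a b" becomes "a.20b" while the literal heading "a.20b" becomes "a.2E20b", so the two no longer collide and decode_anchor can recover either original.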