On 12/28/08 3:35 PM, Aryeh Gregor wrote:
On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber <brion@wikimedia.org> wrote:
XHTML specifies ID and NMTOKEN types here, which are *not* restricted to ASCII, but rather allow characters from a large number of scripts:
This sounds like an excellent idea. I tried in IE5 (on ies4linux), Firefox 3, and Opera 9.something and all had no problem with this trivial test page:
http://www.twcenter.net/~simetrical/tests/unicode_anchor.html
The W3C validator is happy with it too.
Woohoo!
Honestly I'm not sure why we went with the crappy ASCII encoding to begin with, other than spec rules-lawyering about the HTML 4 compatibility section. Possibly there was some issue with Netscape 4 or something back in the day?
Of course, we still *do* have to ensure that id's don't start with any of the following:
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
The ones specified as Unicode code points are all either combining characters or -- strangely -- the character · MIDDLE DOT.
Should be easy enough to do -- the exact ranges of allowed characters are specified, and a simple regex can strip out or tweak anything disallowed.
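As a rough illustration of the leading-character check, here is a minimal Python sketch. The character class is copied from the list quoted above; the function name and the "section" fallback are made up for the example and are not anything MediaWiki actually does.

```python
import re

# Characters that may appear inside an ID but not at its start,
# taken from the list quoted above.
_BAD_START = re.compile("^[-.0-9\u00B7\u0300-\u036F\u203F-\u2040]+")


def fix_id_start(candidate: str, fallback: str = "section") -> str:
    """Strip disallowed leading characters from a candidate ID.

    The fallback value is hypothetical; it only keeps the result non-empty.
    """
    stripped = _BAD_START.sub("", candidate)
    return stripped or fallback
```

This only covers the first character. The rest of the string would still need a pass against the full set of allowed name characters.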
There are also still a bunch of characters that aren't allowed in id's, period -- I'd assume stuff like whitespace, some punctuation, and reserved characters, although I didn't look closely at the classes in question. And, of course, most ASCII punctuation is still not allowed. I guess we can keep up our dot-encoding for this -- although if so, we should encode dots as well, because currently the encoding is lossy, which is unnecessary.
There's no real need to encode these IMHO; in nearly all scenarios it would be more readable to strip them, just like we strip markup. Lossiness isn't a problem as long as the result is useful and legible. (Note we already have to handle uniqueness by appending a number for duplicate section header names, so stripping characters from the originals doesn't create a new problem there.)
For instance right now this section header: == Broken Template in "[[Annapolis]]" ==
gives us this encoded fragment ID: #Broken_Template_in_.22Annapolis.22
I'd rather just see this: #Broken_Template_in_Annapolis
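To make the comparison concrete, here is a small Python sketch of the two approaches, plus the number-appending step for duplicates. It is only an approximation: the function names are invented for the example, the "_2" suffix format is an assumption, and the input is taken to be the header text after link markup has already been stripped.

```python
import re
from urllib.parse import quote

# ASCII punctuation other than "-", ".", ":" and "_".
_ASCII_PUNCT = re.compile(r"[\x21-\x2C\x2F\x3B-\x40\x5B-\x5E\x60\x7B-\x7E]")


def dot_encode(heading: str) -> str:
    """Approximate the current scheme: percent-encode, then use '.' for '%'."""
    return quote(heading.replace(" ", "_"), safe=":").replace("%", ".")


def strip_encode(heading: str) -> str:
    """Alternative: drop disallowed ASCII punctuation outright."""
    return _ASCII_PUNCT.sub("", heading.replace(" ", "_"))


def dedupe(ids):
    """Append _2, _3, ... to repeated IDs (suffix format is an assumption)."""
    seen = {}
    result = []
    for ident in ids:
        seen[ident] = seen.get(ident, 0) + 1
        count = seen[ident]
        result.append(ident if count == 1 else f"{ident}_{count}")
    return result


heading = 'Broken Template in "Annapolis"'  # link markup already stripped
print(dot_encode(heading))    # Broken_Template_in_.22Annapolis.22
print(strip_encode(heading))  # Broken_Template_in_Annapolis
print(dedupe([strip_encode(h) for h in ('"Foo"', "Foo")]))  # ['Foo', 'Foo_2']
```

The last line shows the collision case: two headers that differ only in punctuation end up with the same stripped ID, and the existing duplicate handling separates them. The sketch does not guard against a generated "Foo_2" clashing with a header literally named "Foo 2".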
-- brion