On 12/28/08 3:35 PM, Aryeh Gregor wrote:
On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber <brion(a)wikimedia.org> wrote:
XHTML specifies ID and NMTOKEN types here, which are *not* restricted
to ASCII, but rather allow a large number of scripts:
This sounds like an excellent idea. I tried in IE5 (on ies4linux),
Firefox 3, and Opera 9.something and all had no problem with this
trivial test page:
http://www.twcenter.net/~simetrical/tests/unicode_anchor.html
The W3C validator is happy with it too.
Woohoo!
Honestly I'm not sure why we went with the crappy ASCII encoding to
begin with other than spec rules lawyering about the HTML 4
compatibility section. Possibly there was some issue with Netscape 4 or
something back in the day?
Of course, we still *do* have to ensure that id's don't start with any
of the following:
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
The ones specified as Unicode code points are all either combining
characters or -- strangely -- the character · MIDDLE DOT.
Should be easy enough to do -- the exact ranges of allowed characters
are specified, and a simple regex can strip out or tweak anything
disallowed.
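
A rough sketch of what such a regex could look like, in Python purely for
illustration. The ranges are the NameStartChar and NameChar productions as I
read them in XML 1.0 (fifth edition); the name sanitize_id is made up for
this example and is not an existing MediaWiki function. It strips disallowed
characters rather than tweaking them.

```python
import re

# NameStartChar ranges, taken from the XML 1.0 (fifth edition) Name production.
NAME_START = (r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
              r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F"
              r"\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
              r"\U00010000-\U000EFFFF")
# Characters allowed anywhere except the first position:
# "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
NAME_EXTRA = r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

DISALLOWED = re.compile(f"[^{NAME_START}{NAME_EXTRA}]")
BAD_LEADING = re.compile(f"^[^{NAME_START}]+")


def sanitize_id(text):
    """Hypothetical helper: reduce text to a string matching XML Name."""
    # Spaces become underscores, as in existing section anchors.
    cleaned = DISALLOWED.sub("", text.replace(" ", "_"))
    # Drop leading characters that are valid in a name but not at its start.
    return BAD_LEADING.sub("", cleaned)
```

Under these assumptions sanitize_id('-.foo') gives 'foo', and a leading
combining character or MIDDLE DOT is dropped while the same character in the
middle of the id is kept. The result can be empty if everything is stripped,
so a caller would still need a fallback for that case.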
There are also still a bunch of characters that aren't allowed in id's
period -- I'd assume stuff like whitespace, some punctuation, and
reserved characters, although I didn't look closely at the classes in
question. And, of course, most ASCII punctuation is still not
allowed. I guess we can keep up our dot-encoding for this -- although
if so, we should encode dots as well, because currently the encoding
is lossy, which is unnecessary.
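
To make the lossiness concrete, here is a small sketch. It assumes the
current scheme amounts to percent-encoding the UTF-8 bytes and swapping "%"
for ".", which matches the .22 output for a double quote but is otherwise a
guess at the real code. The function name and the encode_dots flag are made
up for this example.

```python
from urllib.parse import quote


def dot_encode(text, encode_dots=False):
    """Hypothetical model of the dot-encoding: percent-encode, then % -> ."""
    encoded = quote(text.replace(" ", "_"), safe="")
    if encode_dots:
        # quote() leaves literal dots alone, so encode them explicitly.
        encoded = encoded.replace(".", "%2E")
    return encoded.replace("%", ".")


# A double quote and a literal ".22" collide when dots are not encoded.
print(dot_encode('a"b'), dot_encode("a.22b"))
# With dots encoded the two inputs stay distinct.
print(dot_encode('a"b', True), dot_encode("a.22b", True))
```

With encode_dots left off, both inputs come out as a.22b. With it on, the
literal dot becomes .2E, so every "." in the output introduces an encoded
byte and the original can be recovered.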
There's no real need to encode these IMHO; in nearly all scenarios it
would be more readable to strip them, just like we strip markup.
Lossiness isn't a problem as long as the result is useful and legible.
(Note we already have to handle uniqueness by appending a number for
duplicate section header names, so stripping characters from the
originals doesn't create a new problem there.)
For instance right now this section header:
== Broken Template in "[[Annapolis]]" ==
gives us this encoded fragment ID:
#Broken_Template_in_.22Annapolis.22
I'd rather just see this:
#Broken_Template_in_Annapolis
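
A minimal sketch of the stripping approach, in Python for illustration only.
Both function names are made up, the link regex handles only simple
[[target]] and [[target|label]] forms, and the "_2", "_3" suffix format for
duplicates is an assumption about how the existing numbering looks.

```python
import re


def strip_to_fragment(heading):
    """Hypothetical helper: strip markup and punctuation from a heading."""
    # [[target]] -> target, [[target|label]] -> label
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", heading)
    # Remove the == ... == heading markers.
    text = text.strip().strip("=").strip()
    # \w is a crude stand-in for the allowed character classes; it also
    # drops combining marks, so a real version would use the XML ranges.
    text = re.sub(r"[^\w\s-]", "", text)
    return re.sub(r"\s+", "_", text.strip())


def unique_fragments(headings):
    """Append a counter to repeated fragments so every id stays unique."""
    seen = {}
    result = []
    for heading in headings:
        base = strip_to_fragment(heading)
        seen[base] = seen.get(base, 0) + 1
        result.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return result
```

For the header above, strip_to_fragment returns Broken_Template_in_Annapolis.
A second header that strips down to the same text just picks up a number
through the same path that already handles duplicate section names.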
-- brion