Re: [Wikitech-l] Section anchor encoding

29 Dec 2008


On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber brion@wikimedia.org wrote:
...
If we're going to stick with strict ASCII-limited anchors, it might be
worth considering making them more legible, say with transliteration to
ASCII Latin chars. :P
On the other hand, XHTML *doesn't* actually limit us this way!
The XHTML 1.0 recommendation of restriction to [A-Za-z][A-Za-z0-9:_.-]*
is for compatibility with HTML 4.0, which defines:
ID and NAME tokens must begin with a letter ([A-Za-z]) and may be
  followed by any number of letters, digits ([0-9]), hyphens ("-"),
  underscores ("_"), colons (":"), and periods (".").
XHTML specifies ID and NMTOKEN types here, which are *not* restricted
to ASCII, but rather a large number of scripts:
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-NameChar
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Letter
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Digit
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Extender
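To make the HTML 4 restriction quoted above concrete, here is a minimal Python sketch of the [A-Za-z][A-Za-z0-9:_.-]* rule. The function name is_html4_id is made up for illustration; it only checks the ASCII-limited HTML 4 pattern and says nothing about the wider XML Name rules.

```python
import re

# The HTML 4 ID/NAME token pattern as quoted in the message above.
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")

def is_html4_id(anchor: str) -> bool:
    """Return True if the whole anchor matches the HTML 4 token pattern."""
    return HTML4_ID.fullmatch(anchor) is not None
```

Under this rule an anchor such as "Überschrift" is rejected purely because of its first letter, which is the legibility problem the message is describing.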
This sounds like an excellent idea.  I tried in IE5 (on ies4linux),
Firefox 3, and Opera 9.something and all had no problem with this
trivial test page:
http://www.twcenter.net/~simetrical/tests/unicode_anchor.html
The W3C validator is happy with it too.
Of course, we still *do* have to ensure that id's don't start with any
of the following:
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
The ones specified as Unicode code points are all either combining
characters or -- strangely -- the character · MIDDLE DOT.
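As a rough illustration, the excluded-start set listed above can be checked like this in Python. The name is_forbidden_id_start is hypothetical, and the sketch only tests membership in that quoted set; it does not verify that the character is otherwise allowed in an id.

```python
def is_forbidden_id_start(ch: str) -> bool:
    """Return True if the single character ch is in the excluded-start set
    quoted above: "-", ".", ASCII digits, U+00B7, U+0300..U+036F,
    U+203F..U+2040."""
    cp = ord(ch)
    return (
        ch in "-."
        or "0" <= ch <= "9"
        or cp == 0xB7
        or 0x0300 <= cp <= 0x036F
        or 0x203F <= cp <= 0x2040
    )
```

A generated id whose first character fails this check would need some prefix or escape before it could be emitted.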
There are also still a bunch of characters that aren't allowed in id's
period -- I'd assume stuff like whitespace, some punctuation, and
reserved characters, although I didn't look closely at the classes in
question.  And, of course, most ASCII punctuation is still not
allowed.  I guess we can keep up our dot-encoding for this -- although
if so, we should encode dots as well, because currently the encoding
is lossy, which is unnecessary.  (Actually, you'd have to fix the
"prepend x" solution too, that adds more lossiness.)
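The lossiness point can be shown with a small Python sketch. It assumes, for illustration only, that "dot-encoding" means writing each disallowed UTF-8 byte as "." plus two hex digits; the function name dot_encode, the escape_dot flag, and the exact set of pass-through characters are hypothetical and not taken from the MediaWiki source.

```python
import string

# Assumed pass-through set: ASCII letters, digits, and "-", "_", ":".
SAFE = set(string.ascii_letters + string.digits + "-_:")

def dot_encode(text: str, escape_dot: bool) -> str:
    """Encode text byte by byte, writing unsafe bytes as .XX (hex).

    With escape_dot=False a literal "." passes through unchanged, so it
    cannot be told apart from the start of an escape sequence.
    """
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in SAFE or (ch == "." and not escape_dot):
            out.append(ch)
        else:
            out.append(".%02X" % byte)
    return "".join(out)
```

Under these assumptions, the headings "A," and "A.2C" both come out as "A.2C" when dots pass through, whereas escaping the dot as well gives "A.2C" and "A.2E2C", which can be decoded unambiguously.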
