Re: [Wikitech-l] Section anchor encoding

30 Dec 2008


On 12/28/08 3:35 PM, Aryeh Gregor wrote:
...
On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber <brion@wikimedia.org> wrote:
...
XHTML specifies ID and NMTOKEN types here, which are *not* restricted
to ASCII, but rather a large number of scripts:
This sounds like an excellent idea.  I tried in IE5 (on ies4linux),
Firefox 3, and Opera 9.something and all had no problem with this
trivial test page:
http://www.twcenter.net/~simetrical/tests/unicode_anchor.html
The W3C validator is happy with it too.
Woohoo!
Honestly I'm not sure why we went with the crappy ASCII encoding to 
begin with other than spec rules lawyering about the HTML 4 
compatibility section. Possibly there was some issue with Netscape 4 or 
something back in the day?
...
Of course, we still *do* have to ensure that id's don't start with any
of the following:
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
The ones specified as Unicode code points are all either combining
characters or -- strangely -- the character · MIDDLE DOT.
Should be easy enough to do -- the exact ranges of allowed characters 
are specified, and a simple regex can strip out or tweak anything 
disallowed.
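
A minimal sketch of the "simple regex" idea, assuming the only goal is to strip the leading characters listed above. The function name strip_invalid_start is hypothetical, and this does not check the rest of the id against the full set of allowed name characters:

```python
import re

# Characters an id may not start with, as listed in the message above:
# "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
INVALID_START = re.compile(r"^[\-.0-9\u00B7\u0300-\u036F\u203F-\u2040]+")

def strip_invalid_start(anchor: str) -> str:
    """Drop any leading run of characters that may not begin an id."""
    return INVALID_START.sub("", anchor)

print(strip_invalid_start("-.9abc"))
print(strip_invalid_start("\u00b7dot"))
```

Stripping is only one option. Prefixing a permitted character instead would be the "tweak" variant and would keep leading digits visible.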
...
There are also still a bunch of characters that aren't allowed in id's
period -- I'd assume stuff like whitespace, some punctuation, and
reserved characters, although I didn't look closely at the classes in
question.  And, of course, most ASCII punctuation is still not
allowed.  I guess we can keep up our dot-encoding for this -- although
if so, we should encode dots as well, because currently the encoding
is lossy, which is unnecessary.
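
To illustrate the lossiness point, here is a rough approximation of the dot-encoding, assuming it amounts to percent-encoding with "%" swapped for ".". The function name dot_encode is hypothetical and this is not the actual MediaWiki code:

```python
from urllib.parse import quote

def dot_encode(heading: str) -> str:
    """Approximate dot-encoding: spaces to underscores, percent-encode, % to '.'."""
    escaped = quote(heading.replace(" ", "_"), safe="")
    return escaped.replace("%", ".")

# A literal dot passes through unescaped, so two different headings
# can end up with the same anchor.
quoted = dot_encode('Broken Template in "Annapolis"')
literal = dot_encode("Broken Template in .22Annapolis.22")
print(quoted)
print(quoted == literal)
```

Encoding the dot itself would make the mapping reversible, at the cost of even less readable anchors.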
There's no real need to encode these IMHO; in nearly all scenarios it 
would be more readable to strip them, just like we strip markup. 
Lossiness isn't a problem as long as the result is useful and legible. 
(Note we already have to handle uniqueness by appending a number for 
duplicate section header names, so stripping characters from the 
originals doesn't create a new problem there.)
For instance right now this section header:
== Broken Template in "[[Annapolis]]" ==
gives us this encoded fragment ID:
#Broken_Template_in_.22Annapolis.22
I'd rather just see this:
#Broken_Template_in_Annapolis
-- brion
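
The stripping approach proposed above could look roughly like the sketch below. The names strip_anchor and unique_anchors, the set of characters kept, and the "_2" suffix format are all assumptions for illustration. Markup handling is deliberately simplistic: it only drops the bracket characters, so piped links are not resolved.

```python
import re

def strip_anchor(heading: str) -> str:
    """Turn a '== Heading ==' line into an anchor by stripping, not encoding."""
    text = heading.strip().strip("=").strip()
    # Keep Unicode word characters, whitespace, "." and "-"; drop the rest
    # (quotes, link brackets, other punctuation).
    text = re.sub(r"[^\w\s.\-]", "", text)
    return re.sub(r"\s+", "_", text.strip())

def unique_anchors(headings):
    """Append a number to repeated anchors so each one stays unique."""
    seen = {}
    result = []
    for heading in headings:
        base = strip_anchor(heading)
        seen[base] = seen.get(base, 0) + 1
        result.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return result

print(unique_anchors([
    '== Broken Template in "[[Annapolis]]" ==',
    "== Broken Template in Annapolis ==",
]))
```

Two headings that differ only in stripped punctuation collide on the same base anchor, and the numbering step that already exists for duplicate headers resolves that case as well.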
