Section anchor encoding

Brion Vibber

28 Dec 2008

11:57 p.m.

[Breaking this thread off...]

On 12/28/08 1:32 AM, Niklas Laxström wrote:

...

The anchors of non-latin headers are already (latin) gibberish: #.D0.A4.D0.B8.D0.BB.D1.8C.D0.BC.D0.BE.D0.B3.D1.80.D0.B0.D1.84.D0.B8.D1.8F

It doesn't seem reasonable to think that people could create anchors in their head from text, except in special cases.
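
For reference, the dot-encoding behind the anchor quoted above can be reproduced roughly as follows. This is a Python sketch rather than MediaWiki's own PHP; the function names are hypothetical, and edge cases (for example how "~" is treated) may differ from the real code. The idea is simply: percent-encode the UTF-8 bytes, then write "." where "%" would be.

```python
from urllib.parse import quote, unquote


def legacy_anchor(heading: str) -> str:
    """Approximate the old scheme: spaces to '_', percent-encode, '%' -> '.'."""
    encoded = quote(heading.replace(" ", "_"), safe=":")
    return encoded.replace("%", ".")


def decode_legacy_anchor(anchor: str) -> str:
    """Best-effort inverse; ambiguous if the heading itself contained dots."""
    return unquote(anchor.replace(".", "%"))


print(legacy_anchor("Фильмография"))
print(decode_legacy_anchor(legacy_anchor("Фильмография")))
```

Decoding the anchor from the message above this way yields the Russian word "Фильмография", which shows how far the encoded form is from anything a person could type from memory.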

If we're going to stick with strict ASCII-limited anchors, it might be worth considering making them more legible, say with transliteration to ASCII Latin chars. :P

On the other hand, XHTML *doesn't* actually limit us this way!

The XHTML 1.0 recommendation of restriction to [A-Za-z][A-Za-z0-9:_.-]* is for compatibility with HTML 4.0, which defines:

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
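
To make the HTML 4 rule concrete, here is a small Python check built directly from the pattern quoted above. The helper name is made up for illustration.

```python
import re

# The HTML 4.0 ID/NAME token rule: one ASCII letter, then letters, digits,
# hyphens, underscores, colons, and periods.
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_html4_id(token: str) -> bool:
    """Return True if the whole token satisfies the HTML 4.0 rule."""
    return HTML4_ID.fullmatch(token) is not None


for candidate in ["Broken_Template_in_.22Annapolis.22", "Фильмография", "2008"]:
    print(candidate, is_html4_id(candidate))
```

Under this rule a raw Cyrillic heading is rejected outright, which is the restriction the .XX encoding works around.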

XHTML specifies ID and NMTOKEN types here, which are *not* restricted to ASCII, but rather a large number of scripts:

http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-NameChar

http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Letter
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Digit
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Extender

If there are no major browser compatibility problems, I would probably recommend we roll back the nasty old .XX encoding for HTML 4 compatibility, in which case we could quite legally produce something direct, such as:

http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...

which URL-encodes out to:

http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...

(which can be nicely displayed as pretty Unicode in the URL bar of modern browsers)

as opposed to the current:

http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...

-- brion

Aryeh Gregor

29 Dec 2008

12:35 a.m.

On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber brion@wikimedia.org wrote:

...

If we're going to stick with strict ASCII-limited anchors, it might be worth considering making them more legible, say with transliteration to ASCII Latin chars. :P

On the other hand, XHTML *doesn't* actually limit us this way!

The XHTML 1.0 recommendation of restriction to [A-Za-z][A-Za-z0-9:_.-]* is for compatibility with HTML 4.0, which defines:

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

XHTML specifies ID and NMTOKEN types here, which are *not* restricted to ASCII, but rather a large number of scripts:

http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-NameChar

http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Letter
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Digit
http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Extender

This sounds like an excellent idea. I tried in IE5 (on ies4linux), Firefox 3, and Opera 9.something and all had no problem with this trivial test page:

http://www.twcenter.net/~simetrical/tests/unicode_anchor.html

The W3C validator is happy with it too.

Of course, we still *do* have to ensure that id's don't start with any of the following:

"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

The ones specified as Unicode code points are all either combining characters or -- strangely -- the character · MIDDLE DOT.
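
The start-character restriction listed above translates almost literally into a regular expression. A Python sketch, with a hypothetical helper name, covering only the six items in that list:

```python
import re

# Characters that may appear inside an id but not as its first character:
# "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
BAD_START = re.compile("[-.0-9\u00B7\u0300-\u036F\u203F-\u2040]")


def has_bad_start(anchor: str) -> bool:
    """Return True if the first character is one that cannot begin an id."""
    return BAD_START.match(anchor) is not None


for candidate in ["2008_results", ".NET", "\u00B7dot", "Фильмография"]:
    print(repr(candidate), has_bad_start(candidate))
```

This only flags the problem; what to do about it (prefix a character, or strip the offending one) is a separate choice.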

There are also still a bunch of characters that aren't allowed in id's, period -- I'd assume stuff like whitespace, some punctuation, and reserved characters, although I didn't look closely at the classes in question. And, of course, most ASCII punctuation is still not allowed. I guess we can keep up our dot-encoding for this -- although if so, we should encode dots as well, because currently the encoding is lossy, which is unnecessary. (Actually, you'd have to fix the "prepend x" solution too, that adds more lossiness.)
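
The lossiness is easy to demonstrate. In the Python approximation below (a sketch with a hypothetical function name, not the actual MediaWiki code), a heading containing a quotation mark and a heading containing the literal text ".22" produce the same anchor, because a dot in the input is passed through unchanged.

```python
from urllib.parse import quote


def legacy_anchor(heading: str) -> str:
    """Approximate dot-encoding: percent-encode UTF-8, then '%' -> '.'."""
    return quote(heading.replace(" ", "_"), safe=":").replace("%", ".")


with_quotes = legacy_anchor('Say "hi"')
with_dots = legacy_anchor("Say .22hi.22")

# Both headings collapse to the same anchor, so the mapping is not reversible.
print(with_quotes, with_dots, with_quotes == with_dots)
```

Encoding "." itself (as ".2E") would remove the ambiguity, at the cost of making ordinary headings with dots less readable.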

Brion Vibber

7:11 p.m.

On 12/28/08 3:35 PM, Aryeh Gregor wrote:

...

On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber brion@wikimedia.org wrote:

...
XHTML specifies ID and NMTOKEN types here, which are *not* restricted to ASCII, but rather a large number of scripts:

This sounds like an excellent idea. I tried in IE5 (on ies4linux), Firefox 3, and Opera 9.something and all had no problem with this trivial test page:

http://www.twcenter.net/~simetrical/tests/unicode_anchor.html

The W3C validator is happy with it too.

Woohoo!

Honestly I'm not sure why we went with the crappy ASCII encoding to begin with other than spec rules lawyering about the HTML 4 compatibility section. Possibly there was some issue with Netscape 4 or something back in the day?

...

Of course, we still *do* have to ensure that id's don't start with any of the following:

"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

The ones specified as Unicode code points are all either combining characters or -- strangely -- the character · MIDDLE DOT.

Should be easy enough to do -- the exact ranges of allowed characters are specified, and a simple regex can strip out or tweak anything disallowed.

...

There are also still a bunch of characters that aren't allowed in id's, period -- I'd assume stuff like whitespace, some punctuation, and reserved characters, although I didn't look closely at the classes in question. And, of course, most ASCII punctuation is still not allowed. I guess we can keep up our dot-encoding for this -- although if so, we should encode dots as well, because currently the encoding is lossy, which is unnecessary.

There's no real need to encode these IMHO; in nearly all scenarios it would be more readable to strip them, just like we strip markup. Lossiness isn't a problem as long as the result is useful and legible. (Note we already have to handle uniqueness by appending a number for duplicate section header names, so stripping characters from the originals doesn't create a new problem there.)
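
As an illustration of why collisions from stripping are not a new problem, here is a Python sketch of duplicate handling by numeric suffix. The "_2", "_3" suffix format and the function name are assumptions for the example, not taken from this thread.

```python
def unique_anchors(anchors):
    """Give each anchor a unique name by appending _2, _3, ... to repeats."""
    used = set()
    result = []
    for anchor in anchors:
        candidate, n = anchor, 1
        # Keep counting until the candidate has not been handed out yet.
        while candidate in used:
            n += 1
            candidate = f"{anchor}_{n}"
        used.add(candidate)
        result.append(candidate)
    return result


print(unique_anchors(["Notes", "History", "Notes", "Notes"]))
```

Whatever produces the anchors, lossy or not, the same pass keeps them distinct on the page.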

For instance right now this section header: == Broken Template in "[[Annapolis]]" ==

gives us this encoded fragment ID: #Broken_Template_in_.22Annapolis.22

I'd rather just see this: #Broken_Template_in_Annapolis
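
A minimal Python sketch of the stripping approach for this example. It handles ASCII punctuation only and assumes link markup has already been removed from the heading text; the names are hypothetical.

```python
import string

# ASCII punctuation that is still fine inside an id.
KEEP = "-_.:"
# Translation table that deletes every other ASCII punctuation character.
STRIP_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP)
)


def stripped_anchor(heading_text: str) -> str:
    """Turn spaces into underscores, then drop disallowed ASCII punctuation."""
    return heading_text.replace(" ", "_").translate(STRIP_TABLE)


print(stripped_anchor('Broken Template in "Annapolis"'))
```

Non-ASCII characters pass through untouched here; deciding which of those are legal needs the full XML name rules.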

-- brion

Aryeh Gregor

10:18 p.m.

On Mon, Dec 29, 2008 at 1:11 PM, Brion Vibber brion@wikimedia.org wrote:

...

There's no real need to encode these IMHO; in nearly all scenarios it would be more readable to strip them, just like we strip markup. Lossiness isn't a problem as long as the result is useful and legible. (Note we already have to handle uniqueness by appending a number for duplicate section header names, so stripping characters from the originals doesn't create a new problem there.)

Yeah, I was thinking about it and reached the same conclusion. Just replace any run of disallowed characters with underscores. If it starts with a character that's only valid in the middle, we might prefix an underscore instead of an x, while we're at it.
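
The underscore variant could look something like the Python sketch below. It assumes the NameStartChar and NameChar ranges from XML 1.0 (Fifth Edition), which is where the start-character list earlier in the thread appears to come from; the draft linked at the top of the thread defines the classes differently. Function and constant names are hypothetical.

```python
import re

# Assumed: XML 1.0 (Fifth Edition) NameStartChar, as a regex class body.
NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Extra characters valid anywhere except the first position.
NAME_EXTRA = "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

DISALLOWED_RUN = re.compile(f"[^{NAME_START}{NAME_EXTRA}]+")
VALID_START = re.compile(f"[{NAME_START}]")


def underscore_anchor(heading: str) -> str:
    """Replace each run of disallowed characters with one underscore."""
    anchor = DISALLOWED_RUN.sub("_", heading)
    # Prefix an underscore if the first character is only valid mid-name.
    if not VALID_START.match(anchor):
        anchor = "_" + anchor
    return anchor


print(underscore_anchor("2008 results"))
print(underscore_anchor('Broken Template in "Annapolis"'))
```

One visible difference from plain stripping: trailing punctuation leaves a trailing underscore, so the Annapolis heading becomes "Broken_Template_in_Annapolis_" unless edge underscores are trimmed as an extra step.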

Aryeh Gregor

30 Dec 2008

1:37 a.m.

On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber brion@wikimedia.org wrote:

...

If there are no major browser compatibility problems, I would probably recommend we roll back the nasty old .XX encoding for HTML 4 compatibility, in which case we could quite legally produce something direct, such as:

http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...

This now works in r45171. Currently it's an option, disabled by default, because it needs more testing and discussion. Unless you see a problem right away, I suggest enabling it on testwiki.

Brion Vibber

1 Jan 2009

12:40 a.m.

Aryeh Gregor wrote:

...

On Sun, Dec 28, 2008 at 5:57 PM, Brion Vibber brion@wikimedia.org wrote:

...
If there are no major browser compatibility problems, I would probably recommend we roll back the nasty old .XX encoding for HTML 4 compatibility, in which case we could quite legally produce something direct, such as:

http://ru.wikipedia.org/wiki/%D0%A3%D0%BF%D0%BB%D0%B8%D1%81%D1%86%D0%B8%D1%8...

This now works in r45171. Currently it's an option, disabled by default, because it needs more testing and discussion. Unless you see a problem right away, I suggest enabling it on testwiki.

Hmm, the fragments aren't getting URL-encoded when put onto links:

Particularly when tossing them around in HTTP headers and such I'm pretty sure we should be consistently URL-encoding the UTF-8.
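
A Python sketch of what consistent encoding means here: the fragment is kept in source form and percent-encoded as UTF-8 only at the point where the link or header value is built. The function name and the example.org URL are placeholders, not anything from MediaWiki.

```python
from urllib.parse import quote, unquote


def link_with_fragment(base_url: str, fragment: str) -> str:
    """Append a fragment, percent-encoding its UTF-8 bytes."""
    return f"{base_url}#{quote(fragment, safe='')}"


url = link_with_fragment("https://example.org/wiki/Page", "Фильмография")
print(url)
# Decoding the fragment gives back the original heading text.
print(unquote(url.split("#", 1)[1]))
```

The result is plain ASCII, so it is safe to place in an HTTP header, while browsers can still display the decoded form.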

(In general handling of the fragment bit on Title objects is weird and horrifying, IMHO, with the '#' prefix and any encoding handled by whoever sets it... they probably _ought_ to be just passed in in source form and let the Title object worry about all the normalization and encoding when it makes a link.)

-- brion

Aryeh Gregor

2 Jan 2009

10:12 p.m.

Current status on this in trunk:

1) All links that I tested work on Firefox 3, IE5, IE5.5, IE6, and at least one version of recentish Opera. Still needs testing on IE7, IE8 beta, WebKit, and older Firefox.

2) There are no legacy anchors in place. In other words, old links will all break. This should be reasonably easy to fix, although more than two or three lines (need to maintain and check a second array in Parser::formatHeadings() to get the numbering right).

3) Redirects to section after edit work in Firefox, but fail in IE5, 5.5, and 6, whether the HTTP response URL is urlencoded or not. With (2) fixed, these could redirect to the legacy anchor in known broken user agents. This might require some refactoring.
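
The fallback suggested in (3) might be sketched like this in Python. The user-agent test is a crude assumption (matching "MSIE 5." or "MSIE 6." in the header), the legacy encoding is an approximation of the old scheme, and all names are hypothetical.

```python
import re
from urllib.parse import quote

# Assumed marker for the user agents reported as failing: IE 5, 5.5, and 6.
BROKEN_UA = re.compile(r"MSIE [56]\.")


def legacy_anchor(heading: str) -> str:
    """Approximate dot-encoding: percent-encode UTF-8, then '%' -> '.'."""
    return quote(heading.replace(" ", "_"), safe=":").replace("%", ".")


def redirect_fragment(heading: str, user_agent: str) -> str:
    """Pick the fragment for a post-edit redirect based on the user agent."""
    if BROKEN_UA.search(user_agent):
        return legacy_anchor(heading)
    # Other browsers get the new-style anchor, URL-encoded for the header.
    return quote(heading.replace(" ", "_"), safe="")


ie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
print(redirect_fragment("Фильмография", ie6))
```

This only works once the page also carries the legacy anchors described in (2), since the redirect target has to exist.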

Aryeh Gregor

5 Jan 2009

5:37 p.m.

On Fri, Jan 2, 2009 at 4:12 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:

...

1) All links that I tested work on Firefox 3, IE5, IE5.5, IE6, and at least one version of recentish Opera. Still needs testing on IE7, IE8 beta, WebKit, and older Firefox.

I've tested on IE7, and it works. I wasn't able to test on the others yet, but I doubt there will be any problems, except *possibly* in IE8 beta (hopefully not).

...

2) There are no legacy anchors in place. In other words, old links will all break. This should be reasonably easy to fix, although more than two or three lines (need to maintain and check a second array in Parser::formatHeadings() to get the numbering right).

This is fixed in r45418.

...

3) Redirects to section after edit work in Firefox, but fail in IE5, 5.5, and 6, whether the HTTP response URL is urlencoded or not. With (2) fixed, these could redirect to the legacy anchor in known broken user agents. This might require some refactoring.

This is the last thing that needs to be fixed for the code to be usable, AFAIK.


wikitech-l@lists.wikimedia.org
