Fun with URL encoding - Wikitech-l

10 Oct 2002

I finally got around to ferreting out the URL-encoding problem, which 
produced some URLs that were encoded correctly, but others that 
re-encoded the already URL-encoded form. For instance, for a page titled 
'Anátomy?' we might see hrefs that are incorrectly double-encoded like 
these:

http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F
http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F&amp;ac…
http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F&amp;ac…
http://www.piclab.com/wikitest/wiki.phtml?title=Talk:An%25E1tomy%253F&a…

as well as correct ones like:
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Userlogin&amp;r…
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Whatlinkshere&a…
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Recentchangeslinked…

(Note that the &s in the URL appear here as &amp; because I copied them 
right out of the HTML source of the page, where they must appear that 
way to be legal HTML.)

A hackish redundant urldecode() in Title::newFromURL() was presumably 
added to catch the first case. (PHP decodes URL variables before we get 
them, so it's not necessary on correctly-encoded URLs.) I'd prefer to 
remove it, but double-encoded URLs have been polluting the search 
engines for some time and we have to retain compatibility.
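
To illustrate why the extra decode rescues the old links, here is a rough 
sketch in Python rather than the wiki's PHP. The function name 
title_from_url is made up for this example, and Latin-1 is assumed as the 
wiki's charset; the real Title::newFromURL() is not reproduced here.

```python
from urllib.parse import unquote

def title_from_url(raw_param: str) -> str:
    """Hypothetical stand-in for the decode path described above."""
    # First decode: what PHP already does before our code sees the variable.
    once = unquote(raw_param, encoding="latin-1")
    # Second, redundant decode: a no-op on correctly-encoded input,
    # but it unwraps the extra layer on double-encoded input.
    return unquote(once, encoding="latin-1")

correct = title_from_url("An%E1tomy%3F")
double = title_from_url("An%25E1tomy%253F")
```

Both calls end up at the same title. The price is that a title which 
really contains a literal percent sequence (say '50%41') can no longer be 
reached through a correctly-encoded URL, since it gets decoded once too 
often.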

The culprit was wfLocalUrl(), which takes two parameters, a wiki page 
title and a section for additional URL bits; it URL-encodes the title, 
then tacks both onto the server's hostname... but the first one has 
already been encoded by Title::getPrefixedURL(), so we get the fubar'd 
double-encoding above.
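
The bug can be reproduced in a few lines. This is a sketch in Python, not 
the actual PHP; get_prefixed_url, local_url_buggy and local_url_fixed are 
illustrative names standing in for Title::getPrefixedURL() and the old and 
new wfLocalUrl(), and Latin-1 is assumed as the wiki's charset.

```python
from urllib.parse import quote

BASE = "http://www.piclab.com/wikitest/wiki.phtml"

def get_prefixed_url(title: str) -> str:
    # Stand-in for Title::getPrefixedURL(): encodes the title once.
    return quote(title, safe=":", encoding="latin-1")

def local_url_buggy(encoded_title: str, query: str = "") -> str:
    # Old behaviour: encodes a title that is already encoded,
    # so every '%' becomes '%25'.
    url = BASE + "?title=" + quote(encoded_title, safe=":")
    return url + ("&" + query if query else "")

def local_url_fixed(encoded_title: str, query: str = "") -> str:
    # New behaviour: trust the caller's encoding and just concatenate.
    url = BASE + "?title=" + encoded_title
    return url + ("&" + query if query else "")

encoded = get_prefixed_url("Anátomy?")   # 'An%E1tomy%3F'
bad = local_url_buggy(encoded)
good = local_url_fixed(encoded)
```

Titles made only of unreserved characters, such as Special:Userlogin, come 
out the same from both versions, which is why only some links were broken.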

The encoding stays correct in the target=, returnto=, etc. parameters 
because the additional URL bits aren't encoded a second time (the &s 
can't be URL-encoded or they lose their meaning as separators).
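
A small Python sketch of why the query bits have to be left alone. The 
parameter values here are invented for the example; only the parameter 
names come from the URLs above.

```python
from urllib.parse import quote, parse_qs

# Made-up values, for illustration only.
query = "returnto=Main_Page&target=Foo"

# Left as-is, the string still parses into separate parameters.
parsed_plain = parse_qs(query)

# Encoded as a whole, '&' and '=' turn into %26 and %3D,
# so there are no separators left to split on.
encoded_query = quote(query, safe="")
parsed_encoded = parse_qs(encoded_query)
```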

I've removed the redundant encoding from wfLocalUrl(); I haven't come 
across another mis-encoded URL since, and I've been trying.

Additionally, I've added a check to Title::newFromURL() that verifies 
the character encoding of links coming in from the outside; for a 
latin-1 wiki UTF-8 encoded links are detected and converted to latin-1, 
and on a UTF-8 wiki latin-1 links are detected and converted to UTF-8. 
(The check is done in Language::checkTitleEncoding() and can be 
customized by language; I've set up the Polish to detect Latin-2 and the 
Esperanto to detect X-surrogates, so they'll be able to retain 
compatibility with existing links once converted.)
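
The general idea of the check can be sketched as follows. This is a Python 
approximation under simple assumptions (only the Latin-1/UTF-8 pair is 
handled, and detection is just "does it decode as valid UTF-8"); it is not 
the code in Language::checkTitleEncoding(), and the function signature is 
made up for the example.

```python
def check_title_encoding(raw: bytes, wiki_encoding: str) -> bytes:
    """Return the title bytes in the wiki's own charset (illustrative)."""
    if all(b < 0x80 for b in raw):
        return raw  # pure ASCII is identical in both encodings

    try:
        text = raw.decode("utf-8")
        looks_utf8 = True
    except UnicodeDecodeError:
        looks_utf8 = False

    if wiki_encoding == "latin-1":
        if not looks_utf8:
            return raw  # already Latin-1
        try:
            return text.encode("latin-1")  # UTF-8 link on a Latin-1 wiki
        except UnicodeEncodeError:
            return raw  # character has no Latin-1 form; leave untouched
    if wiki_encoding == "utf-8":
        if looks_utf8:
            return raw  # already UTF-8
        return raw.decode("latin-1").encode("utf-8")  # Latin-1 link
    raise ValueError("unsupported wiki encoding: " + wiki_encoding)

utf8_link = "Anátomy?".encode("utf-8")
latin1_link = "Anátomy?".encode("latin-1")
on_latin1_wiki = check_title_encoding(utf8_link, "latin-1")
on_utf8_wiki = check_title_encoding(latin1_link, "utf-8")
```

A per-language override would swap in a different legacy charset or 
detection rule in place of the Latin-1 branch.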

This is needed for a few reasons:

* Some browsers (notably Internet Explorer) send URLs encoded in UTF-8 
if you type them into the URL bar or follow a link that's not 
URL-encoded. Thus we were getting mis-encoded titles from time to time 
when someone typed a title with accented chars directly into the URL bar 
or followed links from differently-encoded external sites.

* This should help with linking between the various language wikis, with 
less need for manually adding URL-encoding to interlanguage links that 
cross encodings.

* As noted above, compatibility with old URLs on some wikis.

Note that _theoretically_ a legal UTF-8 sequence could also be legal ISO 
8859-1. (But not bloody likely -- an uppercase accented letter followed 
by a single high punctuation mark or symbol, or a lowercase accented 
letter followed by two or three high punctuation marks or symbols.) 
Title URLs aren't checked or converted if the referer matches our 
server, so one could still work with such a page; just set up a redirect 
from the converted form for the benefit of outside links.
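
For a concrete case of the ambiguity, take the two bytes C3 A9. The Python 
snippet below (illustrative only) shows that they are a valid title in 
either encoding, so no detector looking at the bytes alone can tell which 
one was meant.

```python
raw = b"\xc3\xa9"

# Read as UTF-8: one character, a lowercase e with acute accent.
as_utf8 = raw.decode("utf-8")

# Read as ISO 8859-1: an uppercase accented letter followed by
# a single high symbol (A-tilde, then the copyright sign).
as_latin1 = raw.decode("latin-1")
```

A Latin-1 wiki with a page actually titled with those two characters would 
see outside links to it converted to the one-character form, which is where 
the redirect comes in.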

-- brion vibber (brion @ pobox.com)