> As I've mentioned before, I'm pretty sure it's the encoding hack I set up to keep ampersands in titles _in_ the titles instead of as raw ampersands that indicate the beginning of the next variable in the query string:
>     RewriteEngine On
>     RewriteMap urlencode prg:/usr/local/bin/urlencode
>     RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
> If the hackish little external program should die or get out of sync, we end up with the wrong URLs. But this ugliness *shouldn't* be needed. We *should* be able to use the internal function that Apache provides for this...
You are mistaken that Apache is doing the wrong thing: ampersands are /not/ supposed to be urlencoded--they are valid and meaningful characters needed for URLs. But ampersands do need to be messed with for Wikipedia-specific reasons: since article titles must appear as values in the query string (which is separated by ampersands), they must be escaped somehow for that function. Also, the non-escaped ampersands in the URL must be HTML-escaped when they appear as attribute values, such as HREFs. These are both entirely separate issues, and the code formerly dealt with them correctly, although in a way that you didn't like. We may have to compromise; accept the double-encoding for ampersands that you removed for other characters. Either that, or come up with some other escaping mechanism for ampersands in titles.
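The second of the two issues above (HTML-escaping ampersands in attribute values) can be sketched in C roughly as follows. This is only an illustration, not the code Wikipedia used: the function name `html_escape_amp` and the `action=edit` parameter in the usage note are assumptions introduced for the example, and only `&` is handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the Wikipedia code base).
 * Returns a newly malloc'ed copy of url with every '&' written as "&amp;",
 * for use inside an HTML attribute value such as an HREF.
 * Only '&' is handled; a full escaper would also cover '<', '>' and '"'.
 * The caller frees the result. Returns NULL if allocation fails. */
char *html_escape_amp(const char *url)
{
    assert(url != NULL);

    /* Worst case: every character is '&' and grows to five characters. */
    char *out = malloc(5 * strlen(url) + 1);
    if (out == NULL) {
        return NULL;
    }

    char *d = out;
    for (const char *s = url; *s != '\0'; ++s) {
        if (*s == '&') {
            memcpy(d, "&amp;", 5);
            d += 5;
        }
        else {
            *d++ = *s;
        }
    }
    *d = '\0';
    return out;
}
```

Given a URL whose title has already been escaped for the query string, such as `/w/wiki.phtml?title=AT%26T&action=edit`, this returns `/w/wiki.phtml?title=AT%26T&amp;action=edit`: the `%26` inside the title is untouched, and only the separator ampersand is rewritten for HTML.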
lcrocker@nupedia.com wrote:
>> As I've mentioned before, I'm pretty sure it's the encoding hack I set up to keep ampersands in titles _in_ the titles instead of as raw ampersands that indicate the beginning of the next variable in the query string:
>>     RewriteEngine On
>>     RewriteMap urlencode prg:/usr/local/bin/urlencode
>>     RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
>> If the hackish little external program should die or get out of sync, we end up with the wrong URLs. But this ugliness *shouldn't* be needed. We *should* be able to use the internal function that Apache provides for this...
> You are mistaken that Apache is doing the wrong thing: ampersands are /not/ supposed to be urlencoded--they are valid and meaningful characters needed for URLs.
It's not the wrong thing in _all_ cases, but it's definitely the wrong thing for the case of "take an arbitrary string and put it as a value in a key=value pair in a URL-encoded query string", which is the main reason I would use such a function in URL-rewriting.
> But ampersands do need to be messed with for Wikipedia-specific reasons: since article titles must appear as values in the query string (which is separated by ampersands), they must be escaped somehow for that function. Also, the non-escaped ampersands in the URL must be HTML-escaped when they appear as attribute values, such as HREFs. These are both entirely separate issues, and the code formerly dealt with them correctly, although in a way that you didn't like. We may have to compromise; accept the double-encoding for ampersands that you removed for other characters. Either that, or come up with some other escaping mechanism for ampersands in titles.
Aside from my general distaste for the double-encoding, it doesn't handle the case of manual input: someone who types http://www.wikipedia.com/wiki/AT&T into their URL bar shouldn't end up at [[AT]].
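To see why a raw ampersand sends AT&T to [[AT]], here is a minimal C sketch of how a query-string reader that splits on `&` sees the rewritten URL. `query_value` is a hypothetical helper written for this illustration; it is not how PHP or Apache actually parse the query string, and it deliberately returns the value still percent-encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the raw (still percent-encoded) value of the
 * parameter `name` from `query` into `out`.
 * Pairs are separated by '&' and key is separated from value by '='.
 * Returns 1 if the parameter was found, 0 otherwise. */
int query_value(const char *query, const char *name, char *out, size_t outlen)
{
    assert(query != NULL && name != NULL && out != NULL && outlen > 0);

    size_t namelen = strlen(name);
    const char *p = query;

    while (*p != '\0') {
        /* The current pair runs up to the next '&' or the end of the string. */
        size_t pairlen = strcspn(p, "&");

        if (pairlen > namelen &&
            strncmp(p, name, namelen) == 0 && p[namelen] == '=') {
            size_t vallen = pairlen - namelen - 1;
            if (vallen >= outlen) {
                vallen = outlen - 1;    /* truncate rather than overflow */
            }
            memcpy(out, p + namelen + 1, vallen);
            out[vallen] = '\0';
            return 1;
        }

        p += pairlen;
        if (*p == '&') {
            ++p;                        /* skip the separator */
        }
    }
    return 0;
}
```

With the query string `title=AT&T` the value found for `title` is `AT`, because the ampersand ends the pair. With `title=AT%26T` the value is `AT%26T`, which decodes back to the full title.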
See attached patch for the Apache source which adds a rewrite map function which encodes ampersands only. It works nicely on my test server, but I don't want to mess with installing it on the main server; I'm not sure exactly how the compile configuration was set up, and I've done enough damage lately. :)
Once installed, the rewrite map can look like this:
    RewriteEngine On
    RewriteMap urlencode int:ampencode
    RewriteRule ^/wiki/(.*)$ /w/wiki.phtml?title=${urlencode:$1} [L]
    ...
If it looks reasonable, please go ahead and set it up.
-- brion vibber (brion @ pobox.com)
--- orig/apache_1.3.26/src/modules/standard/mod_rewrite.h  Wed Mar 13 13:05:34 2002
+++ apache_1.3.26/src/modules/standard/mod_rewrite.h  Tue Oct 15 14:07:21 2002
@@ -447,6 +447,7 @@
 static char *rewrite_mapfunc_toupper(request_rec *r, char *key);
 static char *rewrite_mapfunc_tolower(request_rec *r, char *key);
 static char *rewrite_mapfunc_escape(request_rec *r, char *key);
+static char *rewrite_mapfunc_ampescape(request_rec *r, char *key);
 static char *rewrite_mapfunc_unescape(request_rec *r, char *key);
 static char *select_random_value_part(request_rec *r, char *value);
 static void rewrite_rand_init(void);
--- orig/apache_1.3.26/src/modules/standard/mod_rewrite.c  Wed May 29 10:39:23 2002
+++ apache_1.3.26/src/modules/standard/mod_rewrite.c  Tue Oct 15 14:07:49 2002
@@ -502,6 +502,9 @@
         else if (strcmp(a2+4, "unescape") == 0) {
             new->func = rewrite_mapfunc_unescape;
         }
+        else if (strcmp(a2+4, "ampescape") == 0) {
+            new->func = rewrite_mapfunc_ampescape;
+        }
         else if (sconf->state == ENGINE_ENABLED) {
             return ap_pstrcat(cmd->pool, "RewriteMap: internal map not found:",
                               a2+4, NULL);
@@ -2982,6 +2985,30 @@
 
     value = ap_escape_uri(r->pool, key);
     return value;
+}
+
+static char *rewrite_mapfunc_ampescape(request_rec *r, char *key)
+{
+    /* We only need to escape the ampersand */
+    char *copy = ap_palloc(r->pool, 3 * strlen(key) + 3);
+    const unsigned char *s = (const unsigned char *)key;
+    unsigned char *d = (unsigned char *)copy;
+    unsigned c;
+
+    while ((c = *s)) {
+        if (c == '&') {
+            *d++ = '%';
+            *d++ = '2';
+            *d++ = '6';
+        }
+        else {
+            *d++ = c;
+        }
+        ++s;
+    }
+    *d = '\0';
+
+    return copy;
 }
 
 static char *rewrite_mapfunc_unescape(request_rec *r, char *key)
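For anyone who wants to try the escaping logic without building Apache, the following is a rough standalone equivalent of the loop in the patch. It is a sketch under assumptions: the name `amp_escape` is made up for this example, and `malloc` stands in for `ap_palloc` and the request pool, so the caller has to free the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Standalone sketch of the ampersand-only escaping from the patch.
 * amp_escape is a hypothetical name; malloc replaces Apache's ap_palloc.
 * Returns a newly allocated string with each '&' replaced by "%26",
 * or NULL if allocation fails. All other characters are copied as-is. */
char *amp_escape(const char *key)
{
    assert(key != NULL);

    /* Worst case: every character is '&' and becomes the three
     * characters "%26", plus one byte for the terminator. */
    char *copy = malloc(3 * strlen(key) + 1);
    if (copy == NULL) {
        return NULL;
    }

    const char *s = key;
    char *d = copy;
    while (*s != '\0') {
        if (*s == '&') {
            *d++ = '%';
            *d++ = '2';
            *d++ = '6';
        }
        else {
            *d++ = *s;
        }
        ++s;
    }
    *d = '\0';

    return copy;
}
```

Calling `amp_escape("AT&T")` gives `AT%26T`. Because `%` itself is passed through, this is not a general URL encoder: it only keeps a title's ampersands from being read as query-string separators.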
I wrote:
> Once installed, the rewrite map can look like this:
>     RewriteEngine On
>     RewriteMap urlencode int:ampencode
> ...
> +static char *rewrite_mapfunc_ampescape(request_rec *r, char *key);
Err, make that "RewriteMap urlencode int:ampescape".
BTW, the code for rewrite_mapfunc_ampescape() is mainly ripped out of the URI-encoding function ap_os_escape_path() in main/util.c
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org