Hi all!
It seems to me that handling of non-ASCII characters in interwiki links (or in URLs in general) is a bit problematic. As an example, take [[en:Václav Havel]]. Since en: does not use UTF-8, the URL is ".../V%E1clav_Havel". If you try to use the interwiki link to cs: (specified in the source as [[cs:Václav Havel]]), it leads to http://cs.wikipedia.org/wiki/V%E1clav_Havel, which is _wrong_, because the cs: Wikipedia uses UTF-8 and the proper link should be ".../V%C3%A1clav_Havel". And, vice versa, the Czech article contains an interwiki link (specified again as [[en:Václav Havel]]) that leads to http://en.wikipedia.org/wiki/V%C3%A1clav_Havel, which is, again, wrong.
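To make the mismatch concrete (a Python 3 sketch, purely for illustration; this is not anything MediaWiki itself runs): the same title percent-encodes to two different URLs depending on the page encoding:

    from urllib.parse import quote

    title = "Václav_Havel"
    print(quote(title, encoding="latin-1"))  # V%E1clav_Havel (what en: produces)
    print(quote(title, encoding="utf-8"))    # V%C3%A1clav_Havel (what cs: expects)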
I believe that a correct solution (apart from the long-term solution of using UTF-8 everywhere) could be:
- Accept UTF-8 in URLs on en: (but how could they be recognized??)
- Interwiki linking should use UTF-8 even on en: (or does another Wikipedia besides en: use Latin-1?)
Best regards, [[cs:User:Mormegil | Petr Kadlec]]
Petr Kadlec wrote:
I believe that a correct solution (apart from the long-term solution of using UTF-8 everywhere) could be:
- Accept UTF-8 in URLs on en: (but how could they be recognized??)
- Interwiki linking should use UTF-8 even on en: (or does another Wikipedia besides en: use Latin-1?)
Last question: yes, there are a few. For those wikis that are now UTF-8 but were ISO-8859-1 before, the 8859-1 %XX codes are also still accepted.
Indeed, it is better to avoid the %XX codes in interwiki links. A reasonably good alternative is to use &name; and &#number; entities, which are independent of the encoding. The pywikipediabot will take this route for all links that cannot be expressed natively, and the interwiki bot will automatically convert all %XX links upon passing (but only if other updates are needed to the page).
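Roughly, the idea is this (a sketch, not the actual pywikipediabot code): replace every non-ASCII character with its numeric character reference, which any wiki interprets the same way regardless of its page encoding:

    def to_entities(title):
        # Replace each non-ASCII character with a &#NNN; numeric
        # character reference; plain ASCII passes through unchanged.
        return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in title)

    print(to_entities("Václav Havel"))  # prints: V&#225;clav Havel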
Regards,
Rob Hooft
On Sun, 28 Nov 2004 19:37:21 +0100, Rob Hooft rob@hooft.net wrote:
Indeed, it is better to avoid the %XX codes in interwiki links. A reasonably good alternative is to use &name; and &#number; entities, which are independent of the encoding. The pywikipediabot will take this route for all links that cannot be expressed natively, and the interwiki bot will automatically convert all %XX links upon passing (but only if other updates are needed to the page).
Yes, this is sensible, but it doesn't avoid the problem described here: the actual *URL* will still include one encoding or the other, however cunningly the wiki-code is constructed. Neither http://cs.wikipedia.org/wiki/V&#225;clav_Havel nor http://en.wikipedia.org/wiki/V&#225;clav_Havel is an existing page, since HTML escaping doesn't belong in a URL. [Amusingly, if you click "article", it takes you to the right page, since the "&" hasn't been further escaped in the HTML.]
OTOH, I can't actually get that particular example to break anyway: I can happily click the interwiki links between cs.wp and en.wp, and the URL gets re-encoded back and forth just fine. I guess someone fixed it already?
On Nov 28, 2004, at 6:32 AM, Petr Kadlec wrote:
It seems to me that handling of non-ASCII characters in interwiki links (or in URLs in general) is a bit problematic. As an example, take [[en:Václav Havel]]. Since en: does not use UTF-8, the URL is ".../V%E1clav_Havel". If you try to use the interwiki link to cs: (specified in the source as [[cs:Václav Havel]]), it leads to http://cs.wikipedia.org/wiki/V%E1clav_Havel, which is _wrong_, because the cs: Wikipedia uses UTF-8 and the proper link should be ".../V%C3%A1clav_Havel".
It detects the encoding on the incoming link and redirects transparently. Where's the problem?
And, vice versa, the Czech article contains an interwiki link (specified again as [[en:Václav Havel]]) that leads to http://en.wikipedia.org/wiki/V%C3%A1clav_Havel, which is, again, wrong.
It detects the encoding on the incoming link and redirects transparently. Where's the problem?
I believe that a correct solution (apart from the long-term solution of using UTF-8 everywhere) could be:
- Accept UTF-8 in URLs on en: (but how could they be recognized??)
We already do, see above.
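Roughly, the behavior seems to be this (a Python sketch, and only a guess at the logic; the real implementation is PHP inside MediaWiki): if the percent-decoded title is not valid UTF-8, treat it as Latin-1 and redirect to the canonical URL:

    def decode_title(raw):
        # raw: the percent-decoded title bytes from the request path.
        try:
            return raw.decode("utf-8")     # valid UTF-8: take it as-is
        except UnicodeDecodeError:
            # Not valid UTF-8: assume legacy ISO-8859-1 and redirect
            # to the canonical UTF-8 URL (redirect not shown here).
            return raw.decode("latin-1")

    print(decode_title(b"V\xe1clav_Havel"))      # Václav_Havel (Latin-1 URL)
    print(decode_title(b"V\xc3\xa1clav_Havel"))  # Václav_Havel (UTF-8 URL)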
-- brion vibber (brion @ pobox.com)
It detects the encoding on the incoming link and redirects transparently. Where's the problem?
Huh? Wow, I see, it works here in Firefox... Well, I tried it in Opera and it _did not_ work. I don't know if the problem was Opera, some proxy in-between, sunspots, PEBKAC, or whatever. I'll try it again tonight.
-- puzzled [[cs:User:Mormegil | Petr Kadlec]]
On Nov 29, 2004, at 12:35 AM, Petr Kadlec wrote:
It detects the encoding on the incoming link and redirects transparently. Where's the problem?
Huh? Wow, I see, it works here in Firefox... Well, I tried it in Opera and it _did not_ work. I don't know if the problem was Opera, some proxy in-between, sunspots, PEBKAC, or whatever. I'll try it again tonight.
Works for me in Opera 7.54 for Mac OS X (build 1840).
-- brion vibber (brion @ pobox.com)
OK, now I've got it. The problem was a misconfigured proxy that inserted a bogus Referer header (which is probably used by MediaWiki to detect the need to redirect). So for a request like this:
    GET /wiki/V%E1clav_Havel HTTP/1.1
    Host: cs.wikipedia.org
    Referer: http://cs.wikipedia.org/wiki/V%E1clav_Havel
an "HTTP/1.0 200 OK" response comes back with the "page does not exist yet" contents.
After the proxy configuration was fixed, everything works fine.
Thanks for help, -- [[cs:User:Mormegil | Petr Kadlec]]