URL decoding - Wikitech-l

14 Dec 2007


I have had a lot of fun already, playing around with Domas' log 
files posted in the last four days. However, the log files contain 
parts of URLs that need to be decoded.  Removing the underscore in 
United_Kingdom is not a problem.  Neither is decoding the correct 
UTF-8 as in Sm%C3%B6rg%C3%A5sbord (Smörgåsbord).  But for the 
Russian Wikipedia, many URLs found in these log files are not 
proper UTF-8.  What method or algorithm should I use to decode 
these URLs, and how can I tell them apart from the majority?  
Does the MediaWiki software make assumptions about ISO 8859-1 for 
Swedish or KOI-8 for Russian URLs?

Currently I use the following simple Perl code for decoding and 
unifying URLs, running in an 8-bit binary environment:
    $text =~ s/\+/_/g;                                             # '+' stands for a space
    $text =~ s/%([A-Fa-f0-9][A-Fa-f0-9])/sprintf("%c",hex($1))/eg; # %XX to a raw byte
    $text =~ s/ /_/g;                                              # space to underscore
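
One possible way to tell the two groups apart, offered as a minimal sketch and not as a description of what MediaWiki actually does: decode the percent escapes to raw bytes, attempt a strict UTF-8 decode, and only fall back to a legacy charset when that fails. The helper name decode_title and the fallback charset are assumptions made for illustration; nothing in the log files says which legacy encoding (cp1251, koi8-r, iso-8859-1, ...) a given URL uses.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: percent-decode a URL fragment, then guess its charset.
# $fallback is an ASSUMED legacy encoding name understood by Encode,
# e.g. 'cp1251' or 'koi8-r' for Russian, 'iso-8859-1' for Swedish.
sub decode_title {
    my ($text, $fallback) = @_;
    $text =~ s/\+/_/g;                              # '+' stands for a space
    $text =~ s/%([A-Fa-f0-9]{2})/chr(hex($1))/eg;   # %XX to a raw byte
    $text =~ s/ /_/g;                               # space to underscore

    # Strict decoding dies on malformed UTF-8, so eval returns undef then.
    # LEAVE_SRC keeps $text intact for the fallback attempt.
    my $chars = eval { decode('UTF-8', $text, FB_CROAK | LEAVE_SRC) };
    return ($chars, 'UTF-8') if defined $chars;

    # Not valid UTF-8: interpret the same bytes in the assumed legacy charset.
    return (decode($fallback, $text), $fallback);
}

my ($title, $charset) = decode_title('%D0%EE%F1%F1%E8%FF', 'cp1251');
```

With these inputs, Sm%C3%B6rg%C3%A5sbord passes the strict test and is reported as UTF-8, while %D0%EE%F1%F1%E8%FF fails it (the byte after D0 is not a continuation byte) and is read as cp1251. The test is only a heuristic: a short legacy-encoded string can happen to be valid UTF-8, and pure ASCII titles pass trivially.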
-- 
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se