I have had a lot of fun already playing around with Domas' log files posted in the last four days. However, the log files contain parts of URLs that need to be decoded. Removing the underscore in United_Kingdom is not a problem. Neither is decoding correct UTF-8, as in Sm%C3%B6rg%C3%A5sbord (Smörgåsbord). But for the Russian Wikipedia, many URLs found in these log files are not proper UTF-8. What method or algorithm should I use to decode these URLs, and how can I tell them apart from the properly encoded majority? Does the MediaWiki software make assumptions such as ISO 8859-1 for Swedish or KOI-8 for Russian URLs?
Currently I use the following simple Perl code for decoding and unifying URLs, running in an 8-bit binary environment:
$text =~ s/\+/_/g;                             # '+' encodes a space; it must be escaped in the regex
$text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;  # percent-decode each %XX escape to a raw byte
$text =~ s/ /_/g;                              # unify remaining spaces with underscores
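One way to tell the two cases apart might be to attempt a strict UTF-8 decode first and fall back to a legacy single-byte encoding only when that fails. This is just a sketch under assumptions, not what MediaWiki actually does: the helper name decode_title is hypothetical, and cp1251 is a guess for the Russian fallback (it could equally be KOI8-R, or ISO 8859-1 for Swedish).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: percent-decode a URL fragment to raw bytes,
# try strict UTF-8 first, and fall back to a legacy encoding
# (cp1251 assumed here) if the bytes are not valid UTF-8.
sub decode_title {
    my ($url, $fallback) = @_;
    $fallback ||= 'cp1251';
    (my $bytes = $url) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    # FB_CROAK makes decode() die on malformed input, so the eval
    # yields undef for anything that is not well-formed UTF-8.
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    $text = decode($fallback, $bytes) unless defined $text;
    $text =~ s/[ +]/_/g;   # unify spaces, '+' and underscores as before
    return $text;
}
```

For example, decode_title("Sm%C3%B6rg%C3%A5sbord") comes back as Smörgåsbord via the UTF-8 branch, while a byte sequence that is invalid as UTF-8 drops through to the cp1251 branch. This works because valid multi-byte UTF-8 has a rigid structure that random legacy-encoded bytes rarely satisfy, although short strings can occasionally be ambiguous.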