I have had a lot of fun already, playing around with Domas' log
files posted in the last four days. However, the log files contain
parts of URLs that need to be decoded. Removing the underscore in
United_Kingdom is not a problem. Neither is decoding the correct
UTF-8 as in Sm%C3%B6rg%C3%A5sbord (Smörgåsbord). But for the
Russian Wikipedia, many URLs found in these log files are not
proper UTF-8. What method or algorithm should I use to decode
these URLs, and how can I tell them apart from the majority?
Does the MediaWiki software make assumptions about ISO 8859-1 for
Swedish or KOI-8 for Russian URLs?
Currently I use the following simple Perl code for decoding and
unifying URLs, running in an 8-bit binary environment:
# Turn raw '+' (URL form encoding for space) into '_' BEFORE percent-
# decoding, so that a literal %2B decodes to '+' and is left alone:
$text =~ s/\+/_/g;
# Decode %XX escapes byte by byte, with no charset assumption:
$text =~ s/%([A-Fa-f0-9]{2})/chr(hex($1))/eg;
# Normalize decoded spaces (%20) to underscores, MediaWiki title style:
$text =~ s/ /_/g;
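One common way to tell the two cases apart (sketched here in Python rather
than Perl, purely for illustration): strict UTF-8 decoding rejects almost
any byte string that actually came from a legacy single-byte Cyrillic
encoding, so you can try UTF-8 first and fall back. The choice of koi8_r
as the fallback below is an assumption on my part -- the real legacy
charset would depend on how each wiki was configured before it moved to
UTF-8, which is exactly my question above.

```python
def decode_title(raw: bytes) -> str:
    """Decode the percent-decoded bytes of a page title.

    Try strict UTF-8 first; if the bytes are not valid UTF-8,
    assume a legacy single-byte encoding (KOI8-R here, as a guess
    for Russian material).
    """
    try:
        # strict mode raises UnicodeDecodeError on any invalid sequence
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # fallback: a single-byte decode never fails, every byte maps
        return raw.decode("koi8_r")

# valid UTF-8 passes through unchanged:
decode_title(b"Sm\xc3\xb6rg\xc3\xa5sbord")        # Smörgåsbord
# KOI8-R bytes are invalid UTF-8, so the fallback kicks in:
decode_title(b"\xf2\xcf\xd3\xd3\xc9\xd1")          # Россия
```

Note this is only a heuristic: a short legacy-encoded string can, by
coincidence, also be valid UTF-8, in which case it would be misread as
UTF-8. For log-file statistics that may be acceptable; for anything
stricter you would need to know the wiki's actual legacy encoding.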
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se