I have had a lot of fun already, playing around with Domas' log
files posted in the last four days. However, the log files contain
parts of URLs that need to be decoded. Removing the underscore in
United_Kingdom is not a problem. Neither is decoding the correct
UTF-8 as in Sm%C3%B6rg%C3%A5sbord (Smörgåsbord). But for the
Russian Wikipedia, many URLs found in these log files are not
proper UTF-8. What method or algorithm should I use to decode
these URLs, and how can I tell them apart from the majority?
Does the MediaWiki software make assumptions about ISO 8859-1 for
Swedish or KOI-8 for Russian URLs?
Currently I use the following simple Perl code for decoding and
unifying URLs, running in an 8-bit binary environment:
# Turn raw '+' (URL form encoding for space) into '_' BEFORE percent-
# decoding, so that a literal %2B decodes to '+' and is left alone:
$text =~ s/\+/_/g;
# Decode %XX escapes byte by byte, with no charset assumption:
$text =~ s/%([A-Fa-f0-9]{2})/chr(hex($1))/eg;
# Normalize decoded spaces (%20) to underscores, MediaWiki title style:
$text =~ s/ /_/g;
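One common way to tell the two cases apart (sketched here in Python rather
than Perl, purely for illustration): strict UTF-8 decoding rejects almost
any byte string that actually came from a legacy single-byte Cyrillic
encoding, so you can try UTF-8 first and fall back. The choice of koi8_r
as the fallback below is an assumption on my part -- the real legacy
charset would depend on how each wiki was configured before it moved to
UTF-8, which is exactly my question above.

```python
def decode_title(raw: bytes) -> str:
    """Decode the percent-decoded bytes of a page title.

    Try strict UTF-8 first; if the bytes are not valid UTF-8,
    assume a legacy single-byte encoding (KOI8-R here, as a guess
    for Russian material).
    """
    try:
        # strict mode raises UnicodeDecodeError on any invalid sequence
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # fallback: a single-byte decode never fails, every byte maps
        return raw.decode("koi8_r")

# valid UTF-8 passes through unchanged:
decode_title(b"Sm\xc3\xb6rg\xc3\xa5sbord")        # Smörgåsbord
# KOI8-R bytes are invalid UTF-8, so the fallback kicks in:
decode_title(b"\xf2\xcf\xd3\xd3\xc9\xd1")          # Россия
```

Note this is only a heuristic: a short legacy-encoded string can, by
coincidence, also be valid UTF-8, in which case it would be misread as
UTF-8. For log-file statistics that may be acceptable; for anything
stricter you would need to know the wiki's actual legacy encoding.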
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se