Re: [Wikitech-l] How to decode URL to article title?

26 Apr 2007

See also
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/docs/title.txt?view=…

Bryan

On 4/26/07, Reid Priedhorsky <reid(a)umn.edu> wrote:
  Neil Harris wrote:
  Mohamed Magdy wrote:
  Reid Priedhorsky wrote:

  Hi,

 I am doing some analysis and need to convert article URLs to article
 titles. For example,

 http://en.wikipedia.org/wiki/Question:_Are_We_Not_Men%3F_Answer:_We_Are_Dev…

 converts to

    Question: Are We Not Men? Answer: We Are Devo!

 I've been doing some searching around but haven't found a specific
 procedure documented anywhere. It looks to me like standard URL
 unescaping followed by replacing underscores with spaces, but I wonder
 if there is more.

 Pointers to documentation or an explanation would be most appreciated.

 Thanks,

 Reid

 http://en.wikipedia.org/wiki/Percent-encoding

 http://netzreport.googlepages.com/online_tool_for_url_en_decoding.html

  And you should also read

 http://tools.ietf.org/html/rfc3987

 http://en.wikipedia.org/wiki/Unicode

 and

 http://en.wikipedia.org/wiki/UTF-8

 since all Wikipedia URLs are actually IRIs, which are mapped to URLs by
 first UTF-8 encoding the Unicode string, then percent-encoding the
 resulting bytes.
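
 As a rough illustration of that forward mapping, here is a minimal Python
 sketch. The function name iri_segment_to_url_segment is made up for this
 example, and urllib.parse.quote is only a generic percent-encoder: the exact
 set of characters MediaWiki leaves unescaped may differ, so treat this as an
 approximation rather than what the servers actually do.

```python
from urllib.parse import quote


def iri_segment_to_url_segment(segment: str) -> str:
    """Hypothetical helper: map one IRI path segment to its URL form."""
    # Step 1: encode the Unicode string as UTF-8 bytes.
    utf8_bytes = segment.encode("utf-8")
    # Step 2: percent-encode every byte that is not an unreserved character.
    return quote(utf8_bytes)


print(iri_segment_to_url_segment("Fabry-Pérot_interferometer"))
```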

 To reverse the process, first percent-decode the URL as needed, then
 decode the resulting UTF-8 byte string into Unicode.

 For example,

 Fabry-P%C3%A9rot_interferometer

 decodes to

 Fabry-Pérot interferometer

 ...since %C3%A9 percent-decodes to the two bytes 0xC3 0xA9, which are the
 UTF-8 encoding of Unicode code point U+00E9, the character "é".
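
 The two stages of that example can be checked with a short Python sketch
 (Python is my choice here, not something the thread prescribes; the variable
 names are illustrative). Note that the underscore is still present after
 these two stages, since replacing it with a space is a separate step.

```python
from urllib.parse import unquote_to_bytes

encoded = "Fabry-P%C3%A9rot_interferometer"

# Stage 1: percent-decode to a raw byte string.
raw = unquote_to_bytes(encoded)

# The two bytes produced by %C3%A9 sit right after "Fabry-P" (7 bytes).
e_acute_bytes = raw[7:9]

# Stage 2: interpret the bytes as UTF-8 to get a Unicode string.
decoded = raw.decode("utf-8")

print(e_acute_bytes.hex(), decoded)
```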

 Fortunately, you don't need to deal with IDN or Punycode at all, since
 the article title is encoded entirely in the URL path.

 OK, to summarize what this thread seems to be saying, the procedure is:

 1. Split the title from the rest of the URL.

 2. Percent-decode the title, yielding a UTF-8 byte string.

 3. Convert the byte string into a Unicode string.

 4. Replace underscores with spaces.

 Step 4 yields the article title, which is what appears in the XML dumps.
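
 The four steps above could be sketched in Python roughly as follows. This is
 only a sketch under stated assumptions: the function name url_to_title and
 the constant WIKI_PREFIX are invented for illustration, and it assumes the
 title is everything after a "/wiki/" path prefix. It does not handle
 index.php?title=... style URLs, fragments, or any further title
 normalization MediaWiki may apply.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Assumption: article URLs have the form http://<host>/wiki/<encoded title>.
WIKI_PREFIX = "/wiki/"


def url_to_title(url: str) -> str:
    """Hypothetical helper following the four steps summarized above."""
    # Step 1: split the title from the rest of the URL.
    path = urlsplit(url).path
    if not path.startswith(WIKI_PREFIX):
        raise ValueError("not a /wiki/ article URL: " + url)
    encoded_title = path[len(WIKI_PREFIX):]

    # Step 2: percent-decode the title, yielding a UTF-8 byte string.
    raw = unquote_to_bytes(encoded_title)

    # Step 3: convert the byte string into a Unicode string.
    # Invalid UTF-8 raises UnicodeDecodeError here.
    text = raw.decode("utf-8")

    # Step 4: replace underscores with spaces.
    return text.replace("_", " ")


print(url_to_title("http://en.wikipedia.org/wiki/Fabry-P%C3%A9rot_interferometer"))
```

 Decoding to bytes first and only then to Unicode keeps steps 2 and 3
 separate, so a malformed byte sequence fails loudly instead of being
 silently replaced.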

 Comments?

 Thanks for the help,

 Reid

 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)lists.wikimedia.org
 http://lists.wikimedia.org/mailman/listinfo/wikitech-l
