Neil Harris wrote:
Mohamed Magdy wrote:
Reid Priedhorsky wrote:
Hi,
I am doing some analysis and need to convert article URLs to article
titles. For example,
http://en.wikipedia.org/wiki/Question:_Are_We_Not_Men%3F_Answer:_We_Are_Dev…
converts to
Question: Are We Not Men? Answer: We Are Devo!
I've been doing some searching around but haven't found a specific
procedure documented anywhere. It looks to me like standard URL
unescaping followed by replacing underscores with spaces, but I wonder
if there is more.
Pointers to documentation or an explanation would be most appreciated.
Thanks,
Reid
http://en.wikipedia.org/wiki/Percent-encoding
http://netzreport.googlepages.com/online_tool_for_url_en_decoding.html
And you should also read
http://tools.ietf.org/html/rfc3987
http://en.wikipedia.org/wiki/Unicode
and
http://en.wikipedia.org/wiki/UTF-8
since all Wikipedia URLs are actually IRIs, which are mapped to URLs
by first UTF-8 encoding the Unicode string, then percent-encoding the
resulting bytes.
To reverse the process, first percent-decode the URL as needed, then
decode the resulting UTF-8 byte string into Unicode.
For example,
Fabry-P%C3%A9rot_interferometer
decodes to
Fabry-Pérot interferometer
...since %C3%A9 decodes to the two bytes 0xC3 0xA9, which are the UTF-8
encoding of Unicode code point U+00E9, the character "é".
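
As a rough illustration of the two decoding stages, here is a minimal
Python sketch using only the standard library. The variable names are
just for illustration, and the final round trip assumes Python's default
quoting rules happen to match how this particular title was encoded,
which will not hold for every title.

```python
from urllib.parse import quote, unquote_to_bytes

encoded = "Fabry-P%C3%A9rot_interferometer"

# Stage 1: percent-decode. The result is a byte string, not text yet.
raw = unquote_to_bytes(encoded)

# Stage 2: interpret those bytes as UTF-8 to get a Unicode string.
text = raw.decode("utf-8")

# Underscores stand in for spaces in the title.
title = text.replace("_", " ")

# Going the other way: UTF-8 encode, then percent-encode the bytes.
# (Assumption: default quote() behaviour is close enough for this example.)
re_encoded = quote(text.encode("utf-8"))

print(raw)
print(title)
print(re_encoded)
```

Keeping the byte string and the Unicode string as separate values makes
it easier to see where a malformed URL fails: bad percent escapes show up
in stage 1, invalid UTF-8 in stage 2.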
Fortunately, you don't need to deal with IDN or Punycode at all, since
the article title is encoded entirely in the URL path.
OK, to summarize what this thread seems to be saying, the procedure is:
1. Split the title from the rest of the URL.
2. Percent-decode the title, yielding a UTF-8 byte string.
3. Convert the byte string into a Unicode string.
4. Replace underscores with spaces.
Step 4 yields the article title, which is what appears in the XML dumps.
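
For what it's worth, the four steps above could be sketched in Python
roughly as follows. This is only a sketch under some assumptions: the
function name url_to_title and the "/wiki/" prefix constant are made up
for illustration, the sample URL is built from the Fabry-Pérot example
earlier in the thread, and other URL forms (such as index.php?title=...)
are not handled.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Assumption: article URLs look like http://<host>/wiki/<encoded title>.
WIKI_PREFIX = "/wiki/"


def url_to_title(url: str) -> str:
    """Convert an article URL to an article title (hypothetical helper)."""
    # Step 1: split the title from the rest of the URL.
    path = urlsplit(url).path
    if not path.startswith(WIKI_PREFIX):
        raise ValueError("not a /wiki/ article URL: %r" % url)
    encoded_title = path[len(WIKI_PREFIX):]

    # Step 2: percent-decode the title, yielding a UTF-8 byte string.
    raw = unquote_to_bytes(encoded_title)

    # Step 3: convert the byte string into a Unicode string.
    # This raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = raw.decode("utf-8")

    # Step 4: replace underscores with spaces.
    return text.replace("_", " ")


print(url_to_title("http://en.wikipedia.org/wiki/Fabry-P%C3%A9rot_interferometer"))
```

Splitting the URL before decoding matters: an escaped "?" (%3F) in a
title stays part of the path, whereas decoding first would turn it into
something that looks like the start of a query string.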
Comments?
Thanks for the help,
Reid
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l