Reid Priedhorsky wrote:
OK, to summarize what this thread seems to be saying, the procedure is:
1. Split the title from the rest of the URL.
2. Percent-decode the title, yielding a UTF-8 byte string.
3. Convert the byte string into a Unicode string.
4. Replace underscores with spaces.
Step 4 yields the article title, which is what appears in the XML dumps.
Hi folks,
Thanks again for your help. Here's the Python code I came up with, in
the hopes that it might be useful for others.
import urllib
urllib.unquote(title).decode('UTF-8').replace('_', ' ')
Reid