Hi
OK. I have recently been tearing my hair out trying to understand why
a character set encoding bug suddenly popped up in my application.
After a bit of debugging I think I may have an explanation. Let me
introduce the problem...
Consider the simple fragment:
    import wikipedia

    page = wikipedia.Page(wikipedia.getSite(), 'Main_Page')
    content = page.get()
The error I was getting was:
AssertionError: charset for mylocalwiki:en changed from UTF-8 to iso-8859-1
So what is happening here?
1) A call to get() is made on the page object
2) A call to _getEditPage() is made on the page object
3) A call to getURL() is made on the Site object with path argument =
"/wiki/index.php?title=Main_Page&action=edit".
4) The full URL is created adding a protocol and hostname etc.
5) A connection to the URL is opened. The resulting HTTP headers and
HTML text are buffered.
6) The content type is read from the resulting header. The charset
variable is found from the content type header. (UTF-8 in my case)
7) The character set is checked to make sure it matches the family
file setting. It does. The Site object charset instance variable is
set to the same value.
8) The resulting text is converted to Unicode using the found charset.
9) A call to _getUserData() is made on the Site object
10) A call to isBlocked() is made on the Site object
11) A call to api_address() is made on the Site object
12) A call to api_address() is made on the Family object
13) A call to apipath() is made on the Family object
14) apipath() returns the hardcoded relative URL "/w/api.php" to the
caller. However, this URL is invalid for my wiki! The wiki prefix
should be "/wiki" in my case, not "/w".
15) The relative URL is now:
"/w/api.php?action=query&meta=userinfo&uiprop=blockinfo"
16) getURL() then requests this URL, prefixing it with the protocol
and server host name.
17) The URL is not found. A 404 is returned from the underlying Apache
server. The returned content type is once again parsed, but now the
Apache server is returning a charset of ISO-8859-1, not UTF-8.
18) checkCharset() is once again called on the Site object leading to
the AssertionError.
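To make steps 6, 7, 17 and 18 concrete, here is a minimal sketch of the kind of check that fails. It is not the actual pywikipedia code: SketchSite, check_charset() and charset_from_content_type() are stand-in names I made up for illustration, and the header values are the ones from my case.

```python
import re


def charset_from_content_type(header, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    match = re.search(r'charset=["\']?([\w.:-]+)', header, re.IGNORECASE)
    return match.group(1) if match else default


class SketchSite:
    """Hypothetical stand-in for the Site object; not the real class."""

    def __init__(self, name):
        self.name = name
        self.charset = None  # remembered after the first response

    def check_charset(self, charset):
        # The first response sets the charset; later responses must agree.
        if self.charset is None:
            self.charset = charset
        assert self.charset.lower() == charset.lower(), (
            "charset for %s changed from %s to %s"
            % (self.name, self.charset, charset))


site = SketchSite("mylocalwiki:en")

# Steps 6-7: the edit page comes back from the wiki as UTF-8.
site.check_charset(charset_from_content_type("text/html; charset=UTF-8"))

# Steps 17-18: the 404 page comes from Apache itself, as ISO-8859-1.
try:
    site.check_charset(
        charset_from_content_type("text/html; charset=iso-8859-1"))
    failure = None
except AssertionError as err:
    failure = str(err)

print(failure)
```

The point is that the second response does not come from the wiki at all, so its charset says nothing about the wiki's encoding.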
So how to solve it? One might argue that the Apache server should also
be returning UTF-8, but that will most likely only solve half the
problem.
Instead, I created a new method in my family file, overriding the
default version in the Family class:
    def apipath(self, code):
        return '/wiki/api.php'
Maybe someone should look at the Family class. It seems both
querypath() and path() also use the hardcoded "/w" prefix.
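As a rough sketch of what overriding all three methods could look like in a family file: SketchFamily below is only a stand-in for the real Family class, the "prefix" attribute is my own invention, and the query.php file name is an assumption on my part (only index.php and api.php appear in the URLs above).

```python
class SketchFamily:
    """Hypothetical stand-in for the Family base class, with '/w' hardcoded."""

    def path(self, code):
        return '/w/index.php'

    def querypath(self, code):
        return '/w/query.php'  # file name is an assumption

    def apipath(self, code):
        return '/w/api.php'


class MyLocalWikiFamily(SketchFamily):
    """Family for a wiki whose scripts live under '/wiki' instead of '/w'."""

    prefix = '/wiki'  # hypothetical attribute, not part of the real class

    def path(self, code):
        return self.prefix + '/index.php'

    def querypath(self, code):
        return self.prefix + '/query.php'

    def apipath(self, code):
        return self.prefix + '/api.php'


family = MyLocalWikiFamily()
print(family.path('en'))
print(family.querypath('en'))
print(family.apipath('en'))
```

Keeping the prefix in one place means the three methods cannot drift apart, which is roughly what I would hope the base class could do itself.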
I am surprised no one else has experienced this error before - unless
everyone is overriding the methods from Family.py.
Comments?
Regards
Lee Francis
--
_____
In theory, there is no difference between theory and practice. But, in
practice, there is.
-- Jan L.A. van de Snepscheut