Hi
OK. I have recently been tearing my hair out trying to understand why a character set encoding bug suddenly popped up in my application. After a bit of debugging I think I have an explanation. Let me introduce the problem...
Consider the simple fragment:
    page = wikipedia.Page(wikipedia.getSite(), 'Main_Page')
    content = page.get()
The error I was getting was:
    AssertionError: charset for mylocalwiki:en changed from UTF-8 to iso-8859-1
So what is happening here?
1) A call to get() is made on the page object
2) A call to _getEditPage() is made on the page object
3) A call to getURL() is made on the Site object with path argument = "/wiki/index.php?title=Main_Page&action=edit".
4) The full URL is built by adding the protocol, hostname, etc.
5) A connection to the URL is opened. The resulting HTTP headers and HTML text are buffered.
6) The content type is read from the resulting header. The charset variable is found from the content type header. (UTF-8 in my case)
7) The character set is checked to make sure it matches the family file setting. It does. The Site object charset instance variable is set to the same value.
8) The resulting text is converted to unicode using the found charset.
9) A call to _getUserData() is made on the Site object
10) A call to isBlocked() is made on the Site object
11) A call to api_address() is made on the Site object
12) A call to api_address() is made on the Family object
13) A call to apipath() is made on the Family object
14) apipath() returns the hardcoded relative URL "/w/api.php" to the caller! However, this URL is invalid for my wiki: the prefix should be "/wiki" in my case, not "/w".
15) The relative URL is now: "/w/api.php?action=query&meta=userinfo&uiprop=blockinfo"
16) The method getURL() then requests this URL, prefixed with the protocol and server host name.
17) The URL is not found. A 404 is returned from the underlying Apache server. The returned content type is once again parsed, but now the Apache server is returning a charset of ISO-8859-1, not UTF-8.
18) checkCharset() is once again called on the Site object leading to the AssertionError.
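To make steps 6, 7, 17 and 18 concrete, here is a minimal sketch of the failure in isolation. It is not the pywikipedia code: SiteSketch, check_charset() and parse_charset() are hypothetical stand-ins I made up for illustration, and the iso-8859-1 fallback for a header without a charset is my own assumption.

```python
def parse_charset(content_type, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value.

    The default is an assumption for headers that carry no charset.
    """
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip('"')
    return default


class SiteSketch:
    """Hypothetical stand-in for the Site object; it only tracks the charset."""

    def __init__(self, name):
        self.name = name
        self.charset = None

    def check_charset(self, charset):
        # The first response fixes the charset; later responses must agree.
        if self.charset is None:
            self.charset = charset
        assert self.charset.lower() == charset.lower(), (
            "charset for %s changed from %s to %s"
            % (self.name, self.charset, charset)
        )


site = SiteSketch("mylocalwiki:en")

# Steps 6-7: the edit page is served by the wiki as UTF-8.
site.check_charset(parse_charset("text/html; charset=UTF-8"))

# Steps 17-18: the 404 page is served by Apache with a different charset.
try:
    site.check_charset(parse_charset("text/html; charset=iso-8859-1"))
    failure = None
except AssertionError as err:
    failure = str(err)

print(failure)
```

The point is that the second response never came from the wiki at all, so comparing its charset with the first one is what trips the assertion.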
So how to solve it? One might argue that the Apache server should also be returning UTF-8, but that will most likely only solve half the problem.
Instead, I created a new method in my family file, overriding the default version in wikipedia.py:
    def apipath(self, code):
        return '/wiki/api.php'
Maybe someone should look at the Family class. It seems both querypath() and path() also use the hardcoded "/w" prefix.
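If the other two methods need the same treatment, the family file could override all three from one prefix. This is only a sketch under assumptions: FamilySketch is a hypothetical stand-in for the real base class so the example runs on its own, the script_prefix attribute is my own invention, and the "/w/query.php" default is a guess at what querypath() returns.

```python
class FamilySketch:
    """Hypothetical stand-in for the base Family class with "/w" defaults."""

    def path(self, code):
        return '/w/index.php'

    def querypath(self, code):
        return '/w/query.php'  # assumed default, not checked against the source

    def apipath(self, code):
        return '/w/api.php'


class MyLocalWikiFamily(FamilySketch):
    """Family for a wiki whose scripts live under "/wiki" instead of "/w"."""

    script_prefix = '/wiki'  # hypothetical attribute, defined once and reused

    def path(self, code):
        return self.script_prefix + '/index.php'

    def querypath(self, code):
        return self.script_prefix + '/query.php'

    def apipath(self, code):
        return self.script_prefix + '/api.php'


family = MyLocalWikiFamily()
print(family.path('en'))
print(family.querypath('en'))
print(family.apipath('en'))
```

Keeping the prefix in one place means a wiki installed under yet another directory only needs that single value changed.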
I am surprised no one else has experienced this error before - unless everyone is overriding the methods from Family.py.
Comments?
Regards,
Lee Francis
pywikipedia-l@lists.wikimedia.org