I am writing a PyGTK application. I would like to download text only (with formatting) from Wikipedia and display it in my application. I think I am close to a solution, but I have reached an impasse due to my ignorance of most of the MediaWiki API.
My plan has been to use GtkMozembed in my application to render the page, so I need to retrieve HTML. What is close to working is to use the index.php API with action=render and title=<search string for the Wikipedia page>. The data that I retrieve does display in my browser, but it has the following undesired characteristics:

1. All images appear (I want none).
2. There are sections at the end that I don't want (Further reading, External links, Notes, See also, References).
3. Some characters are not rendered correctly (e.g., IPA: [ˈvÉ”lfgaÅ‹ amaˈdeus ˈmoËtsart]).
To fix 1 and 2, I could perhaps use an HTML parser and delete the offending items, but I wonder whether there is a proper solution using the MediaWiki API (such as a prop parameter with which I could at least specify that I don't want any images).
I assume that 3 is a Unicode problem, but I don't know how to fix it.
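For concreteness, a minimal sketch of the request being described (modern Python 3 spelling; the base URL and title are only examples):

```python
from urllib.parse import urlencode

def render_url(title, base='https://en.wikipedia.org/w/index.php'):
    # action=render asks index.php for the rendered HTML fragment of the
    # page, without the surrounding site chrome.
    return base + '?' + urlencode({'action': 'render', 'title': title})

# The fragment can then be fetched with urllib.request.urlopen(render_url(...))
# and handed to the embedded browser widget.
```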
Jeffrey Barish wrote:
- There are sections at the end that I don't want (Further reading, External links, Notes, See also, References).
Those sections are part of the content. The API doesn't have any parameter to include/exclude them.
- All images appear (I want none).
Same issue, although this one is easy to fix yourself: just remove everything matching /<img.*?>/
- Some characters are not rendered correctly (e.g., IPA: [ˈvɔlfgaŋ amaˈdeus ˈmoËtsart]).
You're showing the text as windows-1252, but it is UTF-8.
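Both fixes above can be sketched in a few lines of Python (assuming the fragment arrives as raw bytes; the function names are illustrative):

```python
import re

def strip_images(html):
    # <img> has no closing tag, so removing each non-greedy match of
    # /<img.*?>/ is enough (DOTALL in case a tag is wrapped across lines).
    return re.sub(r'<img.*?>', '', html, flags=re.DOTALL)

def decode_fragment(raw):
    # MediaWiki serves UTF-8; decoding explicitly avoids the
    # windows-1252 misreading shown above.
    return raw.decode('utf-8')
```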
On Saturday 21 March 2009 16:15:54 Platonides wrote:
Jeffrey Barish wrote:
- There are sections at the end that I don't want (Further reading, External links, Notes, See also, References).
Those sections are part of the content. The API doesn't have any parameter to include/exclude them.
- All images appear (I want none).
Same issue, although this one is easy to fix yourself: just remove everything matching /<img.*?>/
It seems that images appear in <div class="thumbcaption"></div> blocks. Would you advise using regular expressions to remove these blocks, or should I use something like BeautifulSoup to parse the page formally and then remove elements?
- Some characters are not rendered correctly (e.g., IPA: [ˈvɔlfgaŋ amaˈdeus ˈmoËtsart]).
You're showing the text as windows-1252, but it is UTF-8.
It seems that the HTML lacks the meta tag that specifies the character encoding. (The original page includes it, of course.) Is there a parameter that causes action=render to include that metadata? Am I using the wrong action? Can I safely assume that all Wikipedia pages use UTF-8?
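For the "parse formally, then remove elements" route asked about above, BeautifulSoup is not strictly required; here is a sketch using only the standard library's HTMLParser (Python 3 names; it is minimal and ignores comments and doctypes):

```python
from html.parser import HTMLParser

class ImageStripper(HTMLParser):
    """Rebuild the document text, skipping every <img> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <img ... /> in XHTML-style output.
        if tag != 'img':
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

def remove_images(html):
    parser = ImageStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```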
Jeffrey Barish wrote:
It seems that images appear in <div class="thumbcaption"></div> blocks. Would you advise using regular expressions to remove these blocks, or should I use something like BeautifulSoup to parse the page formally and then remove elements?
Only thumbnailed images appear in such blocks. You should really just remove <img> tags if you want to get rid of images.
It seems that the HTML lacks the meta tag that specifies the character encoding. (The original page includes it, of course.) Is there a parameter that causes action=render to include that metadata? Am I using the wrong action? Can I safely assume that all Wikipedia pages use UTF-8?
Yes, MediaWiki always outputs UTF-8.
Roan Kattouw (Catrope)
On 3/22/09 7:49 AM, Jeffrey Barish wrote:
It seems that the HTML lacks the meta tag that specifies the character encoding. (The original page includes it, of course.) Is there a parameter that causes action=render to include that metadata? Am I using the wrong action?
You're getting an HTML fragment here, not a full HTML document. As the data consumer, it's your responsibility to ensure you're sending correct Content-Type headers or wrapping things in <html><head>blah blah</head></html> as necessary.
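A sketch of that wrapping (assuming the fragment has already been decoded to a Unicode string; the template is just one reasonable shape):

```python
PAGE = '''<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>%s</title>
</head>
<body>
%s
</body>
</html>'''

def wrap_fragment(title, fragment):
    # The meta charset declaration is what keeps the embedded browser
    # from falling back to a windows-1252 interpretation of the bytes.
    return PAGE % (title, fragment)
```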
Can I safely assume that all Wikipedia pages use UTF-8?
Yes.
-- brion
mediawiki-api@lists.wikimedia.org