Jeffrey Barish wrote:
I am writing a PyGTK application. I would like to be able to download text
only (with formatting) from Wikipedia and display it in my application. I
think that I am close to a solution, but I have reached an impasse due to my
ignorance of most of the mediawiki API.
My plan has been to use GtkMozembed in my application to render the page, so I
need to retrieve HTML. What is close to working is to use the index.php API
with action=render and title=<search string for the Wikipedia page>. The
data that I retrieve does display in my browser, but it has the following
undesired characteristics:
2. There are sections at the end that I don't want (Further reading, External
links, Notes, See also, References).
Those sections are part of the content. The API doesn't have any
parameter to include/exclude them.
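Since the API won't do it for you, one option is to cut the HTML yourself after downloading it. The following is only a rough sketch, not something the API provides: it assumes each of those sections starts with an <h2> element whose text contains the section title, and that everything from the first such heading onward can be dropped. The names UNWANTED and cut_trailing_sections are made up for this example, and the title match is a loose substring test, so a heading like "Footnotes" would also match "Notes".

```python
import re

# Section titles to drop, taken from the list in the question.
UNWANTED = ("Further reading", "External links", "Notes", "See also", "References")

# Assumption: each section begins with an <h2>...</h2> heading.
HEADING = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def cut_trailing_sections(html):
    """Truncate html at the first <h2> whose text contains an unwanted title."""
    for match in HEADING.finditer(html):
        # Strip nested tags (spans, edit links) to get the visible heading text.
        title = TAG.sub("", match.group(1))
        if any(name in title for name in UNWANTED):
            return html[:match.start()]
    return html
```

If the rendered markup uses a different heading structure, the HEADING pattern would need adjusting.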
1. All images appear (I want none).
Same issue.
Although this one is easier to work around: remove everything matching /<img.*?>/
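In Python that replacement could look roughly like this. It is a minimal sketch: strip_images is just a name picked for the example, and the pattern uses [^>]* instead of .*? so that tags spanning several lines are matched too. A regular expression is not a real HTML parser, so an <img> tag with a literal ">" inside an attribute value would not be removed cleanly.

```python
import re

# Matches a whole <img ...> tag, including the self-closing form <img ... />.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def strip_images(html):
    """Return html with every <img> tag removed."""
    return IMG_TAG.sub("", html)

sample = '<p>Mozart <img src="a.png" alt="portrait" /> was a composer.</p>'
print(strip_images(sample))
```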
3. Some characters are not rendered correctly (e.g., IPA: [ˈvɔlfgaŋ
amaˈdeus ˈmoËtsart]).
You're showing the text as windows-1252, but it is UTF-8.
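To illustrate the difference, here is a small sketch that decodes the same bytes both ways. The sample string is only an illustration, and decode_page and wrap_fragment are hypothetical helper names. The wrapper assumes that action=render returns an HTML fragment without a <head>, so the renderer has nothing telling it the charset; adding a meta tag is one way to state it explicitly.

```python
# Illustrative sample containing IPA characters outside ASCII.
sample = "ˈmoːtsart"
raw = sample.encode("utf-8")  # stands in for the bytes received from the server

def decode_page(raw_bytes):
    """Decode downloaded bytes as UTF-8, the encoding the server sends."""
    return raw_bytes.decode("utf-8")

def wrap_fragment(fragment):
    """Wrap an HTML fragment in a document that declares UTF-8."""
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head><body>" + fragment + "</body></html>"
    )

right = decode_page(raw)
# Decoding the same bytes as windows-1252 produces the kind of garbage you saw.
wrong = raw.decode("cp1252", errors="replace")

print(right)
print(wrong)
```

The first line printed is the original text; the second shows each multi-byte character turned into "Ë" followed by another stray character.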