On Saturday 21 March 2009 16:15:54 Platonides wrote:
Jeffrey Barish wrote:
I am writing a PyGTK application. I would like to be able to download text
only (with formatting) from Wikipedia and display it in my application. I
think that I am close to a solution, but I have reached an impasse due to my
ignorance of most of the MediaWiki API.
My plan has been to use GtkMozembed in my application to render the page,
so I need to retrieve html. What is close to working is to use the
index.php API with action=render and title=<search string for the
Wikipedia page>. The data that I retrieve does display in my browser,
but it has the following undesired characteristics:
2. There are sections at the end that I don't want (Further reading,
External links, Notes, See also, References).
Those sections are part of the content. The API doesn't have any
parameter to include/exclude them.
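Since the API has no parameter for this, one client-side workaround is to truncate the HTML at the first unwanted heading, since those sections all come at the end of an article. A sketch, assuming action=render wraps section titles in <span class="mw-headline"> (true for the output I've seen, but worth verifying against the pages you fetch):

```python
import re

# Headings of trailing sections we don't want to display.
UNWANTED = ("See also", "Notes", "References", "Further reading",
            "External links")

def strip_trailing_sections(html):
    """Cut the rendered HTML off at the first unwanted section heading.

    Assumes the action=render output marks section titles with
    <span class="mw-headline">...</span> inside an <h1>-<h6> element.
    """
    pattern = re.compile(
        r'<h[1-6][^>]*>\s*<span class="mw-headline"[^>]*>\s*(%s)\s*</span>'
        % "|".join(UNWANTED))
    m = pattern.search(html)
    return html[:m.start()] if m else html
```

Truncating assumes the sections sit at the top level of the fragment; if one of those headings ever appears mid-article, everything after it would be lost too.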
1. All images appear (I want none).
Same issue, although this one is easier to fix yourself: remove /<img.*?>/
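In Python that substitution is a one-liner (the sample html string here is just an illustration; the non-greedy .*? matches each tag separately but still assumes no ">" inside attribute values):

```python
import re

html = '<p>Mozart <img src="/images/mozart.jpg" alt="Mozart"> portrait</p>'
# Non-greedy so each <img ...> tag is matched individually.
html = re.sub(r'<img.*?>', '', html)
print(html)  # <p>Mozart  portrait</p>
```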
It seems that images appear in <div class="thumbcaption"></div> blocks.
Would you advise using regular expressions to remove these blocks, or should
I use something like BeautifulSoup to parse the page formally and then
remove elements?
3. Some characters are not rendered correctly (e.g., IPA: [ˈvɔlfgaŋ
amaˈdeus ˈmoËtsart]).
You're showing the text as windows-1252, but it is UTF-8.
It seems that the HTML lacks the meta field that specifies the character
encoding; the original page does include it, of course. Is there a parameter
that causes action=render to include the metadata? Am I using the wrong
action?
Can I safely assume that all Wikipedia pages use UTF-8?
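MediaWiki serves its pages as UTF-8, so decoding the bytes as UTF-8 should be safe. Because action=render returns only a body fragment with no <head>, one workaround is to wrap the fragment in a minimal page that declares the charset before handing it to the embedded renderer. A sketch — the en.wikipedia.org URL is just an example, and whether GtkMozembed honors the meta tag is an assumption to verify:

```python
import urllib.parse

def render_url(title):
    """Build the index.php action=render URL for a page title.

    The host is an example; substitute the wiki you are querying.
    """
    return ('https://en.wikipedia.org/w/index.php?action=render&title='
            + urllib.parse.quote(title))

def wrap_fragment(body_bytes):
    """Decode the UTF-8 fragment and wrap it in a page that declares the
    charset, so the renderer doesn't fall back to windows-1252."""
    body = body_bytes.decode('utf-8')
    return ('<html><head>'
            '<meta http-equiv="Content-Type" '
            'content="text/html; charset=utf-8">'
            '</head><body>%s</body></html>' % body)
```

Alternatively, if the embedding widget lets you state the encoding when loading the data, declaring UTF-8 there makes the wrapper unnecessary.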
--
Jeffrey Barish