Jeffrey Barish wrote:
On Saturday 21 March 2009 16:15:54 Platonides wrote:
Jeffrey Barish wrote:
I am writing a PyGTK application. I would like to be able to download text only (with formatting) from Wikipedia and display it in my application. I think that I am close to a solution, but I have reached an impasse due to my ignorance of most of the mediawiki API.
My plan has been to use GtkMozembed in my application to render the page, so I need to retrieve html. What is close to working is to use the index.php API with action=render and title=<search string for the Wikipedia page>. The data that I retrieve does display in my browser, but it has the following undesired characteristics:
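The fetch described above can be sketched in a few lines. This is only an illustration of the request shape, not the poster's actual code; the page title and User-Agent string are made-up examples.

```python
# Minimal sketch: fetch rendered HTML for a page via index.php?action=render.
import urllib.parse
import urllib.request

def build_render_url(title):
    # action=render returns the parsed article body as an HTML fragment.
    params = urllib.parse.urlencode({"action": "render", "title": title})
    return "https://en.wikipedia.org/w/index.php?" + params

def fetch_rendered(title):
    # Wikipedia asks clients to send a descriptive User-Agent.
    req = urllib.request.Request(
        build_render_url(title),
        headers={"User-Agent": "example-pygtk-reader/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")  # MediaWiki output is UTF-8
```

Example: `fetch_rendered("Wolfgang Amadeus Mozart")` returns the article body as an HTML string.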
- There are sections at the end that I don't want (Further reading,
External links, Notes, See also, References).
Those sections are part of the content. The API doesn't have any parameter to include/exclude them.
- All images appear (I want none).
Same issue. They are easy to strip out afterwards, though: just remove everything matching /<img.*?>/.
It seems that images appear in <div class="thumbcaption"></div> blocks. Would you advise using regular expressions to remove these blocks, or should I use something like BeautifulSoup to parse the page formally and then remove elements?
Only thumbnailed images appear in such blocks. You should really just remove <img> tags if you want to get rid of images.
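That suggestion can be sketched as a one-line substitution. `[^>]*` is used here instead of the `.*?` from the regex above as a safety margin, so the match can never run past the closing bracket of the tag:

```python
import re

# Remove every <img ...> tag from the rendered HTML.
IMG_RE = re.compile(r"<img[^>]*>", re.IGNORECASE)

def strip_images(html):
    return IMG_RE.sub("", html)
```

Note this leaves the surrounding thumbnail markup (captions, divs) in place; it only drops the images themselves, which matches the advice above.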
- Some characters are not rendered correctly (e.g., IPA: [ˈvɔlfgaŋ amaˈdeus ˈmoËtsart]).
You're showing the text as windows-1252, but it is UTF-8.
It seems that the HTML lacks the meta field that specifies the character encoding. The original page does not lack it, of course. Is there a parameter that causes action=render to include the metadata? Am I using the wrong action? Can I safely assume that all Wikipedia pages use UTF-8?
Yes, MediaWiki always outputs UTF-8.
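Given that, one way to handle the missing meta field is to wrap the action=render fragment in a minimal page that declares UTF-8 before handing it to the embedded renderer. The wrapper markup below is an assumption for illustration, not something the API returns:

```python
# Wrap an HTML fragment in a page that declares UTF-8, so an embedded
# browser widget (e.g. GtkMozembed) decodes it correctly.
def wrap_utf8(fragment):
    return (
        '<html><head>'
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        '</head><body>%s</body></html>' % fragment
    )
```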
Roan Kattouw (Catrope)