Hi, I need to extract only the text from a Wikipedia page. That is, I need to remove all wiki markup, section headings, etc., to extract only the text a reader will read.
For example, for the text:
'''Paris''' ([[Help:IPA|pronounced]] /paʁi/ in French; /ˈpaɹɪs/ in English) is the [[communes of France|capital city]] of [[France]]. It is situated on the [[Seine|River Seine]], in northern France, at the heart of the [[Île-de-France (region)|Île-de-France]] [[Regions of France|region]] (aka "Paris Region"; in French: ''Région Parisienne'' or ''RP''). The City of Paris has an estimated population of 2,167,994 within its administrative limits (January 2006).
I need to get the following after extraction:
Paris (pronounced /paʁi/ in French; /ˈpaɹɪs/ in English) is the capital city of France. It is situated on the River Seine, in northern France, at the heart of the Île-de-France region (aka "Paris Region"; in French: Région Parisienne or RP). The City of Paris has an estimated population of 2,167,994 within its administrative limits (January 2006).
Using the Pywikipediabot framework, I can get the raw wikitext, but not the text without markup. Since I need to do some textual analysis on the article contents, I need to get rid of all the extra markup, citation tags, and templates.
So, what is the best/easiest way to do this? Thanks in advance.
Ragib
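
One possible approach, offered as a rough sketch rather than a complete solution: once you have the raw wikitext, a handful of regular expressions can strip the most common constructs (citation tags, templates, section headings, internal links, and bold/italic quotes). The function name `strip_wiki_markup` is made up for illustration, and the sketch assumes the input contains no tables, file/image links, or HTML beyond `<ref>` tags.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Hypothetical helper: reduce raw wikitext to reader-visible prose."""
    # Drop self-closing citation tags, then paired <ref>...</ref> citations.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)

    # Drop templates. Only innermost {{...}} pairs match, so repeat
    # until nothing changes in order to unwind nested templates.
    while True:
        stripped = re.sub(r"\{\{[^{}]*\}\}", "", text)
        if stripped == text:
            break
        text = stripped

    # Drop whole section-heading lines such as "== History ==".
    text = re.sub(r"^=+.*?=+[ \t]*$\n?", "", text, flags=re.MULTILINE)

    # Replace [[target|label]] with label, and [[target]] with target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)

    # Remove bold/italic quote runs ('' through ''''').
    text = re.sub(r"'{2,5}", "", text)

    return text.strip()


sample = "'''Paris''' is the [[communes of France|capital city]] of [[France]]."
print(strip_wiki_markup(sample))
```

Regular expressions will miss edge cases in real articles (infoboxes with unusual nesting, tables, image captions). For anything beyond a quick analysis, a dedicated wikitext parser is probably the safer choice; for example, the mwparserfromhell library provides a `strip_code()` method on parsed wikitext for this purpose.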