Extracting only the text from a Wikipedia page - Wikitech-l

24 Feb 2008


Hi,
I need to extract only the text from a Wikipedia page. That is, I
need to remove all wiki markup, section headings, etc., to extract only
the text a reader will see.
For example, given the text:
'''Paris''' ([[Help:IPA|pronounced]] /paʁi/ in French; /ˈpaɹɪs/ in
English) is the [[communes of France|capital city]] of [[France]]. It
is situated on the [[Seine|River Seine]], in northern France, at the
heart of the [[Île-de-France (region)|Île-de-France]] [[Regions of
France|region]] (aka "Paris Region"; in French: ''Région Parisienne''
or ''RP''). The City of Paris has an estimated population of 2,167,994
within its administrative limits (January 2006)."
I need to get the following after extraction:
Paris (pronounced /paʁi/ in French; /ˈpaɹɪs/ in English) is the
capital city of France. It is situated on the River Seine, in northern
France, at the heart of the Île-de-France region (aka "Paris Region";
in French: Région Parisienne or RP). The City of Paris has an
estimated population of 2,167,994 within its administrative limits
(January 2006)."
Using the Pywikipediabot framework, I can get the raw wikitext, but not
the text without the markup. Since I need to do some textual analysis on
the article contents, I need to get rid of all the extra markup, citation
tags, and other templates.
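
To make the goal concrete, here is a minimal sketch of a naive regex-based
approach, using only Python's standard `re` module. The function name
`strip_wiki_markup` is hypothetical and is not part of Pywikipediabot. It
assumes the raw wikitext is already available as a string, and it only
handles <ref> tags, templates, internal links, bold/italic quotes, and
section headings. Tables, images, external links, and other HTML are not
handled, so it should be read as an illustration of the desired result
rather than a complete solution.

```python
import re

# Innermost templates only: {{ ... }} with no braces inside.
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
# [[target|label]] or [[target]]; group 1 is the text a reader sees.
_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")


def strip_wiki_markup(text):
    """Return a rough plain-text version of a wikitext string (sketch only)."""
    # Drop self-closing reference tags first, then paired <ref>...</ref>.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)

    # Remove templates from the inside out so nested ones disappear too.
    while True:
        text, count = _TEMPLATE.subn("", text)
        if count == 0:
            break

    # Remove whole section heading lines such as "== History ==".
    text = re.sub(r"^={2,}.*?={2,}[ \t]*$", "", text, flags=re.MULTILINE)

    # Replace internal links with their visible label.
    text = _LINK.sub(r"\1", text)

    # Strip bold/italic quote runs ('' and ''').
    text = re.sub(r"'{2,}", "", text)

    # Collapse runs of whitespace (including line breaks) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "'''Paris''' is the [[communes of France|capital city]] of [[France]]."
    print(strip_wiki_markup(sample))
```

Templates are removed repeatedly, innermost first, because a single regex
pass cannot match nested braces.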
So, what is the best/easiest way to do this? Thanks in advance.
Ragib
-- 
Ragib Hasan
PhD Student
Dept of Computer Science
University of Illinois at Urbana-Champaign
201 N Goodwin Avenue
Urbana IL 61801

Website:
http://www.ragibhasan.com
http://netfiles.uiuc.edu/rhasan/www