Hi, I need to extract only the text from a Wikipedia page. That is, I need to remove all wiki markup, section headings, etc., to extract only the text a reader will read.
For example, for the text:
'''Paris''' ([[Help:IPA|pronounced]] /paʁi/ in French; /ˈpaɹɪs/ in English) is the [[communes of France|capital city]] of [[France]]. It is situated on the [[Seine|River Seine]], in northern France, at the heart of the [[Île-de-France (region)|Île-de-France]] [[Regions of France|region]] (aka "Paris Region"; in French: ''Région Parisienne'' or ''RP''). The City of Paris has an estimated population of 2,167,994 within its administrative limits (January 2006).
I need to get the following after extraction:
Paris (pronounced /paʁi/ in French; /ˈpaɹɪs/ in English) is the capital city of France. It is situated on the River Seine, in northern France, at the heart of the Île-de-France region (aka "Paris Region"; in French: Région Parisienne or RP). The City of Paris has an estimated population of 2,167,994 within its administrative limits (January 2006).
Using the Pywikipediabot framework, I can get the raw wikitext, but not the text without markup. Since I need to do some textual analysis on the article contents, I need to get rid of all the extra markup, citation tags, and templates.
So, what is the best/easiest way to do this? Thanks in advance.
Ragib
On Sat, Feb 23, 2008 at 8:32 PM, Ragib Hasan ragibhasan@gmail.com wrote:
Hi, I need to extract only the text from a Wikipedia page. That is, I need to remove all wiki markup, section headings, etc., to extract only the text a reader will read.
Get the rendered HTML, and remove all the HTML markup.
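A minimal sketch of that approach in Python, using only the standard library. It assumes the rendered HTML is obtained through the MediaWiki API's action=parse endpoint; the function names, the User-Agent string, and the set of skipped tags are illustrative choices, not part of Pywikipediabot.

```python
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping elements that are not running prose."""

    # Assumption: tables, footnote markers (<sup>), scripts and styles
    # are not wanted for textual analysis. Adjust to taste.
    SKIP_TAGS = {"script", "style", "table", "sup"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip all HTML markup and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def fetch_rendered_html(title, api_url="https://en.wikipedia.org/w/api.php"):
    """Ask the MediaWiki API to render a page and return its HTML."""
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    })
    request = urllib.request.Request(
        api_url + "?" + params,
        headers={"User-Agent": "text-extract-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    text = data["parse"]["text"]
    # Older response formats wrap the HTML in a {"*": ...} object.
    return text if isinstance(text, str) else text["*"]
```

Calling html_to_text(fetch_rendered_html("Paris")) would then give the article as plain text. Navigation boxes, image captions, and similar page furniture may still come through, so some extra filtering is likely needed before analysis.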
On Sun, Feb 24, 2008 at 7:02 AM, Ragib Hasan ragibhasan@gmail.com wrote:
Hi, I need to extract only the text from a Wikipedia page. That is, I need to remove all wiki markup, section headings, etc., to extract only the text a reader will read.
Ragib Hasan
PhD Student
Dept of Computer Science
University of Illinois at Urbana-Champaign
201 N Goodwin Avenue
Urbana IL 61801

Website: http://www.ragibhasan.com
http://netfiles.uiuc.edu/rhasan/www
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Since you need it for textual analysis, you have more options. You can use WikiPrep (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/) or WikiXRay (http://meta.wikimedia.org/wiki/WikiXRay). There is also a better-maintained version of WikiPrep, looked after by someone named Tomaz, which you can get from http://wikiprep.cvs.sourceforge.net/wikiprep/wikiprep/trunk/?hideattic=0. Download wikiprep.pl and images.pm from there.
Hope this helps.