Felipe Ortega wrote:
Hi all.
I'm adding some tweaks to the WikiXRay parser of meta-history dumps. I now extract internal and external links, and so on, but I'd also like to extract the plain text (without HTML code and, possibly, also filtering out wiki markup).
Does anyone know of a Python library to do that? I believe there should be something out there, as there are bots and crawlers automating data extraction from one wiki to another.
Thanks in advance for your comments.
Felipe.
If you have the html, extracting the plain text is really easy. Just skip everything between < and > and decode entities :P
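Something like this, as a minimal sketch of that tag-skipping approach using only the Python standard library (the function name strip_html is just illustrative, and this is a rough heuristic rather than a real HTML parser, so it will trip over things like scripts, comments, or stray < characters in the text):

    import re
    import html

    def strip_html(source):
        # Drop anything that looks like a tag, i.e. everything between < and >.
        text = re.sub(r'<[^>]*>', ' ', source)
        # Decode entities such as &amp; and &eacute;.
        return html.unescape(text)

    print(strip_html('<p>caf&eacute; &amp; <b>bar</b></p>'))
    # prints something like: " café &  bar  "

For anything messier than well-formed markup you'd probably want a proper parser (e.g. the stdlib html.parser module) instead of a regex, but for quick extraction the above usually does the job.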