Re: [Wikitech-l] Library to filter HTML

1 Feb 2008

On Fri, Feb 1, 2008 at 9:14 AM, Huji <huji.huji(a)gmail.com> wrote:
...
  Searching Google for HTML strippers for Python gives me lots of useful
  results, most of them based on regular expressions. What else do you
  want? (You can of course expand the regexp pattern to include wiki tags.)

  Huji

  On 1/31/08, Felipe Ortega <glimmer_phoenix(a)yahoo.es> wrote:

 Hi all.

 I'm adding some tweaks to the WikiXRay parser of meta-history dumps. I now
 extract internal, external links, and so on, but I'd also like to extract
 the plain text (without HTML code and, possibly, also filtering wiki tags).

 Does anyone know of a good Python library to do that? I believe there should
 be something out there, as there are bots and crawlers that automate the data
 extraction process from one wiki to another.

 Thanks in advance for your comments.

 Felipe.

 _______________________________________________
 Wikitech-l mailing list
 Wikitech-l(a)lists.wikimedia.org
 http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Use HTMLParser.HTMLParser from the standard library to filter HTML
tags. Override only handle_data(data), handle_charref(name) and
handle_entityref(name) to get pure data without tags.
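
A minimal sketch of that approach, written for Python 3, where the module
is html.parser rather than HTMLParser. The names TextExtractor and
strip_tags are made up for illustration. It assumes convert_charrefs=False,
since otherwise recent Python 3 versions decode character references
themselves and never call handle_charref or handle_entityref.

```python
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False so the two reference handlers below are called.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text found between tags.
        self.parts.append(data)

    def handle_charref(self, name):
        # Numeric references such as &#65; or &#x41; (name is "65" or "x41").
        self.parts.append(unescape("&#%s;" % name))

    def handle_entityref(self, name):
        # Named references such as &amp; (name is "amp").
        self.parts.append(unescape("&%s;" % name))

    def text(self):
        return "".join(self.parts)


def strip_tags(markup):
    """Return the text of an HTML fragment with the tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Calling strip_tags("<p>one <b>two</b></p>") returns the text without the
markup. This only handles HTML: wiki markup such as [[links]] or
{{templates}} passes through untouched and would need separate filtering.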

Regards,
Bryan
