On Fri, Feb 1, 2008 at 9:14 AM, Huji <huji.huji(a)gmail.com> wrote:
Searching Google for HTML strippers for Python gives me lots of useful
results, most of them based on regular expressions. What else do you
want? (You can of course expand the regexp pattern to include wiki tags.)
Huji
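
A minimal sketch of the regular-expression approach mentioned above, assuming Python 3. The patterns and the name strip_markup are illustrative only, not taken from any particular library. They cover HTML tags, [[wiki links]] and bold/italic quote runs; templates, tables and nested markup are not handled.

```python
import re

# Hypothetical patterns for illustration; real wiki markup has many more cases.
TAG_RE = re.compile(r"<[^>]+>")                               # any HTML tag
WIKILINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")   # [[target|label]] or [[target]]
QUOTES_RE = re.compile(r"'{2,}")                              # ''italic'' / '''bold''' markers


def strip_markup(text):
    """Return text with HTML tags and a few simple wiki tags removed."""
    text = TAG_RE.sub("", text)
    text = WIKILINK_RE.sub(r"\1", text)   # keep the label, or the target if there is no label
    text = QUOTES_RE.sub("", text)
    return text


if __name__ == "__main__":
    print(strip_markup("<b>Hello</b> [[World|world]] and ''[[Python]]''"))
```

Regular expressions like these are easy to extend, but they will mis-handle malformed or nested markup, so they suit rough text extraction rather than exact rendering.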
On 1/31/08, Felipe Ortega <glimmer_phoenix(a)yahoo.es> wrote:
Hi all.
I'm adding some tweaks to the WikiXRay parser of meta-history dumps. I now
extract internal and external links, and so on, but I'd also like to extract
the plain text (without HTML code and, possibly, also filtering wiki tags).
Does anyone know a good Python library to do that? I believe there should
be something out there, as there exist bots and crawlers automating the data
extraction process from one wiki to another.
Thanks in advance for your comments.
Felipe.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
Use HTMLParser.HTMLParser from the standard library to filter HTML
tags. Override only handle_data(data), handle_charref(name) and
handle_entityref(name) to get pure data without tags.
Regards,
Bryan
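
The HTMLParser suggestion above could be sketched roughly as follows. This assumes Python 3, where the module is named html.parser (in Python 2 it was HTMLParser). The names TextExtractor and extract_text are hypothetical, chosen for illustration. Only the three handlers named above are overridden, so tags are dropped and only text and decoded entities are collected.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and decoded character references, ignoring tags."""

    def __init__(self):
        # convert_charrefs=False so that handle_charref and handle_entityref
        # are actually called instead of being folded into handle_data.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, name):
        # name is e.g. "233" or "x41" for &#233; / &#x41;
        self.chunks.append(html.unescape("&#%s;" % name))

    def handle_entityref(self, name):
        # name is e.g. "amp" for &amp;; unknown entities are left unchanged
        self.chunks.append(html.unescape("&%s;" % name))


def extract_text(markup):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text("<p>Caf&eacute; &amp; t&#233; &#x41;</p>"))
```

This strips HTML only. Wiki tags such as [[links]] pass through as ordinary text and would still need separate handling.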