Library to filter HTML


Felipe Ortega

31 Jan 2008

7:03 a.m.

Hi all.

I'm adding some tweaks to the WikiXRay parser of meta-history dumps. I now extract internal links, external links, and so on, but I'd also like to extract the plain text (without HTML code and, possibly, with wiki tags filtered out as well).

Does anyone know of a good Python library to do that? I believe there should be something out there, since bots and crawlers already automate data extraction from one wiki to another.

Thanks in advance for your comments.

Felipe.


Huji

1 Feb 2008

2:14 a.m.

Searching Google for HTML strippers for Python gives me lots of useful results, most of them based on regular expressions. What else do you want? (You can, of course, extend the regexp pattern to cover wiki tags.)

Huji
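
As an illustration of the regular-expression approach Huji describes, here is a minimal sketch, not code from the thread. The function name `strip_markup` and the three patterns are assumptions chosen for the example; they cover only simple tags, `[[target|label]]` links, and quote-style bold/italic markers, and they will not handle templates, tables, or malformed markup.

```python
import re

# Hypothetical patterns for illustration; real wiki text needs many more cases.
TAG_RE = re.compile(r"<[^>]+>")                                # any HTML-like tag
WIKI_LINK_RE = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")   # [[target|label]] or [[target]]
QUOTE_RE = re.compile(r"'{2,}")                                # ''italic'' and '''bold''' markers


def strip_markup(text):
    """Return text with simple HTML tags and basic wiki markup removed."""
    text = TAG_RE.sub("", text)           # drop tags, keep their contents
    text = WIKI_LINK_RE.sub(r"\1", text)  # keep the label (or the target if no label)
    text = QUOTE_RE.sub("", text)         # drop bold/italic quote runs
    return text


if __name__ == "__main__":
    print(strip_markup("'''Bold''' <b>tag</b> [[Main Page|home]] and [[Python]]"))
```

The patterns are compiled once at module level so that repeated calls over many revisions do not pay the compilation cost each time.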


Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

Bryan Tong Minh

4:48 a.m.


Use HTMLParser.HTMLParser from the standard library to filter out HTML tags. Override only handle_data(data), handle_charref(name), and handle_entityref(name) to get the plain text without tags.

Regards, Bryan
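
The suggestion above could look roughly like the following sketch. It is written for Python 3, where the class lives in `html.parser` rather than the Python 2 `HTMLParser` module named in the message. The names `TextExtractor` and `strip_html` are hypothetical, and passing `convert_charrefs=False` is an assumption made here so that the two reference handlers are actually called (by default, Python 3.5+ decodes references itself and hands the result to `handle_data`).

```python
from html import unescape
from html.parser import HTMLParser  # Python 3 location of the class


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text content and ignores all tags."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref and
        # handle_entityref instead of decoding references on its own.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        # Text between tags (this also includes script/style contents).
        self.chunks.append(data)

    def handle_charref(self, name):
        # name is the part after "&#", e.g. "233" or "xe9".
        self.chunks.append(unescape("&#%s;" % name))

    def handle_entityref(self, name):
        # name is the entity name, e.g. "amp" or "eacute".
        self.chunks.append(unescape("&%s;" % name))


def strip_html(markup):
    """Return the text of markup with tags removed and references decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(strip_html("<p>Caf&eacute; &amp; <b>bar</b> &#233;</p>"))
```

Tag handlers such as `handle_starttag` are left at their default no-op behaviour, which is what drops the tags.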

Felipe Ortega

3 Feb 2008

10:27 a.m.

Thank you for your answers. Performance is an issue in my case, as I have to parse each and every revision in each dump. So I wasn't asking for just any solution (I had already searched Google before asking ;) ), but for the best one, in your opinion.

Regards,

Felipe.

