Hi all.
I'm adding some tweaks to the WikiXRay parser of meta-history dumps. I now extract internal and external links, and so on, but I'd also like to extract the plain text (without HTML code and, possibly, also filtering wiki tags).
Does anyone know a good Python library to do that? I believe there should be something out there, as there are bots and crawlers automating data extraction from one wiki to another.
Thanks in advance for your comments.
Felipe.
Hi Felipe,
I've found Beautiful Soup to be a useful Python-based HTML parser.
http://www.crummy.com/software/BeautifulSoup/
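As a rough illustration of the same idea using only Python's standard library (the class and function names below are my own, not part of Beautiful Soup or any other package):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores every tag."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

print(strip_html("<p>Internal <a href='/wiki/Page'>link</a> and <b>bold</b> text.</p>"))
# → Internal link and bold text.
```

Beautiful Soup does considerably more (broken-markup recovery, tree navigation), but for plain-text extraction this is essentially the operation in question.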
Kurt
----- Original Message ----- From: "Felipe Ortega" glimmer_phoenix@yahoo.es To: wiki-research-l@lists.wikimedia.org Sent: Thursday, January 31, 2008 8:17:53 AM (GMT-0500) America/New_York Subject: [Wiki-research-l] Library to filter HTML
_______________________________________________ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wiki-research-l
I've used BeautifulSoup to get plain text out of rendered HTML dumps. It's slow and doesn't work that well. What you really want to do it right is an actual MediaWiki parser to strip the syntax out for you.
Try this one: http://code.pediapress.com/wiki/wiki
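For wiki markup rather than HTML, a crude regex pass can illustrate what such a parser has to do. This is a toy sketch of my own, not the pediapress code; a real parser handles templates, tables, and nesting far more robustly:

```python
import re

def strip_wikitext(text):
    """Very rough wikitext-to-plain-text pass (illustration only)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                      # drop simple {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # [[Target|label]] -> label
    text = re.sub(r"\[https?://\S+ ([^\]]+)\]", r"\1", text)        # [url label] -> label
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quote markup
    text = re.sub(r" {2,}", " ", text)                              # collapse leftover spaces
    return text

print(strip_wikitext("'''Bold''' [[Help:Link|links]], a [http://example.org site], and {{stub}} text."))
# → Bold links, a site, and text.
```

Nested templates and multi-line constructs defeat these patterns quickly, which is exactly why a dedicated parser is the better answer.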
s/right/write/. pre-morning coffee still :)
Thanks a lot. Performance is an important issue in this case (think about parsing the entire enwiki).
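On the performance point, memory usually matters as much as speed on a full dump, so the usual trick is to stream the XML rather than build a whole tree. A minimal sketch (the helper name is mine; the tag layout follows the dump schema):

```python
import io
import xml.etree.ElementTree as ET

def iter_page_texts(xml_file):
    """Yield the contents of each <text> element from a MediaWiki-style
    XML dump, clearing elements as we go so memory use stays flat."""
    for _event, elem in ET.iterparse(xml_file):
        if elem.tag.rsplit("}", 1)[-1] == "text":   # tolerate XML namespaces
            yield elem.text or ""
            elem.clear()

sample = b"<mediawiki><page><revision><text>Hello world</text></revision></page></mediawiki>"
print(list(iter_page_texts(io.BytesIO(sample))))
# → ['Hello world']
```

With `iterparse` the dump is consumed revision by revision, so the full enwiki history never has to fit in RAM at once.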
I'll give it a try and post my comments.
Thanks for the feedback.
Felipe.