Thanks a lot. Performance is an important issue in this case (think about parsing the entire enwiki).
I'll give it a try and post my comments.
Thanks for the feedback.
Felipe.
Brian <Brian.Mingus@colorado.edu> wrote:
s/right/write/. pre-morning coffee still :)
On Thu, Jan 31, 2008 at 9:33 AM, Brian <Brian.Mingus@colorado.edu> wrote:
I've used BeautifulSoup to get plain text out of rendered HTML dumps. It's slow and doesn't work that well. What you really want to do it right is an actual mediawiki parser to strip the syntax out for you.
Try this one: http://code.pediapress.com/wiki/wiki
On Thu, Jan 31, 2008 at 7:57 AM, Kurt Luther <luther@cc.gatech.edu> wrote:
Hi Felipe,
I've found Beautiful Soup to be a useful Python-based HTML parser.
http://www.crummy.com/software/BeautifulSoup/
Kurt
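For reference, here is a rough sketch of what "filtering HTML" down to plain text can look like. It deliberately uses Python's standard-library html.parser as a stand-in rather than Beautiful Soup, so it runs without extra installs; the names TextExtractor and html_to_text are made up for illustration and do not come from either library. It is not a MediaWiki parser and will not understand wiki syntax.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment (illustrative only)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>wiki</b> world</p>"))
```

Whether something like this is fast enough for a full enwiki dump is an open question; it would need to be measured on real data.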
----- Original Message -----
From: "Felipe Ortega" <glimmer_phoenix@yahoo.es>
To: wiki-research-l@lists.wikimedia.org
Sent: Thursday, January 31, 2008 8:17:53 AM (GMT-0500) America/New_York
Subject: [Wiki-research-l] Library to filter HTML
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wiki-research-l