Felipe Ortega wrote:
Hi all.
I'm adding some tweaks to the WikiXRay parser of meta-history dumps. I now extract internal and external links, and so on, but I'd also like to extract the plain text (without HTML code and, possibly, also filtering out wiki markup).
Does anyone know of a Python library to do that? I believe there should be something out there, as there are bots and crawlers automating data extraction from one wiki to another.
Thanks in advance for your comments.
Felipe.
If you have the html, extracting the plain text is really easy. Just skip everything between < and > and decode entities :P
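Something like this, as a minimal sketch of that tag-skipping approach using only the Python standard library (the function name strip_html is just illustrative, and this is a rough heuristic rather than a real HTML parser, so it will trip over things like scripts, comments, or stray < characters in the text):

    import re
    import html

    def strip_html(source):
        # Drop anything that looks like a tag, i.e. everything between < and >.
        text = re.sub(r'<[^>]*>', ' ', source)
        # Decode entities such as &amp; and &eacute;.
        return html.unescape(text)

    print(strip_html('<p>caf&eacute; &amp; <b>bar</b></p>'))
    # prints something like: " café &  bar  "

For anything messier than well-formed markup you'd probably want a proper parser (e.g. the stdlib html.parser module) instead of a regex, but for quick extraction the above usually does the job.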