Harish TM wrote:
I was trying to parse the Wikipedia dumps but unfortunately I find the XML
file that can be downloaded a little hard to parse. I was wondering if there
is a neat way to extract:
1. The article title
2. The article content (without links to articles in other languages,
external links and so on)
The article content *contains* those links, so I guess you mean you want
to parse the text and remove certain elements of it?
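As a rough illustration of "parse the text and remove certain elements", here is a minimal Python sketch that strips interlanguage links and bracketed external links from a piece of wikitext. The function name strip_links and both regular expressions are my own assumptions, not anything MediaWiki provides; the interlanguage pattern is only a heuristic (a short lowercase code before the colon) and will miss or over-match some prefixes.

```python
import re

# Heuristic (assumption): an interlanguage link is [[xx:Title]] where xx is a
# 2-3 letter lowercase code, optionally with hyphenated suffixes (zh-min-nan).
INTERLANG = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)*:[^\[\]|]+\]\]\n?")

# Bracketed external links: [http://example.org label] or [http://example.org]
EXTERNAL = re.compile(r"\[(?:https?|ftp)://[^\s\]]+(?: ([^\]]*))?\]")


def strip_links(wikitext):
    """Remove interlanguage links; reduce external links to their label."""
    text = INTERLANG.sub("", wikitext)
    # Keep the label of an external link if it has one, drop the URL itself.
    text = EXTERNAL.sub(lambda m: m.group(1) or "", text)
    return text
```

Anything more involved (templates, nested links, tables) quickly outgrows regular expressions, which is why a real wikitext parser is the safer route for serious use.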
3. The category.
Again, that's part of article text.
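Since the title sits in its own XML element while the categories live inside the article text, one possible approach is to stream the dump and pull the categories out of each page's wikitext. The sketch below assumes the usual dump layout of page, title and text elements; iter_pages, local and the CATEGORY pattern are hypothetical names of mine, and the pattern only recognises the English "Category:" prefix.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]], capturing Name.
CATEGORY = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def local(tag):
    # ElementTree reports tags as "{namespace}name"; keep only the name.
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext, categories) for each <page> in a dump."""
    title, text = None, ""
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text, CATEGORY.findall(text)
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory


# Tiny stand-in for a real dump file (the namespace URI is an assumption).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>Some text.
[[Category:Tests]]
[[Category:Examples|Example]]</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump you would pass an open file (or a bz2 file object) instead of the in-memory sample; iterparse reads it incrementally, so the whole file never has to fit in memory at once.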
Also I find that there are a large number of tools that allow one to convert
plain text to MediaWiki text. What if I want to go the other way and
extract information exactly the way it appears on the Wikipedia site?
Run the MediaWiki parser on it.
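One way to run the parser without installing MediaWiki yourself is the web API's action=parse module, which accepts wikitext and returns rendered HTML. The sketch below is only an illustration under assumptions: the endpoint URL, the User-Agent string, and the helper names build_parse_request and render are mine, and for bulk work a local MediaWiki install would be kinder to the public servers.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia endpoint; any MediaWiki api.php would do.
API = "https://en.wikipedia.org/w/api.php"


def build_parse_request(wikitext, title="Sandbox"):
    """Build a POST request asking the wiki to render the given wikitext."""
    params = {
        "action": "parse",
        "format": "json",
        "formatversion": "2",
        "contentmodel": "wikitext",
        "prop": "text",
        "title": title,  # context page for the parse, hypothetical default
        "text": wikitext,
    }
    body = urllib.parse.urlencode(params).encode("utf-8")
    headers = {"User-Agent": "wikitext-render-sketch/0.1 (example)"}
    return urllib.request.Request(API, data=body, headers=headers)


def render(wikitext):
    """Send the request and return the rendered HTML as a string."""
    with urllib.request.urlopen(build_parse_request(wikitext)) as resp:
        data = json.load(resp)
    html = data["parse"]["text"]
    # Older response formats wrap the HTML in a dict under the "*" key.
    return html["*"] if isinstance(html, dict) else html
```

Calling render("'''bold''' text") needs network access and returns whatever HTML the wiki produces for that markup.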