I have been trying to parse the Wikipedia dumps, but I find the downloadable XML file a little hard to work with. Is there a neat way to extract the following for each article?
1. The article title
2. The article content (without links to articles in other languages, external links and so on)
3. The category
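To make the request concrete, here is a rough sketch of the kind of extraction I have in mind, not a working solution. It assumes the dump follows the usual MediaWiki export layout (`<page>` elements containing `<title>` and `<revision>/<text>`), and that categories appear in the article text as `[[Category:Name]]` links. The names `iter_pages`, `CATEGORY_RE` and `SAMPLE` are my own, made up for illustration. The text it returns is still raw wiki markup, which is the part I do not know how to clean up.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumption: categories are written as [[Category:Name]] or [[Category:Name|sort key]].
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]")

# Tiny made-up stand-in for a real dump file, for trying the function out.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Foo</title><revision>"
    b"<text>Body text. [[Category:Bar]] [[Category:Baz|sort key]]</text>"
    b"</revision></page></mediawiki>"
)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, raw_wikitext, categories) for each <page> in the dump.

    `source` is a filename or a binary file object. The dump is streamed
    with iterparse, so the whole file is never loaded at once.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            categories = [c.strip() for c in CATEGORY_RE.findall(text)]
            yield title, text, categories
            title, text = None, ""
            elem.clear()  # release the page's children to keep memory down


if __name__ == "__main__":
    for page_title, wikitext, cats in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, cats, len(wikitext))
```

For a real dump, I assume one would pass the path to the decompressed XML file (or an open `bz2` file object) instead of `io.BytesIO(SAMPLE)`.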
Also, I see that there are many tools for converting plain text to MediaWiki markup. What if I want to go the other way and extract the text exactly as it appears on the Wikipedia site?
Harish