I have written a complete set of tools that do all of this, but they are not open sourced. I would suggest a simple C or C++ program reading from stdin and looking for just the tags you want. Be careful: the buffering required to parse these files is LARGE. You will need at least a 16K buffer, as many lines read with fgets can exceed 8192 bytes.
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
Tags are of the form <TAGNAME> at the start of a section and </TAGNAME> at the end.
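Something along these lines (a rough, untested sketch, nowhere near the full tools, and the tag names are only an example) illustrates the idea: read the dump line by line from stdin with fgets into an oversized buffer and print the contents of <title> tags.

/*
 * Minimal sketch: scan a MediaWiki XML dump from stdin and print
 * the contents of <title> elements. Swap the tag strings for
 * whichever elements you want to extract.
 */
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 65536   /* well above the 16K minimum; dump lines can exceed 8192 bytes */

int main(void)
{
    static char line[BUF_SIZE];
    const char *open_tag  = "<title>";
    const char *close_tag = "</title>";

    while (fgets(line, sizeof line, stdin) != NULL) {
        char *start = strstr(line, open_tag);
        if (start == NULL)
            continue;                 /* no opening tag on this line */
        start += strlen(open_tag);
        char *end = strstr(start, close_tag);
        if (end == NULL)
            continue;                 /* titles normally fit on one line; skip otherwise */
        *end = '\0';                  /* cut off the closing tag */
        printf("%s\n", start);
    }
    return 0;
}

Pipe a decompressed dump into it, e.g. bzcat pages-articles.xml.bz2 | ./a.out (the filename is just an example). Elements such as <text> span many lines, so for those you would additionally keep a flag recording whether you are currently between the opening and closing tags.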
Jeff
Harish TM wrote:
I was trying to parse the Wikipedia dumps, but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
2. The article content (without links to articles in other languages, external links, and so on)
3. The category
Also, I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Harish