I have written a complete set of tools that does all of this, but they are
not open source. I would suggest a simple C or C++ program that reads the
dump from stdin and looks for just the tags you want.
Be careful: the buffering required to parse these files is LARGE.
You will need at least a 16K buffer, as many lines
read with fgets can exceed 8192 bytes in size.
Look for the opening tag of each section. Category links are
embedded in the text of the articles themselves.
Tags have the form <TAGNAME> to start and </TAGNAME> to end.
Jeff
Harish TM wrote:
I was trying to parse the Wikipedia dumps, but unfortunately I find the
XML file that can be downloaded a little hard to parse. I was wondering
if there is a neat way to extract:
1. The article title
2. The article content (without links to articles in other languages,
   external links and so on)
3. The category.
Also, I find that there are a large number of tools that allow one to convert
plain text to MediaWiki text. What if I want to go the other way and
extract information exactly the way it appears on the Wikipedia site?
Harish
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l