I have written a complete set of tools that do all of this, but they are not open sourced. I would suggest a simple C or C++ program reading from stdin and looking for just the tags you want. Be careful: the buffering required to parse these files is LARGE. You will need at least a 16K buffer, as many lines read with fgets can exceed 8192 bytes.
Look for the beginning tags for each section. Category links are embedded in the articles themselves.
Tags are of the form <TAGNAME> at the start of a section and </TAGNAME> at the end.
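Something along these lines (a rough, untested sketch, nowhere near the full tools, and the tag names are only an example) illustrates the idea: read the dump line by line from stdin with fgets into an oversized buffer and print the contents of <title> tags.

/*
 * Minimal sketch: scan a MediaWiki XML dump from stdin and print
 * the contents of <title> elements. Swap the tag strings for
 * whichever elements you want to extract.
 */
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 65536   /* well above the 16K minimum; dump lines can exceed 8192 bytes */

int main(void)
{
    static char line[BUF_SIZE];
    const char *open_tag  = "<title>";
    const char *close_tag = "</title>";

    while (fgets(line, sizeof line, stdin) != NULL) {
        char *start = strstr(line, open_tag);
        if (start == NULL)
            continue;                 /* no opening tag on this line */
        start += strlen(open_tag);
        char *end = strstr(start, close_tag);
        if (end == NULL)
            continue;                 /* titles normally fit on one line; skip otherwise */
        *end = '\0';                  /* cut off the closing tag */
        printf("%s\n", start);
    }
    return 0;
}

Pipe a decompressed dump into it, e.g. bzcat pages-articles.xml.bz2 | ./a.out (the filename is just an example). Elements such as <text> span many lines, so for those you would additionally keep a flag recording whether you are currently between the opening and closing tags.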
Jeff
Harish TM wrote:
I was trying to parse the Wikipedia dumps, but unfortunately I find the XML file that can be downloaded a little hard to parse. I was wondering if there is a neat way to extract:
1. The article title
2. The article content (without links to articles in other languages, external links, and so on)
3. The category
Also, I find that there are a large number of tools that allow one to convert plain text to MediaWiki text. What if I want to go the other way and extract information exactly the way it appears on the Wikipedia site?
Harish