[Wikitech-l] WikiDump Parsing

17 Jan 2007


I am trying to parse the Wikipedia dumps, but I find the downloadable XML
file a little hard to parse. Is there a neat way to extract:
                         1. The article title
                         2. The article content (without links to articles
                            in other languages, external links, and so on)
                         3. The category

Also, there are many tools that convert plain text to MediaWiki text. What
if I want to go the other way and extract the content exactly as it appears
on the Wikipedia site?

Harish
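
For reference, the three items asked for above can be pulled out of a dump with a streaming XML parser. What follows is only a minimal sketch in Python, not an official tool: the names iter_pages, CATEGORY_RE, INTERLANG_RE and EXTERNAL_RE are made up for illustration, the regular expressions are rough assumptions about how category, interlanguage and external links look in wikitext, and the remaining markup (templates, internal links, formatting) is left as-is, so the result is not what the rendered site shows.

```python
import re
import xml.etree.ElementTree as ET

# Assumed wikitext patterns; real dumps contain many more variants.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]", re.IGNORECASE)
INTERLANG_RE = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)?:[^\]]+\]\]\n?")
EXTERNAL_RE = re.compile(r"\[https?://[^\s\]]+(?: ([^\]]*))?\]")


def local(tag):
    """Drop the '{namespace}' prefix so the export schema version does not matter."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, body, categories) for each <page> in a dump.

    `source` is a file name or a binary file object. The dump is read
    incrementally, so the whole file is never parsed in one go.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            categories = [c.strip() for c in CATEGORY_RE.findall(text)]
            body = CATEGORY_RE.sub("", text)
            body = INTERLANG_RE.sub("", body)
            # Keep the label of an external link, drop the URL and brackets.
            body = EXTERNAL_RE.sub(lambda m: m.group(1) or "", body)
            yield title, body.strip(), categories
            title, text = None, ""
            elem.clear()  # release the content of the finished page
```

Calling iter_pages("dump.xml") (the file name is a placeholder) gives one tuple per page. Getting the text exactly as it appears on the site would additionally require rendering the wikitext, which this sketch does not attempt.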