Re: [Wikitech-l] WikiDump Parsing

26 Feb 2007

Brion Vibber wrote:
...
  Harish TM wrote:
  I was trying to parse the Wikipedia dumps, but unfortunately I find the
  XML file that can be downloaded a little hard to parse. I was wondering
  if there is a neat way to extract:

  1. The article title

 /mediawiki/page/title

  2. The article content (without links to articles in other languages,
  external links and so on)

 The article content *contains* those links, so I guess you mean you want
 to parse the text and remove certain elements of it?

  3. The category.

 Again, that's part of article text.
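
The three answers above can be sketched in Python. This is only an illustration under stated assumptions, not the method anyone in this thread used: it streams the dump with the standard library, reads the title and text of each page, and then uses rough regular expressions to pull out category links and to drop interlanguage and external links. The function names (iter_pages, extract_categories, strip_links), the sample XML, and the regex patterns are all hypothetical; in particular, the interlanguage pattern assumes short lowercase language codes such as "de" or "zh-min-nan" and will miss longer prefixes.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns: rough approximations of wikitext link syntax.
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]", re.IGNORECASE)
INTERLANG_RE = re.compile(r"\[\[[a-z]{2,3}(?:-[a-z]+)*:[^\]\n]+\]\]")
EXTERNAL_RE = re.compile(r"\[https?://[^\s\]]+(?:\s+([^\]]*))?\]")


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) per <page>, parsing incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                # If a page holds several revisions, the last one wins.
                text = node.text or ""
        yield title, text
        elem.clear()  # release the finished page's children


def extract_categories(text):
    """Return category names found in the wikitext, without sort keys."""
    return [name.strip() for name in CATEGORY_RE.findall(text)]


def strip_links(text):
    """Remove category and interlanguage links; keep external link labels."""
    text = CATEGORY_RE.sub("", text)
    text = INTERLANG_RE.sub("", text)
    text = EXTERNAL_RE.sub(lambda m: m.group(1) or "", text)
    return text.strip()


# Tiny made-up sample in the general shape of a dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <revision>
      <text>Intro with [http://example.org a link].
[[Category:Demo pages|Example]]
[[de:Beispiel]]</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
for title, text in pages:
    print(title, extract_categories(text), repr(strip_links(text)))
```

For a real dump, the BytesIO object would be replaced by an open file handle. Regular expressions only approximate wikitext; templates, nested links, and image syntax need a real parser.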

  Also I find that there are a large number of tools that allow one to
  convert plain text to MediaWiki text. What if I want to go the other way
  and extract information exactly the way it appears on the Wikipedia site?

 Run the wiki parser on it.

Or download it already parsed
(http://static.wikipedia.org/downloads/November_2006/en/).

Matthew Flaschen
