Re: [Wikitech-l] WikiDump Parsing

18 Jan 2007


      On 1/17/07, Harish TM harish.tmh@gmail.com wrote:
...
This is a big problem for me, because when I do a regular expression
match on "Category :", I also get the lines within the article that are
references to other categories. I just want the category that
the current article belongs to. It's worse because sometimes it's "
Category :", sometimes "   Category :", and at times "Category :".
Any category link without a leading colon will add the category.  With
a leading colon, it links.  So [[Category:Foo]] categorizes,
[[:Category:Foo]] links.  You can use that to your advantage.
You may wish to look over the parser at
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Parser.php?v....
 Be aware that the parser is not short, simple, or clean, because (as
you've discovered) neither is the markup language.
The legal title characters, by default, are
" %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+".  Therefore, a
reasonable PCRE regex (with no string escaping of backslashes et al)
might be:
/\[\[Category:([ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+]+)\]\]/i
Which would give you the category name.  Hopefully.  In most cases.
Because other things can make a title invalid.  Basically, it's a
mess, and the only surefire way around it is using MediaWiki itself to
do the parsing.
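
For illustration, here is a rough Python sketch of the same idea.  It is
not what MediaWiki does internally, only an approximation under a few
assumptions: Python's re module stands in for PCRE; \x80-\xFF is widened
to all non-ASCII code points because Python matches characters rather
than bytes; whitespace around "Category" and the colon is tolerated; and
a trailing "|sort key" is skipped.  The names TITLE_CHARS, CATEGORY_RE
and categories are made up for this example.

```python
import re

# Character class adapted from the legal title characters quoted above.
# Assumption: \x80-\xFF (bytes in PHP) becomes "any non-ASCII code point" here.
TITLE_CHARS = r""" %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\u0080-\U0010FFFF+"""

# Matches [[Category:Foo]] but not [[:Category:Foo]], because the text right
# after "[[" must be (optional whitespace and) the word "Category".
# Assumption: an optional "|sort key" may follow the title and is ignored.
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([" + TITLE_CHARS + r"]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def categories(wikitext):
    """Return the category names that the text assigns itself to."""
    return [m.group(1) for m in CATEGORY_RE.finditer(wikitext)]


sample = "Text [[Category:Foo]] see [[:Category:Bar]] and [[ category : Baz Qux |sort]]"
print(categories(sample))
```

As with the regex above, this will still miss corner cases (links inside
comments, nowiki sections, categories added by templates, and so on).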
...
Can't seem to find it. Searching for it seems to give me Wikipedia
articles on parsing!
Basically: install MediaWiki and hack it to do what you want.  The
parser is not well-defined or application-independent.  You can make
your own simplified parser, but it *will* fail in corner-cases (unless
your simplifications consist of hacking out stuff you don't need and
inlining stuff from other files).
...
Just to further clarify what it is that I am looking for: let's say I
want to PRINT out a copy of Wikipedia (I know that's insane, but I
need the text to be as clean as if I were printing it out), with the
articles indexed by title and category. How would I get that
data?
Use your own script to parse the XML part of the dumps and reformat
them how you like.  Then run each article's text through the parser,
and grab the category list it spits back.  It will not be very easy.
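
The XML half of that can be sketched in Python roughly as follows.  This
covers only pulling each page's title and raw wikitext out of a dump; it
does not run the MediaWiki parser.  It assumes the usual dump layout of
<page> elements containing <title> and <revision><text>, and it ignores
the XML namespace because that varies between export versions.  The names
local, iter_pages and SAMPLE are made up for this example, and SAMPLE is
an invented fragment, not real dump data.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page>, streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, ""
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                # If a page has several revisions, the last one seen wins.
                text = node.text or ""
        yield title, text
        elem.clear()  # drop the page's children so memory use stays modest


# Invented two-element fragment standing in for a real dump file.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    b"<page><title>Foo</title><revision>"
    b"<text>Body [[Category:Bar]]</text>"
    b"</revision></page></mediawiki>"
)

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

For a real dump you would pass an open file (or a bz2 stream) instead of
the BytesIO object, then hand each text to whatever does the parsing.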
