Re: [Wikitech-l] WikiDump Parsing

18 Jan 2007


      Simetrical <Simetrical+wikitech@...> writes:
...
The legal title characters, by default, are "
%!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+".  Therefore, a
reasonable PCRE regex (with no string escaping of backslashes et al)
might be:
/\[\[Category:([ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+]+)\]\]/i
Note that one also needs to cope with leading and trailing spaces, and the
"Category" is case-insensitive.  So probably something more like:
[[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
...
Which would give you the category name.  Hopefully.  In most cases.
Because other things can make a title invalid.  Basically, it's a
mess, and the only surefire way around it is using MediaWiki itself to
do the parsing.
Or download the categorylinks.sql file from the same dump and use the entries
from that.  A bit redundant if one needs to parse the markup anyway, though...
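
For illustration, here is a rough Python sketch of the regex approach discussed
above.  It is not MediaWiki's own title validation, only an approximation: the
character class is adapted from the default list quoted earlier, the \x80-\xFF
byte range is widened to "any non-ASCII character" on the assumption that the
text has already been decoded to Unicode, and the names CATEGORY_LINK and
category_names are made up for this example.  The underscore and first-letter
normalisation is likewise an assumption about a default wiki configuration.

```python
import re

# Title characters adapted from the default list quoted above.
# Assumption: input is decoded Unicode text, so the original byte range
# \x80-\xFF is widened here to cover every non-ASCII code point.
TITLE_CHARS = r""" %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\u0080-\U0010FFFF+"""

CATEGORY_LINK = re.compile(
    r"\[\[\s*category\s*:\s*"       # opening brackets, namespace name, colon
    r"([" + TITLE_CHARS + r"]+?)"   # the category title, matched lazily
    r"\s*(?:\|[^\]\n]*)?"           # optional sort key after a pipe
    r"\]\]",                        # closing brackets
    re.IGNORECASE,                  # "Category" is case-insensitive
)


def category_names(wikitext):
    """Return the category titles linked from a chunk of wikitext."""
    names = []
    for match in CATEGORY_LINK.finditer(wikitext):
        # Treat underscores as spaces and collapse runs of whitespace.
        title = " ".join(match.group(1).replace("_", " ").split())
        if title:
            # Assumed normalisation: upper-case the first letter only.
            names.append(title[0].upper() + title[1:])
    return names
```

Calling category_names() on the text of a page returns a list of titles.  Links
containing characters outside the class (a "<", say) are simply skipped, and
anything produced by templates is not seen at all, so the caveats above still
apply.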
A good first step is probably to use an XML toolkit/parser library of some kind
for whatever PL one might be using for the task.  Unless one happens to be using
Haskell, I can't offer any specific advice on that...  (Did the original querent
specify?)  Not a complete solution, since the markup is something of an XML/HTML
mish-mash, but it'll still cope with much of the structure.
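
As one possible shape for that first step, here is a minimal Python sketch
using the standard library's streaming XML parser.  It assumes the dump follows
the usual export layout of <page> elements containing a <title> and a
<revision> with a <text> child; iter_pages and local_name are hypothetical
names for this example.  The XML namespace varies with the export schema
version, so the code strips it rather than hard-coding one.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a dump, one page at a time.

    `source` is a filename or a binary file object.  Assumed layout:
    <page><title>..</title><revision><text>..</text></revision></page>.
    If a page carries several revisions, the last one seen wins.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's children
```

Something like "for title, text in iter_pages(open(path, 'rb'))" then hands
each page's raw markup to whatever does the category extraction.  Streaming
matters here because a full dump is far too large to load as a single tree.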
Slan,
Alai.
