Simetrical <Simetrical+wikitech@...> writes:
The legal title characters, by default, are
" %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+".
Therefore, a
reasonable PCRE regex (with no string escaping of backslashes et al)
might be:
/\[\[Category:([ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+]+)\]\]/i
Note that one also needs to cope with leading and trailing spaces, and that
"Category" is case-insensitive. So probably something more like:
\[\[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
Which would give you the category name. Hopefully.
In most cases.
Because other things can make a title invalid. Basically, it's a
mess, and the only surefire way around it is using MediaWiki itself to
do the parsing.
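
For what it's worth, the regex approach above can be sketched in Python roughly
as follows. This is a minimal sketch, not MediaWiki's own title validation: the
name find_categories is made up for illustration, the optional "|sort key" part
is an addition that the regexes above don't cover, and the pattern is applied
to UTF-8 bytes so that the \x80-\xFF range covers non-ASCII titles.

```python
import re

# Character class copied from the default legal title characters quoted above.
# Kept as raw bytes so that \x80-\xFF matches the high bytes of UTF-8 text.
LEGAL_TITLE_CHARS = rb""" %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+"""

CATEGORY_LINK = re.compile(
    rb"\[\[ *category *: *"                # "[[", namespace name, colon, spaces
    rb"([" + LEGAL_TITLE_CHARS + rb"]+?)"  # category name (lazy, so trailing
                                           # spaces are left out of the capture)
    rb" *(?:\|[^\]]*)?"                    # assumption: optional "|sort key"
    rb"\]\]",
    re.IGNORECASE,                         # "Category" is case-insensitive
)


def find_categories(wikitext):
    """Return the category names linked from a piece of wikitext (a str)."""
    raw = wikitext.encode("utf-8")
    return [m.group(1).decode("utf-8") for m in CATEGORY_LINK.finditer(raw)]
```

Calling find_categories("[[ category : Physics ]]") gives ["Physics"]. Links
whose names contain characters outside the legal set are simply skipped, and
none of the other ways a title can be invalid are checked.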
Or download the categorylinks.sql file from the same dump and use the entries
from that. A bit redundant if one needs to parse the markup anyway, though...
A good first step is probably to use an XML toolkit/parser library of some kind
for whatever programming language one is using for the task. Unless one happens
to be using Haskell, I can't offer any specific advice on that... (Did the
original querent specify?) It's not a complete solution, since the markup is
something of an XML/HTML mish-mash, but it'll still cope with much of the
structure.
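
As an illustration of the XML-parser step, here is a rough sketch using
Python's standard xml.etree.ElementTree (Python being an arbitrary pick, not
something the original question specified). It assumes the dump nests each
page's markup as page > title and page > revision > text; the names iter_pages
and local_name, and the tiny SAMPLE document, are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_source):
    """Yield (title, wikitext) pairs from a file-like XML dump, one per page."""
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""   # an empty <text/> has no text node
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()             # drop the finished page's children


# Made-up miniature dump, only here to show the expected shape.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alan Turing</title>
    <revision>
      <text>Text [[Category:Computer scientists]]</text>
    </revision>
  </page>
  <page>
    <title>Empty</title>
    <revision>
      <text />
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

The parser only gets you as far as the raw wikitext of each page; pulling the
category links out of that text is still a separate job.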
Slan,
Alai.