Simetrical <Simetrical+wikitech@...> writes:
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+]+)\]\]/i
Note that one also needs to cope with leading and trailing spaces, and that "Category" is case-insensitive. So probably something more like:
\[\[ *[Cc][Aa][Tt][Ee][Gg][Oo][Rr][Yy] *: * ...
Which would give you the category name. Hopefully. In most cases. Because other things can make a title invalid. Basically, it's a mess, and the only surefire way around it is using MediaWiki itself to do the parsing.
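For what it's worth, here is a rough Python sketch of that kind of pattern, in case it helps to see it spelled out. The names (LEGAL, CATEGORY_RE, find_categories) are made up for illustration. Two things are my own assumptions rather than anything from the regex above: the \x80-\xFF byte range is widened to cover all non-ASCII code points, since Python matches on decoded strings rather than UTF-8 bytes, and an optional "|sortkey" part is skipped over. It inherits all the caveats already mentioned and knows nothing about comments, nowiki sections or templates.

```python
import re

# Default legal title characters as quoted above. Assumption: \x80-\xFF is
# widened to \x80-\U0010FFFF because we match decoded text, not UTF-8 bytes.
LEGAL = r""" %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\x80-\U0010FFFF+"""

# Optional spaces around the namespace, the colon and the title. The
# "|sortkey" group is an assumption added for this sketch.
CATEGORY_RE = re.compile(
    r"\[\[ *category *: *([" + LEGAL + r"]+?) *(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def find_categories(wikitext):
    """Return the category names found in a string of wikitext."""
    return [m.group(1) for m in CATEGORY_RE.finditer(wikitext)]


sample = (
    "Text [[Category:Cities in Ireland]] and [[ category : Dublin|*]] "
    "but not [[:Category:Linked only]]."
)
categories = find_categories(sample)
```

The link with the leading colon is deliberately not matched, since that form only links to the category page rather than putting the article in it.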
Or download the categorylinks.sql file from the same dump and use the entries from that. A bit redundant if one needs to parse the markup anyway, though...
A good first step is probably to use an XML toolkit/parser library of some kind for whatever programming language one might be using for the task. Unless one happens to be using Haskell, I can't offer any specific advice on that... (Did the original querent specify?) Not a complete solution, since the markup is something of an XML/HTML mish-mash, but it'll still cope with much of the structure.
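To make the XML step a little more concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree.iterparse to stream through a dump one page at a time. It assumes the usual page/title/revision/text nesting of the export format. The helper names (local_name, iter_pages) and the tiny SAMPLE document, including its namespace URL, are invented for the example, so check them against the real dump before relying on any of it.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> element in a dump stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        # If a page holds several revisions, the last <text> seen wins.
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop this page's children to limit memory use


# Hypothetical miniature dump, just enough to exercise the function.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Dublin</title>
    <revision>
      <text>Capital of Ireland. [[Category:Cities in Ireland]]</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump one would pass an open file object (or a decompressing wrapper around it) instead of the BytesIO. The XML parser only gets you as far as the raw wikitext of each page; what's inside still has to be dealt with separately.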
Slan, Alai.