On 1/17/07, Harish TM <harish.tmh(a)gmail.com> wrote:
This is a big problem for me, because when I do a
match on "Category :", I also get the lines within the article that
are references to other categories. I just want the category that
the current article belongs to. It's worse because sometimes it's "
Category :", sometimes " Category :", and at times "Category :"
Any category link without a leading colon will add the category. With
a leading colon, it links. So [[Category:Foo]] categorizes,
[[:Category:Foo]] links. You can use that to your advantage.
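As a rough sketch of that distinction in Python: the pattern and the
name categories_of below are my own illustration, not anything taken
from MediaWiki, and they assume the English "Category" namespace name
and that whitespace around it is tolerated. It ignores <nowiki>,
comments, and categories added by templates.

```python
import re

# Matches [[Category:Name]] or [[Category:Name|sort key]], tolerating
# whitespace around the namespace name and the colon. A leading colon
# ([[:Category:Name]]) is a plain link, so it is deliberately not matched.
CATEGORY_TAG = re.compile(
    r"\[\[\s*Category\s*:\s*([^\[\]|]+?)\s*(?:\|[^\[\]]*)?\]\]",
    re.IGNORECASE,
)


def categories_of(wikitext):
    """Return the category names the text assigns itself to (hypothetical helper)."""
    return [m.group(1) for m in CATEGORY_TAG.finditer(wikitext)]


sample = "[[Category:Foo]] see [[:Category:Bar]] and [[ Category : Baz|sort]]"
print(categories_of(sample))
```

The [[:Category:Bar]] link is skipped because nothing but whitespace is
allowed between the opening brackets and the namespace name.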
You may wish to look over the parser at
Be aware that the parser is not short, simple, or clean, because (as
you've discovered) neither is the markup language.
The legal title characters, by default, are "
%!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF+". Therefore, a
reasonable PCRE regex (with no string escaping of backslashes et al)
Which would give you the category name. Hopefully. In most cases.
Because other things can make a title invalid. Basically, it's a
mess, and the only surefire way around it is using MediaWiki itself to
do the parsing.
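For what it's worth, here is a sketch of checking a candidate name
against that default character class in Python. It assumes the class
applies to the UTF-8 bytes of the title, and the function name is
hypothetical. Passing this check does not make a title valid, for the
reasons above; it only rules out the obviously illegal ones.

```python
import re

# The default legal title characters quoted above, with the string-level
# escaping removed, written as the body of a character class over bytes.
LEGAL_TITLE_CHARS = rb" %!\"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+"

LEGAL_TITLE = re.compile(rb"[" + LEGAL_TITLE_CHARS + rb"]+")


def has_only_legal_title_chars(name):
    """True if every UTF-8 byte of name is in the default legal class."""
    return LEGAL_TITLE.fullmatch(name.encode("utf-8")) is not None


for candidate in ["Foo Bar", "Foo[Bar", "a#b"]:
    print(candidate, has_only_legal_title_chars(candidate))
```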
Can't seem to find it. Searching for it seems to give
articles on parsing!!!
Basically: install MediaWiki and hack it to do what you want. The
parser is not well-defined or application-independent. You can make
your own simplified parser, but it *will* fail in corner cases (unless
your simplifications consist of hacking out stuff you don't need and
inlining stuff from other files).
Just to further clarify what it is that I am looking
for: let's say I
want to PRINT out a copy of Wikipedia (I know that's insane, but I
need the text to be as clean as if I were printing it out), with the
articles indexed by title and category, how would I get that
Use your own script to parse the XML part of the dumps and reformat
them how you like. Then run each article's text through the parser,
and grab the category list it spits back. It will not be very easy.
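A minimal sketch of the first half of that (reading the XML) in Python,
assuming the usual export layout of <page> elements holding a <title>
and a <revision> with a <text> child. The helper names are mine, and
the namespace URI in the sample is only an example, since the code
strips namespaces rather than depending on a version. Running the text
through the parser is not shown.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page>, streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's subtree to limit memory use


# Tiny in-memory stand-in for a dump file; a real path also works as source.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Foo</title><revision><text>Body [[Category:Bar]]</text></revision></page>
<page><title>Baz</title><revision><text>More</text></revision></page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(sample)):
    print(title, len(text))
```

Streaming with iterparse matters here because a full dump will not fit
comfortably in memory as a single tree.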