On 1/17/07, Harish TM harish.tmh@gmail.com wrote:
This is a big problem for me, because when I do a regular expression match on "Category :", I also get the lines within the article that are references to other categories. I just want the categories that the current article belongs to. It's worse because the spacing varies: sometimes it's " Category :", sometimes " Category :", and at times "Category :".
Any category link without a leading colon will add the category. With a leading colon, it links. So [[Category:Foo]] categorizes, [[:Category:Foo]] links. You can use that to your advantage.
You may wish to look over the parser at http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Parser.php?v.... Be aware that the parser is not short, simple, or clean, because (as you've discovered) neither is the markup language.
The legal title characters, by default, are " %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+". Therefore, a reasonable PCRE regex (with no string escaping of backslashes et al) might be:
/\[\[Category:([ %!"$&'()*,\-.\/0-9:;=?@A-Z\\^_`a-z~\x80-\xFF+]+)\]\]/i
Which would give you the category name. Hopefully. In most cases. Because other things can make a title invalid. Basically, it's a mess, and the only surefire way around it is using MediaWiki itself to do the parsing.
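For illustration, here is a rough Python adaptation of that regex. It is a sketch, not what MediaWiki does. It assumes optional whitespace around "Category" and the colon is acceptable, that a "|sort key" may follow the name, and that widening \x80-\xFF to any code point at or above U+0080 is a fair equivalent when matching decoded text instead of raw bytes. The names TITLE_CHARS, CATEGORY_RE and extract_categories are made up for this example. It does not handle nowiki sections, comments, or categories added by templates.

```python
import re

# Character class adapted from the default legal title characters.
# Assumption: \x80-\xFF matched raw UTF-8 bytes in PCRE; for Python str
# we widen it to every code point from U+0080 upward.
TITLE_CHARS = r""" %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\u0080-\U0010FFFF+"""

# "[[" must be followed (after optional whitespace) by "Category", so a
# leading colon as in [[:Category:Foo]] never matches. An optional
# "|sort key" is skipped and not captured.
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([" + TITLE_CHARS + r"]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext):
    """Return the category names that the wikitext assigns (not links to)."""
    return [m.group(1) for m in CATEGORY_RE.finditer(wikitext)]


if __name__ == "__main__":
    sample = (
        "Intro [[Category:Physics]] see [[:Category:Chemistry]] "
        "and [[ Category : Living people |Smith, John]]"
    )
    print(extract_categories(sample))
```

The lazy quantifier on the name lets trailing spaces fall outside the captured group, since a space is itself a legal title character.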
Can't seem to find it. Searching for it just gives me Wikipedia articles on parsing!
Basically: install MediaWiki and hack it to do what you want. The parser is not well-defined or application-independent. You can make your own simplified parser, but it *will* fail in corner-cases (unless your simplifications consist of hacking out stuff you don't need and inlining stuff from other files).
Just to further clarify what I am looking for: let's say I want to print out a copy of Wikipedia (I know that's insane, but I need the text to be as clean as if I were printing it out), with the articles indexed by title and category. How would I get that data?
Use your own script to parse the XML part of the dumps and reformat them how you like. Then run each article's text through the parser, and grab the category list it spits back. It will not be very easy.
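As a starting point for the first step only (reading the dump, not parsing wikitext), a minimal Python sketch might look like this. It assumes the dump follows the usual export layout of page elements containing a title and a revision with a text element. The namespace URI in the sample is only illustrative, since it varies by dump version, so the code ignores namespaces entirely. The names local_name, iter_pages and SAMPLE are invented for this example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) per <page>, streaming so the whole dump
    never has to fit in memory. `source` is a filename or binary file."""
    title, text = "", ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # drop the page's children to free memory


# Tiny stand-in for a real dump (assumed structure, illustrative namespace).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Example</title>
    <revision>
      <text>Body text [[Category:Samples]]</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(wikitext))
```

Each yielded wikitext would then still need to go through the MediaWiki parser to get a reliable category list and clean text.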