Dear Wikipedia researchers,
WikiPrep is a preprocessing script written in Perl that takes an XML dump of Wikipedia and makes explicit some information that is only implicitly present there. In particular, it performs the following tasks:
1) Substitutes templates in article texts
2) Builds a hierarchy of categories (i.e., for each category, it collects the ids of its immediate descendants)
3) Identifies related articles based on contextual clues
4) Resolves link redirection and dumps additional information that allows one to easily build a link graph for the entire Wikipedia snapshot
5) Computes statistics about categories and links
6) Collects the anchor text associated with links pointing at each article
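
To give a feel for what redirect resolution and link-graph construction (task 4) involve, here is a minimal sketch in Python. It is not WikiPrep's actual code (which is Perl) and does not reflect its output format; the function names `resolve_redirect` and `build_link_graph`, and the toy dictionaries of titles, are assumptions made purely for illustration.

```python
def resolve_redirect(title, redirects):
    """Follow redirects until a non-redirect title is reached.

    `redirects` maps a redirect page title to its target title
    (hypothetical input format). A `seen` set guards against cycles.
    """
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def build_link_graph(links, redirects):
    """Map each source article to the set of its resolved link targets.

    `links` maps an article title to the list of titles it links to
    (hypothetical input format).
    """
    graph = {}
    for source, targets in links.items():
        graph[source] = {resolve_redirect(t, redirects) for t in targets}
    return graph


if __name__ == "__main__":
    redirects = {"US": "USA", "USA": "United States"}
    links = {"Canada": ["US", "Mexico"]}
    print(build_link_graph(links, redirects))
```

Resolving every link target first means that links written against different redirect titles end up pointing at the same node, which is what makes the resulting graph usable for the whole snapshot.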
WikiPrep is distributed under the terms of the GNU General Public License, version 2, and is available at http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep.
Regards,
Evgeniy.