Dear Wikipedia researchers,
WikiPrep is a preprocessing script written in Perl that takes an XML dump of
Wikipedia and makes explicit some information that is only implicitly present
in it. In particular, it performs the following tasks:
1) Substituting templates in article texts
2) Building a hierarchy of categories (i.e., for each category, it collects
the ids of its immediate descendants)
3) Identifying related articles based on contextual clues
4) Resolving link redirection and dumping additional information that allows
one to easily build a link graph for the entire Wikipedia snapshot
5) Computing statistics about categories and links
6) Collecting the anchor text associated with links pointing at each article
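
To give a rough idea of what tasks 2, 4 and 6 involve, here is a small Python
sketch. It is not taken from WikiPrep (which is written in Perl) and does not
reflect its actual input or output formats. The toy data, the function names
(resolve, build_link_graph, invert_parents) and the use of titles instead of
numeric ids are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy data; this is NOT WikiPrep's input or output format.
# Redirects map a redirect title to its target title.
redirects = {"USA": "United States", "U.S.": "USA"}

# Each article lists the (link target, anchor text) pairs found in its text.
links = {
    "Canada": [("USA", "the States"), ("Ottawa", "Ottawa")],
    "Ottawa": [("Canada", "Canada"), ("U.S.", "American")],
    "United States": [("Canada", "Canada")],
}

# Each category lists its parent categories, as a category page would.
category_parents = {
    "Countries in North America": ["Countries"],
    "Capitals in North America": ["Capitals"],
    "Countries": [],
    "Capitals": [],
}


def resolve(title, redirects):
    """Follow a redirect chain to its final target, guarding against cycles."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def build_link_graph(links, redirects):
    """Return (outgoing links per article, anchor-text counts per target)."""
    graph = defaultdict(set)
    anchors = defaultdict(lambda: defaultdict(int))
    for source, outgoing in links.items():
        for target, anchor in outgoing:
            final = resolve(target, redirects)  # link now points at the real article
            graph[source].add(final)
            anchors[final][anchor] += 1
    return graph, anchors


def invert_parents(category_parents):
    """Turn child -> parents into parent -> immediate children."""
    children = defaultdict(set)
    for child, parents in category_parents.items():
        for parent in parents:
            children[parent].add(child)
    return children


graph, anchors = build_link_graph(links, redirects)
children = invert_parents(category_parents)
```

In this sketch, links to "USA" and "U.S." are both counted as links to
"United States", and their anchor texts are collected under that article.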
WikiPrep is distributed under the terms of the GNU General Public License
version 2 and is available at:
http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep
Regards,
Evgeniy.