Dear Wikipedia researchers,
WikiPrep is a preprocessing script written in Perl that takes an XML dump of Wikipedia and makes explicit some information that is only implicit in it. In particular, it:
1) Substitutes templates in article texts
2) Builds a hierarchy of categories (i.e., for each category, it collects the ids of its immediate descendants)
3) Identifies related articles based on contextual clues
4) Resolves link redirection and dumps additional information that allows one to easily build a link graph for the entire Wikipedia snapshot
5) Computes statistics about categories and links
6) Collects the anchor text associated with links pointing at each article
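To make tasks 2, 4 and 6 more concrete, here is a minimal Python sketch. It is an illustration only and not WikiPrep's actual Perl code: the function names, the shape of the input tuples, the hop limit and the sample titles are all assumptions made for this example.

```python
from collections import defaultdict


def resolve_redirect(title, redirects, max_hops=10):
    """Follow a redirect chain to its final target.

    `redirects` maps a redirecting title to its target (assumed input).
    Stops on a cycle or after `max_hops` hops so it always terminates.
    """
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


def build_link_graph(links, redirects):
    """Build a redirect-free link graph and per-article anchor text counts.

    `links` is an iterable of (source, target, anchor_text) tuples
    (a hypothetical format, not WikiPrep's output format).
    """
    graph = defaultdict(set)                        # source -> resolved targets
    anchors = defaultdict(lambda: defaultdict(int))  # target -> anchor -> count
    for source, target, anchor in links:
        final = resolve_redirect(target, redirects)
        graph[source].add(final)
        anchors[final][anchor] += 1
    return graph, anchors


def category_children(child_parent_pairs):
    """Collect, for each category, the set of its immediate descendants."""
    children = defaultdict(set)
    for child, parent in child_parent_pairs:
        children[parent].add(child)
    return children


# Made-up sample data, for illustration only.
redirects = {"UK": "United Kingdom", "Britain": "UK"}
links = [
    ("London", "UK", "the UK"),
    ("London", "Britain", "Britain"),
    ("Paris", "France", "France"),
]
graph, anchors = build_link_graph(links, redirects)
hierarchy = category_children([("Physics", "Science"), ("Chemistry", "Science")])
```

In this sketch both links from "London" end up pointing at "United Kingdom" once the redirects are followed, and the two anchor texts are recorded against that article rather than against the redirect pages.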
WikiPrep is distributed under the terms of the GNU General Public License version 2, and is available at http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep.
Regards,
Evgeniy.
On 8/6/07, Evgeniy Gabrilovich <gabr@cs.technion.ac.il> wrote:
Dear Wikipedia researchers,
WikiPrep is a preprocessing script written in Perl that takes an XML dump of Wikipedia, and
This looks quite useful. Thanks!
Interesting work!
As a member of the steering committee for the WikiSym conference (www.wikisym.org), I would encourage you (and all wiki researchers on this list) to submit a paper to future editions of the conference.
It's unfortunately too late for the 2007 edition.
----
Alain Désilets, National Research Council of Canada
Chair, WikiSym 2007

2007 International Symposium on Wikis
Wikis at Work in the World: Open, Organic, Participatory Media for the 21st Century
http://www.wikisym.org/ws2007/
-----Original Message-----
From: wiki-research-l-bounces@lists.wikimedia.org [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Evgeniy Gabrilovich
Sent: August 6, 2007 6:00 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] Wikipedia preprocessing tool (wikiprep)
Dear Wikipedia researchers,
WikiPrep is a preprocessing script written in Perl that takes an XML dump of Wikipedia and makes explicit some information that is only implicit in it. In particular, it:
1) Substitutes templates in article texts
2) Builds a hierarchy of categories (i.e., for each category, it collects the ids of its immediate descendants)
3) Identifies related articles based on contextual clues
4) Resolves link redirection and dumps additional information that allows one to easily build a link graph for the entire Wikipedia snapshot
5) Computes statistics about categories and links
6) Collects the anchor text associated with links pointing at each article
WikiPrep is distributed under the terms of the GNU General Public License version 2, and is available at http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep.
Regards,
Evgeniy.
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wiki-research-l