Wikilink Classification and Parsing - Wikitech-l

30 Sep 2015

Hello,

I need some help. I have to classify the wikilinks in a Wikipedia article based on their
relative position in the article (ideally on the rendered page). For each wikilink I
would like to know its position in the text (ascending within each section), whether it
is in an infobox, and whether it is in a navbox. I need this classification for a specific
revision of every article in the main (zero) namespace of the English Wikipedia. I tried
to do this by parsing the wikitext, but I ran into problems expanding the templates.
For example, if a template is embedded with parameters and/or conditions, it is
difficult to know what exactly is rendered. I tried some of the parsers listed at
https://www.mediawiki.org/wiki/Alternative_parsers that claim to handle templates, but they
did not work out, mainly due to the same problems that I had parsing the wikitext myself. Now I
am considering parsing the HTML of a Wikipedia article. I also tried the MediaWiki API
(https://www.mediawiki.org/wiki/API:Parsing_wikitext) to retrieve the HTML of an
article and parse it myself, but the API is very slow for previous revisions of an article,
and at that rate covering every article would take far too long. My question has two parts:
1. What is the fastest way to get the HTML of an article for a specific revision, or what is
the best tool for setting up a local copy of Wikipedia? (Currently I am experimenting with
XOWA and WikiTaxi.)
2. Is anybody aware of an HTML parser for Wikipedia that could provide, e.g., the position of
each link, or a classification of the links by their position in the text (in each section),
whether a link is in an infobox, and whether it is in a navbox?

If you think there is a better way to classify the links by their position than
parsing the HTML of an article, please let me know.
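
As an illustration of the classification described above, here is a minimal Python sketch
that works on rendered HTML. It is not a tested solution, only a way to make the desired
output concrete. It rests on several assumptions: wikilinks are <a> elements whose href
starts with "/wiki/", infoboxes and navboxes are elements carrying the class "infobox" or
"navbox", and sections begin at <h2> headings. The names LinkClassifier and classify_links
are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LinkClassifier(HTMLParser):
    """Collects wikilinks with section, position, and infobox/navbox flags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, label) per open element; label is "infobox", "navbox", or None
        self.section = 0    # 0 = lead section, incremented at every <h2> (assumption)
        self.position = 0   # running link count inside the current section
        self.links = []

    def _inside(self, label):
        return any(open_label == label for _, open_label in self.stack)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h2":
            self.section += 1
            self.position = 0
        href = attrs.get("href") or ""
        # Assumption: internal links look like /wiki/Title. Links to other
        # namespaces (File:, Category:, ...) are not filtered out here.
        if tag == "a" and href.startswith("/wiki/"):
            self.position += 1
            self.links.append({
                "target": href[len("/wiki/"):],
                "section": self.section,
                "position": self.position,
                "in_infobox": self._inside("infobox"),
                "in_navbox": self._inside("navbox"),
            })
        if tag not in VOID:
            label = None
            if "infobox" in classes:
                label = "infobox"
            elif "navbox" in classes:
                label = "navbox"
            self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; this tolerates
        # elements such as <p> or <li> that were never closed explicitly.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def classify_links(html):
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling classify_links(html) on the HTML of one revision returns one dictionary per
wikilink. Links inside infoboxes and navboxes are counted in the per-section position as
well; if they should be excluded from the running count, the flags can be checked before
incrementing the counter.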

Cheers,
Dimi

GESIS - Leibniz Institute for the Social Sciences
GESIS Cologne
da|ra - Registration Agency for Social and Economic Data
Unter Sachsenhausen 6-8
D- 50667 Cologne
Tel: +49 221 47694 512