Hello,
I need some help. I have to classify the wikilinks in a Wikipedia article based on their relative position in the article (ideally on the rendered page). For each wikilink I would like to record something like its position in the text (ascending within each section), whether it is in an infobox, and whether it is in a navbox. I need this classification for a specific revision of every article in the English Wikipedia in the main (zero) namespace.

I first tried to do it by parsing the wikitext, but there are problems with expanding the templates. For example, if a template is embedded with parameters and/or conditions, it is difficult to know what exactly gets rendered. I also tried some of the parsers from https://www.mediawiki.org/wiki/Alternative_parsers that claim to handle templates, but they did not work out, mainly due to the same problems I had when parsing the wikitext myself.

Now I am considering parsing the HTML of a Wikipedia article instead. I tried the MediaWiki API (https://www.mediawiki.org/wiki/API:Parsing_wikitext) to retrieve the HTML for an article and parse it myself, but the API is very slow for previous revisions of an article, so it would take me forever.

My question has two parts:
1. What is the fastest way to get the HTML of an article for a specific revision, or what is the best tool for setting up a local copy of Wikipedia (currently I am experimenting with XOWA and WikiTaxi)?
2. Is anybody aware of an HTML parser for Wikipedia articles that could provide, e.g., the position of each link, or a classification of the links by their position in the text (per section), whether a link is in an infobox, and whether it is in a navbox?
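For reference, this is roughly how I call the API I mentioned above to get the rendered HTML of one specific revision. It is only a minimal sketch, assuming Python with the requests library; the parameters follow the standard action=parse call with an oldid, and the User-Agent string is just a placeholder:

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    def fetch_rendered_html(oldid):
        """Fetch the parsed HTML of a specific revision via action=parse."""
        params = {
            "action": "parse",
            "oldid": oldid,          # revision ID instead of a page title
            "prop": "text",
            "format": "json",
            "formatversion": 2,      # makes parse.text a plain string
        }
        headers = {"User-Agent": "link-position-research-sketch"}
        response = requests.get(API_URL, params=params, headers=headers, timeout=30)
        response.raise_for_status()
        return response.json()["parse"]["text"]

    # Example: html = fetch_rendered_html(123456789)

This works, but one request per revision is exactly what turned out to be too slow for my purposes, hence the question about a faster route or a local copy.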
If you think there is a better way to get a classification of the links regarding their position than to parse the html of an article please let me know.
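To make the desired output concrete, here is a sketch of the kind of per-link classification I have in mind, done over the parsed HTML. It assumes Python with BeautifulSoup; the "infobox" and "navbox" class names and the heading-based section detection are assumptions about how the rendered markup typically looks, not a finished solution:

    from collections import defaultdict
    from bs4 import BeautifulSoup

    def classify_wikilinks(html):
        """Classify each wikilink by section, position within the section,
        and whether it sits inside an infobox or a navbox."""
        soup = BeautifulSoup(html, "html.parser")
        position_in_section = defaultdict(int)
        results = []

        for link in soup.find_all("a", href=True):
            href = link["href"]
            # Rough namespace-zero filter: internal links without a "Namespace:" prefix.
            if not href.startswith("/wiki/") or ":" in href[len("/wiki/"):]:
                continue

            # Section = nearest preceding heading; links before the first heading
            # are counted as the lead section.
            heading = link.find_previous(["h2", "h3", "h4"])
            section = heading.get_text(strip=True) if heading else "__lead__"
            position_in_section[section] += 1

            results.append({
                "target": href[len("/wiki/"):],
                "section": section,
                "position_in_section": position_in_section[section],
                "in_infobox": link.find_parent(class_="infobox") is not None,
                "in_navbox": link.find_parent(class_="navbox") is not None,
            })
        return results

The class-name checks are only heuristics based on the usual English Wikipedia templates, so if there is a more robust or established parser for this, that is exactly what I am looking for.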
Cheers,
Dimi
GESIS - Leibniz Institute for the Social Sciences
GESIS Cologne
da|ra - Registration Agency for Social and Economic Data
Unter Sachsenhausen 6-8
D-50667 Cologne
Tel: +49 221 47694 512