Hello,
I need some help. I have to classify the wikilinks in a Wikipedia article based on their relative position in the article (ideally their position on the rendered page). For each wikilink I would like to have something like its position in the text (ascending within each section), whether it is in an infobox, and whether it is in a navbox. I need this classification for a specific revision of every article in the zero (main) namespace of the English Wikipedia.

I tried to do this by parsing the wikitext, but there are problems with expanding the templates. For example, if a template is transcluded with parameters and/or with conditions, it is difficult to know what exactly is rendered. I tried some of the parsers from https://www.mediawiki.org/wiki/Alternative_parsers that claim to handle templates, but they did not work out, mainly due to the same problems I had when parsing the wikitext myself.

Now I am considering parsing the HTML of a Wikipedia article instead. I also tried the MediaWiki API (https://www.mediawiki.org/wiki/API:Parsing_wikitext) to retrieve the HTML for an article and parse it myself, but the API is very slow for previous revisions of an article and it would take me forever.

My question has two parts:
1. What is the fastest way to get the HTML of an article for a specific revision, or what is the best tool to set up a local copy of Wikipedia (currently I am experimenting with Xowa and Wikitaxi)?
2. Is anybody aware of an HTML Wikipedia parser that could provide e.g. the position of a link, or a classification of the links regarding their position in the text (within each section), whether a link is in an infobox, and whether it is in a navbox?
If you think there is a better way to classify the links by their position than parsing the HTML of an article, please let me know.
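To make the desired output a bit more concrete, this is roughly the per-link record I have in mind (just a sketch in Python; the field names and values are only illustrative):

```python
# One record per wikilink; field names are only illustrative.
example_record = {
    "target": "Albert Einstein",      # link target
    "section": "Early life",          # heading of the section the link appears under
    "position_in_section": 3,         # e.g. the 3rd link within that section
    "in_infobox": False,              # rendered inside an infobox?
    "in_navbox": False,               # rendered inside a navbox?
}
```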
Cheers Dimi
GESIS - Leibniz Institute for the Social Sciences
GESIS Cologne
da|ra - Registration Agency for Social and Economic Data
Unter Sachsenhausen 6-8, D-50667 Cologne
Tel: +49 221 47694 512
On Wed, Sep 30, 2015 at 3:35 AM, Dimitrov, Dimitar <Dimitar.Dimitrov@gesis.org> wrote:
- What is the fastest way to get the HTML of an article for a specific revision, or what is the best tool to set up a local copy of Wikipedia (currently I am experimenting with Xowa and Wikitaxi)?
You can use the REST API to fetch article html by revision (see: https://en.wikipedia.org/api/rest_v1/?doc).
For example: https://en.wikipedia.org/api/rest_v1/page/html/Main%20Page/664887982
The output this produces is generated by Parsoid (see: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec).
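As a rough, untested sketch (assuming Python 3 and the requests library), fetching the HTML for a given title and revision could look like this:

```python
import requests

REST_BASE = "https://en.wikipedia.org/api/rest_v1"

def fetch_html(title, revision):
    """Fetch the Parsoid-generated HTML for a specific revision of an article."""
    # Titles go into the path URL-encoded, e.g. "Main Page" -> "Main%20Page".
    url = "{}/page/html/{}/{}".format(REST_BASE, requests.utils.quote(title, safe=""), revision)
    # Use a descriptive User-Agent with your contact details (placeholder below).
    resp = requests.get(url, headers={"User-Agent": "link-position-study (you@example.org)"})
    resp.raise_for_status()
    return resp.text

# The example from above: revision 664887982 of the Main Page.
html = fetch_html("Main Page", 664887982)
```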
https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi also gives a nice interface to walk a document's structure, including recursing into template arguments, etc. It could be made much faster by fetching content from RESTBase.
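If you end up working from the raw HTML instead of the jsapi, a rough (untested) Python sketch of the classification might look like the following. It assumes BeautifulSoup, assumes Parsoid marks wikilinks with rel="mw:WikiLink", and assumes infoboxes/navboxes keep the usual "infobox"/"navbox" CSS classes from the English Wikipedia's templates, so it is worth verifying those against real output:

```python
from collections import defaultdict
from bs4 import BeautifulSoup

def classify_links(html):
    """Classify wikilinks by per-section position and infobox/navbox membership."""
    soup = BeautifulSoup(html, "html.parser")
    position = defaultdict(int)   # running link counter per section
    records = []
    for a in soup.find_all("a", rel="mw:WikiLink"):
        # Detect infobox/navbox membership by walking up the ancestor chain.
        classes = [c for parent in a.parents for c in (parent.get("class") or [])]
        in_infobox = any("infobox" in c for c in classes)
        in_navbox = any("navbox" in c for c in classes)
        # Approximate the section by the nearest preceding <h2> heading.
        heading = a.find_previous("h2")
        section = heading.get_text(strip=True) if heading else "(lead)"
        position[section] += 1
        records.append({
            "href": a.get("href"),
            "section": section,
            "position_in_section": position[section],
            "in_infobox": in_infobox,
            "in_navbox": in_navbox,
        })
    return records
```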
Note that links generated by templates are a sort of special case. Do you want only links which appear in the *arguments* to the template? Or do you want links that are contained in the template itself? These cases are slightly different. --scott
On Wed, Sep 30, 2015 at 9:44 AM, Eric Evans <eevans@wikimedia.org> wrote:
On Wed, Sep 30, 2015 at 3:35 AM, Dimitrov, Dimitar <Dimitar.Dimitrov@gesis.org> wrote:
- What is the fastest way to get the HTML of an article for a specific revision, or what is the best tool to set up a local copy of Wikipedia (currently I am experimenting with Xowa and Wikitaxi)?
You can use the REST API to fetch article html by revision (see: https://en.wikipedia.org/api/rest_v1/?doc).
For example: https://en.wikipedia.org/api/rest_v1/page/html/Main%20Page/664887982
The output this produces is generated by Parsoid (see: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec).
--
Eric Evans
eevans@wikimedia.org