Hi all, I am trying to use the Wikidata Toolkit to extract interlanguage links for certain pages from Wikipedia.
So far, I've made several attempts based on the code provided in SitelinksExample (https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-examples/src/m...), without success. I've realized that this is likely not the correct approach.
Ideally I'd like to do this while processing a local file; I've downloaded a pages-meta-current.xml.bz2 file, but I can't really get my head around how to proceed. Any pointers are appreciated.
Best, Alan
2015-04-17 15:18 GMT+02:00 Alan Said <alansaid@acm.org>:
Are you the same person who asked on Project chat?
If you want to use the API, you may find the Pywikibot help page useful: https://www.mediawiki.org/wiki/Manual:Pywikibot/Wikidata. Hope it helps.
Hi Alan,
The SitelinksExample shows how to get the basic language-links data. In Wikidata, sites are encoded by IDs such as "enwiki" or "frwikivoyage". To find out what they mean in terms of URLs, you need to get the interlanguage information first. The example shows you how to do this.
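For illustration, here is a minimal sketch of getting the sites information, roughly along the lines of SitelinksExample; the class and method names are taken from the WDTK API as I recall it, so double-check them against the example:

// Minimal sketch: obtain the sites information so that site ids such as
// "enwiki" can be resolved to URLs. Method names may differ slightly in
// your WDTK version; compare with SitelinksExample.
import org.wikidata.wdtk.datamodel.interfaces.Sites;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

public class SitesInfoSketch {
    public static void main(String[] args) throws Exception {
        DumpProcessingController controller =
                new DumpProcessingController("wikidatawiki");
        // Downloads (or reuses) the sites table dump and parses it:
        Sites sites = controller.getSitesInformation();
        // Now site ids can be turned into article URLs:
        System.out.println(sites.getPageUrl("enwiki", "Douglas Adams"));
    }
}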
The site link information for a particular item can be found in the ItemDocument for that item. There are two ways of getting an ItemDocument:
(1) You process the dump file and handle all items one by one (in the order in which they appear in the dump). This is best if you want to look at very many items, or if you must work completely offline.
(2) You fetch individual items from the web API (random access). This is best if you only need the links for a few selected items (fetching hundreds from the API is quick; fetching millions is infeasible).
You can find many examples of doing things along the lines of (1) with WDTK. For (2), see the example FetchOnlineDataExample (so far this is only part of the development version v0.5.0, which you can find on GitHub).
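A rough sketch of what (1) could look like follows; it uses the pattern from the bundled examples (an EntityDocumentProcessor registered with a DumpProcessingController), but treat it as untested sketch code rather than something copied from the repository:

// Rough sketch of approach (1): stream all items from the most recent JSON
// dump and print the sitelinks of the items you care about.
import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.datamodel.interfaces.SiteLink;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

public class SitelinkDumpSketch implements EntityDocumentProcessor {

    @Override
    public void processItemDocument(ItemDocument itemDocument) {
        // Filter here for the items you are interested in, e.g. by id:
        if ("Q42".equals(itemDocument.getItemId().getId())) {
            for (SiteLink link : itemDocument.getSiteLinks().values()) {
                System.out.println(link.getSiteKey() + " -> " + link.getPageTitle());
            }
        }
    }

    @Override
    public void processPropertyDocument(PropertyDocument propertyDocument) {
        // Not needed for sitelinks.
    }

    public static void main(String[] args) throws Exception {
        DumpProcessingController controller =
                new DumpProcessingController("wikidatawiki");
        // null site filter = all sites; true = only current revisions:
        controller.registerEntityDocumentProcessor(
                new SitelinkDumpSketch(), null, true);
        // Downloads and processes the most recent JSON dump:
        controller.processMostRecentJsonDump();
    }
}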
In either case, you can directly read out any sitelink from the ItemDocument object. It will give you the article title, the site id ("enwiki" etc.), and the list of badges (if any). To turn this into a URL, you would use code as in the SitelinksExample.
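And a similarly rough sketch of (2), combining the fetcher used in FetchOnlineDataExample with the sites information; again, the exact method names in the v0.5.0 development version may differ slightly:

// Rough sketch of approach (2): fetch one item via the web API and turn its
// sitelinks into URLs via the sites information.
import org.wikidata.wdtk.datamodel.interfaces.EntityDocument;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.SiteLink;
import org.wikidata.wdtk.datamodel.interfaces.Sites;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;
import org.wikidata.wdtk.wikibaseapi.WikibaseDataFetcher;

public class SitelinkFetchSketch {
    public static void main(String[] args) throws Exception {
        WikibaseDataFetcher fetcher = new WikibaseDataFetcher();
        EntityDocument doc = fetcher.getEntityDocument("Q42");

        // Sites information is still needed to build URLs, as in SitelinksExample:
        Sites sites = new DumpProcessingController("wikidatawiki")
                .getSitesInformation();

        if (doc instanceof ItemDocument) {
            for (SiteLink link : ((ItemDocument) doc).getSiteLinks().values()) {
                System.out.println(link.getSiteKey() + ": "
                        + sites.getSiteLinkUrl(link));
            }
        }
    }
}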
Cheers,
Markus
Hi Markus et al. Thank you for the answer. I have a few follow-up questions as I'm not quite grasping the toolkit.
Alternative 1: If I'd like to do (1), I need a dump file. I've downloaded a *-current dump (http://dumps.wikimedia.org/wikidatawiki/20150330/wikidatawiki-20150330-pages...) and am trying to process it using the DumpProcessingController class, which I'm assuming is the wrong way to go about this. Is there a guide on how to parse local dumps?
Alternative 2: I've been looking at FetchOnlineDataExample, and this seems to do pretty much what I need, except that I want to retrieve interlanguage links for a page given the title rather than the id. Is this possible, or is there a way to get the entity id given a page title in a given language?
Thanks
Alan
On 21.04.2015 17:31, Alan Said wrote:
Hi Markus et al. Thank you for the answer. I have a few follow-up questions as I'm not quite grasping the toolkit.
Alternative 1: If I'd like to do (1), I need a dump file. I've downloaded a *-current dump (http://dumps.wikimedia.org/wikidatawiki/20150330/wikidatawiki-20150330-pages...) and am trying to process it using the DumpProcessingController class, which I'm assuming is the wrong way to go about this. Is there a guide on how to parse local dumps?
Right now, the main way of using WDTK is to have it download the dumps automatically; they are then stored in the place where the program looks for them. As you know, using arbitrary local files in other places will hopefully be supported soon:
https://github.com/Wikidata/Wikidata-Toolkit/issues/136
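Until that issue is resolved, about the closest you can get is to control the directory in which WDTK stores and looks for dumps, and to switch off downloading. A hedged sketch follows (setDownloadDirectory and setOfflineMode are the methods used in the bundled ExampleHelpers; note that a manually placed dump would still have to sit in the subdirectory layout WDTK expects, which I am not reproducing here because I have not verified it):

// Hedged sketch: keep WDTK's dump storage in the current working directory
// and avoid downloads. A manually downloaded dump is only picked up if it
// sits in the subdirectory layout WDTK expects (see the dumpfiles package).
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

public class LocalDumpSketch {
    public static void main(String[] args) throws Exception {
        DumpProcessingController controller =
                new DumpProcessingController("wikidatawiki");
        controller.setDownloadDirectory(System.getProperty("user.dir"));
        controller.setOfflineMode(true); // only use locally available dumps
        // Register a processor and call controller.processMostRecentJsonDump()
        // as usual; WDTK will then look for dumps below the directory set above.
    }
}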
Alternative 2: I've been looking at FetchOnlineDataExample, and this seems to do pretty much what I need, except that I want to retrieve interlanguage links for a page given the title rather than the id. Is this possible, or is there a way to get the entity id given a page title in a given language?
This is another small extension that is currently up for grabs:
https://github.com/Wikidata/Wikidata-Toolkit/issues/138
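As a stopgap until that is implemented, you could query the wbgetentities module of the Wikidata web API directly: it accepts a site id and a page title and returns the matching entity. The sketch below bypasses WDTK and just uses Jackson (which WDTK already depends on); "Douglas Adams" and the enwiki site are placeholders:

// Hedged sketch: resolve a page title on a given wiki to a Wikidata entity id
// by calling the wbgetentities API module directly.
import java.net.URL;
import java.net.URLEncoder;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TitleToIdSketch {
    public static void main(String[] args) throws Exception {
        String title = URLEncoder.encode("Douglas Adams", "UTF-8");
        URL url = new URL("https://www.wikidata.org/w/api.php"
                + "?action=wbgetentities&sites=enwiki&titles=" + title
                + "&props=sitelinks&format=json");

        JsonNode root = new ObjectMapper().readTree(url);
        // The response has the form {"entities": {"Q42": {...}}}; the field
        // name under "entities" is the entity id we are after.
        String entityId = root.get("entities").fieldNames().next();
        System.out.println(entityId);
    }
}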
Cheers,
Markus