Hi Markus et al.
Thank you for the answer. I have a few follow-up questions as I'm not quite grasping the toolkit.

Alternative 1:
So, if I'd like to do (1), I need a dump file. I've downloaded a *-current dump (http://dumps.wikimedia.org/wikidatawiki/20150330/wikidatawiki-20150330-pages-meta-current.xml.bz2) and am trying to process it using the DumpProcessingController class, which I'm assuming is the wrong way to go about this.
Is there a guide on how to parse local dumps?

Alternative 2:
I've been looking at the FetchOnlineDataExample and this seems to do pretty much what I need, except that I need to retrieve the interlanguage links for a page given the entity title, not the id. Is this possible, or is there a way to get the entity id given a page title in a given language?

Thanks

Alan

-- 
Alan Said
Recorded Future
t: @alansaid

On Fri, Apr 17, 2015 at 5:17 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Hi Alan,

The SitelinksExample shows how to get the basic language-links data. In Wikidata, sites are encoded by IDs such as "enwiki" or "frwikivoyage". To find out what they mean in terms of URLs, you need to get the interlanguage information first. The example shows you how to do this.

The site link information for a particular item can be found in the ItemDocument for that item. There are two ways of getting an ItemDocument:

(1) You process the dump file to process all items one by one (in the order in which they appear in the dump). This is best if you want to look at very many items, or if you must work completely in offline mode.
(2) You fetch individual items from the Web API (random access). This is best if you only need the links for a few selected items (fetching hundreds from the API is quick; fetching millions is infeasible).
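In code, option (1) boils down to roughly the following. This is only a sketch, assuming the DumpProcessingController API as used in WDTK's bundled examples (check the method names against the version you actually use); the registered processor is called once per entity in the dump:

```java
// Sketch: iterating over all entities of a Wikidata dump with WDTK.
// Assumes the WDTK jars are on the classpath.
import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

public class ProcessDumpSketch {
    public static void main(String[] args) {
        DumpProcessingController controller =
                new DumpProcessingController("wikidatawiki");
        // Offline mode: only use dump files already present in the
        // download directory; do not fetch anything from the web.
        controller.setOfflineMode(true);

        // This processor is called once for every entity, in dump order.
        controller.registerEntityDocumentProcessor(
                new EntityDocumentProcessor() {
                    @Override
                    public void processItemDocument(ItemDocument itemDocument) {
                        // e.g., inspect itemDocument.getSiteLinks() here
                    }

                    @Override
                    public void processPropertyDocument(
                            PropertyDocument propertyDocument) {
                        // properties are not needed for language links
                    }
                }, null, true);

        controller.processMostRecentMainDump();
    }
}
```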

You can find many examples for doing things along the lines of (1) with WDTK. For (2), see the example FetchOnlineDataExample (so far this is only part of the development version of v0.5.0, which you can find on GitHub).
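The core of option (2) is just a few lines. Again only a sketch, assuming the fetcher API of the v0.5.0 development version ("Q42" is an arbitrary example item id):

```java
// Sketch: fetching a single item from the Wikidata web API with WDTK.
// Assumes the WDTK jars (development version of v0.5.0) are on the classpath.
import org.wikidata.wdtk.datamodel.interfaces.EntityDocument;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.wikibaseapi.WikibaseDataFetcher;

public class FetchItemSketch {
    public static void main(String[] args) throws Exception {
        WikibaseDataFetcher fetcher = new WikibaseDataFetcher();
        // Random access by entity id; needs a working network connection.
        EntityDocument doc = fetcher.getEntityDocument("Q42");
        if (doc instanceof ItemDocument) {
            System.out.println(((ItemDocument) doc).getSiteLinks());
        }
    }
}
```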

In either case, you can directly read out any sitelink from the ItemDocument object. It will give you the article title, the site id ("enwiki" etc.), and the list of badges (if any). To turn this into a URL, you would use code as in the SitelinksExample.
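Reading the links out of an ItemDocument is then just a matter of iterating its site-link map. A sketch, assuming a Sites object obtained as in SitelinksExample (e.g. from the sites table dump) is passed in:

```java
// Sketch: printing all language links of one item, resolved to URLs.
// Assumes the WDTK jars are on the classpath; `sites` must come from
// the site-links information, as shown in SitelinksExample.
import java.util.Map;

import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.SiteLink;
import org.wikidata.wdtk.datamodel.interfaces.Sites;

public class SiteLinkSketch {
    static void printSiteLinks(ItemDocument itemDocument, Sites sites) {
        for (Map.Entry<String, SiteLink> entry :
                itemDocument.getSiteLinks().entrySet()) {
            SiteLink link = entry.getValue();
            System.out.println(link.getSiteKey() + ": "
                    + link.getPageTitle() + " -> "
                    + sites.getSiteLinkUrl(link));
        }
    }
}
```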

Cheers,

Markus



On 17.04.2015 15:18, Alan Said wrote:
Hi all,
I am trying to use the Wikidata Toolkit to extract interlanguage links
for certain pages from Wikipedia.

So far, I've made several attempts based on the code provided in
SitelinksExample
(https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-examples/src/main/java/org/wikidata/wdtk/examples/SitelinksExample.java)
without any success. I've realized that this is likely not the correct
approach.

Optimally, I'd like to do this while processing a local file. I've
downloaded a pages-meta-current.xml.bz2 file, but I can't really get my
head around how to go ahead with this.
Any pointers are appreciated.

Best,
Alan

--
Alan Said
Recorded Future
e: alansaid@acm.org
t: @alansaid
w: www.alansaid.com


_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


