On 31.08.2016 22:14, Sumit Asthana wrote:
Hi, I've written code to scrape the Wikidata dump following the Wikidata Toolkit examples.
In processItemDocument, I have extracted the target entityId for the property 'instanceof' for the current item. However, I'm unable to find a way to get the label of the target entity, given that I have the entityId but not the entityDocument. Help would be appreciated :)
When you process a dump, you don't have random access to the data of all entities -- you just get to see them in order. Depending on your situation, there are several ways to go forward:
(1) You can use the Wikidata Toolkit API support to query the labels from Wikidata. This can be done in bulk at the end of the dump processing (fewer requests, since you can ask for many labels at once), or you can do it each time you need a label (more requests, slower, but easiest to implement). In the latter case, you should probably cache labels locally in a hashmap or similar to avoid repeated requests.
This solution works well if you need a small or medium number of labels. Otherwise, the API requests will take too long to be practical. Moreover, this solution will give you *current* labels from Wikidata. If you want to make sure that the labels come from a revision similar to your dump data (e.g., for historic analyses), then you must get them from the dump, not from the Web.
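For illustration, here is a rough sketch of (1) using the Toolkit's web API support (WikibaseDataFetcher, as in the Toolkit's online-data fetching example); method names are from memory and may differ slightly between WDTK versions, and the class name and example ids are only placeholders:

import java.util.Arrays;
import java.util.Collections;
import java.util.Map;

import org.wikidata.wdtk.datamodel.interfaces.EntityDocument;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.wikibaseapi.WikibaseDataFetcher;

public class LabelFetchSketch {
  public static void main(String[] args) throws Exception {
    WikibaseDataFetcher fetcher = WikibaseDataFetcher.getWikidataDataFetcher();
    // Only fetch English labels to keep the responses small:
    fetcher.getFilter().setLanguageFilter(Collections.singleton("en"));

    // Ask for several entities in one request (bulk is much cheaper than
    // one request per id):
    Map<String, EntityDocument> result = fetcher.getEntityDocuments(
        Arrays.asList("Q5", "Q515", "Q11424"));
    for (Map.Entry<String, EntityDocument> entry : result.entrySet()) {
      if (entry.getValue() instanceof ItemDocument) {
        ItemDocument item = (ItemDocument) entry.getValue();
        System.out.println(entry.getKey() + "\t" + item.findLabel("en"));
      }
    }
  }
}

For the "one request per label" variant, you would wrap getEntityDocument() in a small method that checks a HashMap cache first.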
(2) If you need large amounts of labels (in the order of millions), then Web requests will not be practical. In this case, the easiest solution is to process the dump twice: in the first pass you collect all qids that you care about; in the second you gather their labels. This takes twice the time, but it is very scalable: it will work for all data sizes (provided you can store the qids/labels while your program is running; if your local memory is very limited, you will need to use a database for this, which would slow things down further).
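A bare-bones sketch of the two passes (the selection logic in pass 1 is left as a placeholder, and each processor would be registered with a DumpProcessingController and run over the dump, as in the Toolkit's example programs; class names are only illustrative):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;

public class TwoPassLabelCollector {
  final Set<String> neededIds = new HashSet<>();      // filled in pass 1
  final Map<String, String> labels = new HashMap<>(); // filled in pass 2

  /** Pass 1: record the qids whose labels we will need later. */
  class IdCollector implements EntityDocumentProcessor {
    @Override
    public void processItemDocument(ItemDocument itemDocument) {
      // ... inspect the statements and add the relevant target qids:
      // neededIds.add(targetQid);
    }
    @Override
    public void processPropertyDocument(PropertyDocument propertyDocument) {}
  }

  /** Pass 2: pick up the labels of exactly those qids. */
  class LabelCollector implements EntityDocumentProcessor {
    @Override
    public void processItemDocument(ItemDocument itemDocument) {
      String id = itemDocument.getItemId().getId();
      if (neededIds.contains(id)) {
        labels.put(id, itemDocument.findLabel("en"));
      }
    }
    @Override
    public void processPropertyDocument(PropertyDocument propertyDocument) {}
  }
}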
(1+2) You can also combine (1) and (2): do a single pass; remember all ids that you need labels for; whenever such an id shows up in the dump, store its label; for ids that you did not find this way (because they occurred before you knew you needed them), do Web API queries after the dump processing.
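The only new piece compared to the sketches above is the fill-in step after the dump run; a rough version (again assuming WikibaseDataFetcher, with the class name and batching details as illustrative assumptions):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.wikidata.wdtk.datamodel.interfaces.EntityDocument;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.wikibaseapi.WikibaseDataFetcher;

public class LabelFillIn {
  /**
   * Fetch labels for all needed ids that the single dump pass did not catch
   * (they occurred before we knew we needed them).
   */
  public static void fillMissingLabels(Set<String> neededIds,
      Map<String, String> labels, WikibaseDataFetcher fetcher) throws Exception {
    List<String> missing = new ArrayList<>();
    for (String id : neededIds) {
      if (!labels.containsKey(id)) {
        missing.add(id);
      }
    }
    // The wbgetentities API accepts at most 50 ids per request, so query in batches:
    for (int i = 0; i < missing.size(); i += 50) {
      List<String> batch = missing.subList(i, Math.min(i + 50, missing.size()));
      for (Map.Entry<String, EntityDocument> e : fetcher.getEntityDocuments(batch).entrySet()) {
        if (e.getValue() instanceof ItemDocument) {
          labels.put(e.getKey(), ((ItemDocument) e.getValue()).findLabel("en"));
        }
      }
    }
  }
}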
(3) If you need to run such analyses a lot, you could also build up a label database locally: just write a small program that processes the dump and stores the label(s) for each id in an on-disk database. Then your actual program can get the labels from this database rather than asking the API. If your label set is not too large, you can also store the labels in a file that you load into memory when you need it. In fact, for the case of "class" items (things with an incoming P31 link), you can find such a file online:
http://tools.wmflabs.org/sqid/data/classes.json
It contains some additional information, but also all English labels. The file is about 26M, so quite manageable.
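A sketch of the file-based variant of (3): one dump pass writes a simple id/label file, which later runs load back into memory (file format, method names, and class name are just illustrative choices; labels containing tabs would need escaping in a real program):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;

/** Dump pass: write one "Qxx<TAB>English label" line per item. */
public class LabelFileWriter implements EntityDocumentProcessor {
  final PrintWriter out;

  public LabelFileWriter(PrintWriter out) {
    this.out = out;
  }

  @Override
  public void processItemDocument(ItemDocument itemDocument) {
    String label = itemDocument.findLabel("en");
    if (label != null) {
      out.println(itemDocument.getItemId().getId() + "\t" + label);
    }
  }

  @Override
  public void processPropertyDocument(PropertyDocument propertyDocument) {}

  /** Later runs: load the file into a map instead of asking the API. */
  public static Map<String, String> loadLabels(String fileName) throws IOException {
    Map<String, String> labels = new HashMap<>();
    try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
      String line;
      while ((line = in.readLine()) != null) {
        int tab = line.indexOf('\t');
        if (tab > 0) {
          labels.put(line.substring(0, tab), line.substring(tab + 1));
        }
      }
    }
    return labels;
  }
}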
(4) If the items that you need labels for can be described easily (e.g., "all items with incoming P31 links") and are not too many (e.g., around 100,000), then you can use SPARQL to get all labels at once. This may (sometimes) time out if the result set is big. For example, the following query gets you all P31 targets together with the number of their direct "best rank" instances:
SELECT ?cl ?clLabel ?c WHERE {
  { SELECT ?cl (count(*) as ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
Do *not* run this in your browser! There are too many results to display. Use the query service API programmatically instead. This query times out in up to half of the cases, but so far I could always get it to return a complete result after a few attempts (you have to wait at least 60 seconds before trying again).
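A minimal way to do that from Java is plain HTTP against the public SPARQL endpoint (the class name is just a placeholder, and you would add retry/wait logic around the request for the timeout case):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SparqlLabelQuery {
  public static void main(String[] args) throws Exception {
    String query = "SELECT ?cl ?clLabel ?c WHERE { "
        + "{ SELECT ?cl (count(*) as ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl } "
        + "SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\" . } }";

    URL url = new URL("https://query.wikidata.org/sparql?query="
        + URLEncoder.encode(query, "UTF-8"));
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    // Ask for JSON results; CSV/TSV are available via other Accept headers.
    connection.setRequestProperty("Accept", "application/sparql-results+json");
    connection.setRequestProperty("User-Agent", "label-collector-example/0.1");

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // parse with a JSON library in real code
      }
    }
  }
}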
My applications now do a single pass in WDTK for only the "hard" things, and then complete the output file using (4), with a Python script filling in the labels. If the Python script's query does not time out, updating all labels takes less than a minute this way. We had an implementation of (1+2) at some point, but it was more complicated to program and less efficient in this case. We did not have a reason to do (3), since we process each dump only once, so the effort of creating a label file does not pay off compared to (2).
Best regards,
Markus