Hi,
I am looking for the most efficient way of getting the following information out of WDQS:
* One language only (e.g. fr.wikipedia.org)
* All instances of human (e.g. of the abstraction: wd:Q9916|Dwight David Eisenhower|États-Unis|Dwight|Eisenhower|https://fr.wikipedia.org/wiki/Dwight_D._Eisenhower|militaire américain, président des États-Unis)
Let's say we have a list of all sovereign states (Q16, Q30, Q142, ...) and all letters of the requested language (French: a, b, c, ...); then we can automate requests and get a lot of results. Unfortunately, this is costly and inefficient: it takes about a day to complete.
SELECT ?person ?personLabel ?countryLabel ?givenNameLabel ?familyNameLabel ?article ?persondesc
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P27 wd:Q30;
          wdt:P27 ?country;
          wdt:P734 ?familyName;
          wdt:P735 ?givenName;
          rdfs:label ?personLabel.
  ?familyName rdfs:label ?familyNameLabel.
  ?country rdfs:label ?countryLabel.
  ?givenName rdfs:label ?givenNameLabel.
  ?person schema:description ?persondesc.
  FILTER(LANG(?personLabel) = "fr").
  FILTER(LANG(?familyNameLabel) = "en").
  FILTER(LANG(?countryLabel) = "fr").
  FILTER(LANG(?givenNameLabel) = "en").
  FILTER(LANG(?persondesc) = "fr").
  FILTER(STRSTARTS(?personLabel, "D")).
  FILTER(STRSTARTS(?familyNameLabel, "E")).

  ?article schema:about ?person;
           schema:inLanguage "fr";
           schema:isPartOf <https://fr.wikipedia.org/> .
}
ORDER BY ?familyNameLabel
https://query.wikidata.org/#SELECT%20%3Fperson%20%3FpersonLabel%20%3Fcountry...
Such a request takes an average of 20 seconds to complete.
Any help will be much appreciated. Thanks for your time.
Justin
Hi!
On 5/15/18 3:27 PM, Justin Maltais wrote:
> Hi,
> I am looking for the most efficient way of getting the following information out of WDQS:
> * One language only (e.g. fr.wikipedia.org)
> * All instances of human (e.g. of the abstraction: wd:Q9916|Dwight David Eisenhower|...)
> Let's say we have a list of all sovereign states (Q16, Q30, Q142, ...) and all letters of the requested language (French: a, b, c, ...), we can automate requests and get a lot of results. Unfortunately, it's costly and not efficient. It takes about a day to succeed.
The first thing I would like to ask is: please don't do that again. This created a significant load on the server, and the script completely ignored the throttling headers we sent; in the future we would ban such clients for extended periods of time, to prevent harm to the service. If your client cannot abide by 429/Retry-After headers, please do not run it in an automated, repeated fashion until it either handles them properly, or inserts delays long enough that you can be sure you are not launching an avalanche of heavy requests and crowding out other users.
If something takes too long, that's a good moment to ask for help, not to put it in a loop that would hit the server repeatedly for days.
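The 429/Retry-After handling described above could look roughly like this minimal sketch. The endpoint URL is the public WDQS one; the User-Agent string, attempt limit, and default delay are placeholders you would adapt:

```python
# Sketch of a WDQS client that honors 429/Retry-After instead of ignoring it.
import time
import urllib.error
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def retry_delay(headers, default=60):
    """Parse a numeric Retry-After header; fall back to a default pause."""
    try:
        return int(headers.get("Retry-After"))
    except (TypeError, ValueError):
        return default

def run_query(query, max_attempts=5):
    """Run one SPARQL query, sleeping whenever the server answers 429."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-bot/0.1 (you@example.org)"})
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(request) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            time.sleep(retry_delay(err.headers))  # do what the server asks
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

The point is the except branch: a 429 is a signal to back off, not an error to retry immediately.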
If you need to process a massive data set, I would suggest trying the following strategy:
1. Load the primary-key data - like the list of all humans, if that's what you need - into your own storage. For Q5 you can either use the LDF server or parse the dump directly (maybe with Wikidata Toolkit?). For some scenarios even a direct query would be fine, but for Q5 it would probably be too much.
2. Split this data set into palatable batches - say, 100 items per batch; you can experiment with that, and it's fine to cause a couple of timeouts as long as it's not an automated script doing it 20 times a second for a long time. Once you have a sane batch size, run the query that fetches the other data, using a VALUES clause to substitute the primary-key data. Watch for 429 responses - if you're getting them, insert delays or lower the batch size, or ask for help again if that doesn't work.
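A rough sketch of both steps, assuming the standard layout of the Wikidata JSON dump (one entity per line of one huge JSON array); the function names and the batch size of 100 are illustrative, not recommendations:

```python
# Step 1: pull the primary-key data (all humans) out of a JSON dump.
# Step 2: chunk it into batches and substitute each batch via VALUES.
import itertools
import json

def is_human(entity):
    """True if any P31 (instance of) statement points at Q5 (human)."""
    for claim in entity.get("claims", {}).get("P31", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if value.get("id") == "Q5":
            return True
    return False

def humans_from_dump(lines):
    """Step 1: stream the entity IDs of humans out of dump lines."""
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the array brackets around the entities
        entity = json.loads(line)
        if is_human(entity):
            yield entity["id"]

def batches(items, size=100):
    """Step 2a: split the primary-key data into palatable chunks."""
    it = iter(items)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def values_clause(qids):
    """Step 2b: render one batch as a VALUES clause for the real query."""
    return "VALUES ?person { %s }" % " ".join("wd:" + q for q in qids)
```

Each `values_clause` result then replaces the `?person wdt:P31 wd:Q5` pattern in the label/sitelink query, with a delay between batches and backoff on 429.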
Alternatively, segmenting the records by some other criterion may work too, but I don't think a filter like STRSTARTS(?personLabel, "D") is going to be effective - I don't think the Blazegraph query optimizer is smart enough to convert it to an index lookup, and without that it just slows things down by introducing more checks in the query. And even if it did, there are a lot of labels starting with "D", so it probably wouldn't be very useful for speeding things up.
Having said that, I am curious - what exactly are you doing with this data set? Why do you need a list of all humans, and how is the list going to be used? Knowing that may help us devise a better specialized strategy for achieving the same result.
Stas,
That is really good info and ideally should also go under https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_limits
Even better, a new "Best Practices" page could be created and added under the "First Steps" section here: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/Wikidata_Query_H... That "Best Practices" page would then also carry a link and blurb about the "query limits" page.
-Thad