Hello!
Context:
For the needs of inventaire.io, I'm working on a type-filtered
autocomplete, that is, a field with suggestions, but only suggestions
matching a given claim: typically, an "author" input where I would
like to suggest only entities matching the claim P31:Q5 (instance
of -> human).
The dream would be to have a "filter" option in the wbsearchentities
module, to be able to do things like
https://www.wikidata.org/w/api.php?action=wbsearchentities&limit=10&format=json&search=victor&filter=P31:Q5
As far as I know, this isn't possible yet. One could search without
a filter, then fetch the matched entities with their claims data, and
filter on those claims, but this is rather slow for an autocomplete
feature that needs to be snappy. So the alternative approach I have
been working on is to get a subset of a Wikidata dump and put it in
an ElasticSearch instance.
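
To give an idea, the slow approach would look roughly like this
minimal Python sketch (the module names and parameters are the real
API ones; the claim check assumes an item-valued property, which P31
is):

    import requests

    API = 'https://www.wikidata.org/w/api.php'

    def has_claim(entity, prop, target):
        # true if the entity has a claim prop -> target (an item id)
        for statement in entity.get('claims', {}).get(prop, []):
            value = statement['mainsnak'].get('datavalue', {}).get('value', {})
            if value.get('id') == target:
                return True
        return False

    def search_filtered(term, prop='P31', target='Q5', limit=10):
        # 1st round-trip: plain, unfiltered search
        hits = requests.get(API, params={
            'action': 'wbsearchentities', 'search': term,
            'language': 'en', 'limit': limit, 'format': 'json',
        }).json()['search']
        ids = [hit['id'] for hit in hits]
        if not ids:
            return []
        # 2nd round-trip: fetch the matched entities' claims
        entities = requests.get(API, params={
            'action': 'wbgetentities', 'ids': '|'.join(ids),
            'props': 'claims', 'format': 'json',
        }).json()['entities']
        # filter client-side on the claim
        return [i for i in ids if has_claim(entities[i], prop, target)]

The two sequential round-trips are what kills the latency for an
autocomplete.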
Question:
What is the best way to get all the entities matching a given
claim?
My answer so far has been to download a dump, then filter the
entities by claim (a minimal sketch of that step below), but are
there better/less resource-intensive ways?
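
The filtering step itself is simple enough; here is a Python sketch,
assuming the official JSON dump layout (one entity per line inside a
big array, each line ending with a comma; older dumps expose the
target item only as a numeric-id, hence the double check):

    import json, sys

    # reads a dump on stdin, writes matching entities as newline-delimited JSON
    for line in sys.stdin:
        line = line.strip().rstrip(',')
        if line in ('[', ']', ''):
            continue  # skip the array brackets of the official dump
        entity = json.loads(line)
        for statement in entity.get('claims', {}).get('P31', []):
            value = statement['mainsnak'].get('datavalue', {}).get('value', {})
            if value.get('id') == 'Q5' or value.get('numeric-id') == 5:
                print(json.dumps(entity))
                break

The same loop works unchanged on a plain newline-delimited JSON file,
where the bracket/comma handling is simply a no-op.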
The only other alternative I see would be a SPARQL query without a
LIMIT (which in the case of P31:Q5 probably means millions of
results(?)) to get all the desired ids, then using wbgetentities to
fetch the data 50 at a time to stay within the API limits, but those
limits are there for a reason, right?
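
In code, that plan would look something like this sketch (assuming
the public query.wikidata.org endpoint; I would expect the no-LIMIT
query to be painful, or to simply time out, for a class as big as
humans):

    import requests

    query = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 }'  # no LIMIT
    r = requests.get('https://query.wikidata.org/sparql',
                     params={'query': query, 'format': 'json'})
    ids = [row['item']['value'].rsplit('/', 1)[-1]  # entity URI -> Q-id
           for row in r.json()['results']['bindings']]

    # then 50 ids per wbgetentities call, the documented maximum
    for i in range(0, len(ids), 50):
        r = requests.get('https://www.wikidata.org/w/api.php', params={
            'action': 'wbgetentities', 'ids': '|'.join(ids[i:i + 50]),
            'format': 'json'})
        entities = r.json()['entities']
        # ... index entities into the search engine here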
As the people who manage the servers that would be stressed by one
approach or the other, which one seems the less painful to
recommend? ^^
Thanks in advance for any clue!
New tools:
- To make a filtered dump, I wrote a small command-line tool:
wikidata-filter (usage example after this list). It can filter a
dump, but also any set of Wikidata entities in a newline-delimited
JSON file; I hope it can be helpful to other people!
- The whole search engine setup can be found here: wikidata-subset-search-engine
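
Typical usage of wikidata-filter, to keep only humans (see the README
for the full set of options):

    cat wikidata-dump.json | wikidata-filter --claim P31:Q5 > humans.ndjson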
Clues and comments welcome!
Greetings,
Maxime