[+Ruben Verborgh]
Hi Maxime,
I wonder if this is something Ruben's Linked Data Fragments (http://linkeddatafragments.org/) could solve quickly enough. I'll let Ruben chime in (if he wants).
Cheers, Tom
On Thu, Apr 28, 2016 at 12:26 PM, Maxime Lathuilière groups@maxlath.eu wrote:
Hello!
Context: For the needs of inventaire.io, I'm working on a type-filtered autocomplete, that is, a suggestions field where the suggestions must match a given claim: typically an "author" input where I would like to suggest only entities matching the claim P31:Q5 (instance of -> human).
The dream would be to have a "filter" option in the wbsearchentities module, to be able to do things like https://www.wikidata.org/w/api.php?action=wbsearchentities&limit=10&...
As far as I know, this isn't possible yet. One could search without a filter, then fetch the related entities with their claims data, then filter on those claims, but this is rather slow for an autocomplete feature that needs to be snappy. So the alternative approach I have been working on is to get a subset of a Wikidata dump and put it in an ElasticSearch instance.
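For reference, here is roughly what that slower fallback looks like, as a minimal Python sketch using the standard wbsearchentities and wbgetentities modules (the helper name and the P31:Q5 check are just for illustration):

    # Rough sketch of the slow fallback described above: search first, then
    # fetch claims for the hits and filter them client-side.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def search_humans(term, limit=10):
        # 1. Plain prefix search, no claim filter available here
        r = requests.get(API, params={
            "action": "wbsearchentities", "search": term, "language": "en",
            "type": "item", "limit": limit, "format": "json",
        })
        ids = [hit["id"] for hit in r.json().get("search", [])]
        if not ids:
            return []
        # 2. Fetch the claims of those entities (up to 50 ids per request)
        r = requests.get(API, params={
            "action": "wbgetentities", "ids": "|".join(ids),
            "props": "claims", "format": "json",
        })
        entities = r.json().get("entities", {})
        # 3. Keep only the entities that have a P31 claim pointing to Q5
        def is_human(entity):
            for claim in entity.get("claims", {}).get("P31", []):
                value = claim["mainsnak"].get("datavalue", {}).get("value", {})
                if value.get("id") == "Q5" or value.get("numeric-id") == 5:
                    return True
            return False
        return [qid for qid in ids if qid in entities and is_human(entities[qid])]

    print(search_humans("victor hugo"))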
Question: What is the best way to get all the entities matching a given claim? My answer so far was downloading a dump, then filtering the entities by claim, but are there better/less resource-intensive ways? The only other alternative I see would be a SPARQL query without specifying a LIMIT (which in the case of P31:Q5 probably returns millions of results(?)) to get all the desired ids, then using wbgetentities to fetch the data 50 by 50 to work around the API limitations, but those limitations are there for a reason, right? As the people who manage the servers that would be stressed one way or the other, which option seems the less painful to recommend? ^^
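To make that second option concrete, here is a sketch of the SPARQL + wbgetentities route (the query is the obvious one for P31:Q5; with millions of results this is shown for the idea only, not as something to run as-is):

    # Get every ?item with P31:Q5 from the query service, then pull entity
    # data in batches of 50 (the wbgetentities limit for anonymous requests).
    # A query this broad would most likely time out if run for real.
    import requests

    SPARQL = "https://query.wikidata.org/sparql"
    API = "https://www.wikidata.org/w/api.php"

    query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 }"
    r = requests.get(SPARQL, params={"query": query, "format": "json"})
    ids = [b["item"]["value"].rsplit("/", 1)[-1]
           for b in r.json()["results"]["bindings"]]

    entities = {}
    for i in range(0, len(ids), 50):
        batch = ids[i:i + 50]
        r = requests.get(API, params={
            "action": "wbgetentities", "ids": "|".join(batch),
            "props": "labels|claims", "format": "json",
        })
        entities.update(r.json().get("entities", {}))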
Thanks in advance for any clue!
New tools:
- To make a filtered dump, I wrote a small command-line tool:
wikidata-filter. It can filter a dump, but also any set of Wikidata entities in a newline-delimited JSON file; I hope it can be helpful to other people! (A minimal sketch of the filtering idea follows below this list.)
- The whole search engine setup can be found here:
wikidata-subset-search-engine
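The core of the filtering is nothing fancy; here is a minimal sketch of the idea over a newline-delimited JSON dump (an illustration only, not the tool's actual code or command-line interface):

    # Read entities from stdin, write the ones with P31 -> Q5 to stdout.
    import json
    import sys

    def has_claim(entity, prop, target):
        for claim in entity.get("claims", {}).get(prop, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if value.get("id") == target:
                return True
        return False

    for line in sys.stdin:
        line = line.strip().rstrip(",")  # full dump lines end with a comma
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        if has_claim(entity, "P31", "Q5"):
            sys.stdout.write(json.dumps(entity, ensure_ascii=False) + "\n")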
Clues and comments welcome!
Greetings,
Maxime
--
Maxime Lathuilière
maxlath.eu - twitter
inventaire.io - roadmap - code - twitter - facebook
wiki(pedia|data): Zorglub27
for personal emails use max@maxlath.eu instead
Hi Maxime,
(@Tom, thanks for pinging me.)
We have created a self-describing interface for literal search, which can be used for autocompletion. More details here: http://ruben.verborgh.org/publications/vanherwegen_iswc_2015/
Let me know if we can help you!
Best,
Ruben
Hi!
feature that needs to be snappy. So the alternative approach I have been working on is to get a subset of a Wikidata dump and put it in an ElasticSearch instance.
The Linked Data Fragments implementation would probably be useful for that, and I think it would be a good idea to eventually get one for the Wikidata Query Service, but not yet. Also, we do have an ElasticSearch index for Wikidata (that's what drives search on the site), so it would be possible to integrate it with the Query Service too (there's some support for it in Blazegraph), but that's not done yet. So for now I don't think we have a ready-made solution. You could still try a prefix-search or regex-search on the query service, but depending on the query it may be too slow right now.
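For illustration, such a prefix-search attempt could look roughly like the following, combining the claim constraint with a label filter; for a class as large as Q5 this is very likely to be too slow or to time out, which is exactly the problem:

    import requests

    query = """
    SELECT ?item ?label WHERE {
      ?item wdt:P31 wd:Q5 ;
            rdfs:label ?label .
      FILTER(LANG(?label) = "en")
      FILTER(STRSTARTS(LCASE(?label), "victor hug"))
    }
    LIMIT 10
    """
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"})
    for b in r.json()["results"]["bindings"]:
        print(b["item"]["value"], b["label"]["value"])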
*Question: *What is the best way to get all the entities matching a given claim? My answer so far was downloading a dump, then filtering the entities by claim, but are there better/less resource-intensive ways?
Probably not currently without some outside tools. When we get LDF support, then that may be the way :)
@tom thanks for the connection!
@ruben interesting! will read the full paper asap :)
@ruben @stas I'm not very familiar with Linked Data Fragments, so any additional links to get a better understanding of how this could help address this use case are welcome!
Maxime
@ruben @stas I'm not very familiar with Linked Data Fragments, so any additional links to get a better understanding of how this could help address this use case are welcome!
The best way to get started is to try it: http://client.linkeddatafragments.org/ (note that text filtering is not part of this demo)
This covers the most important topics: http://linkeddatafragments.org/in-depth/
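To give a feel for the raw interface behind those pages, here is a rough sketch of a single triple pattern request against one of the public demo datasets (the dataset URL and the parameter names below are assumptions; a proper client discovers them from the fragment's hypermedia controls rather than hardcoding them):

    import requests

    FRAGMENTS = "http://fragments.dbpedia.org/2015/en"  # assumed demo dataset URL

    r = requests.get(FRAGMENTS, params={
        "predicate": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "object": "http://dbpedia.org/ontology/Person",
    }, headers={"Accept": "text/turtle"})

    # The response is one page of matching triples plus hydra metadata
    # (an estimated total count and a link to the next page), which is what
    # lets a client iterate over "all entities matching a claim".
    print(r.text[:2000])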
Ruben