Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

15 Aug 2018


Hi!
...
That presents a problem. While you see "instance of": "human", the data
is P31:Q5. We can, of course, put "instance of": "human" in the index.
But what if the label for Q5 changes? Now we have to re-index 10 million
records.
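
To make the re-indexing cost concrete, here is a minimal sketch with toy data. Only P31 and Q5 come from the discussion above; the item IDs, the helper names and the new label are made up for illustration, and this is not how the actual search index is built.

```python
# Toy data: only P31 ("instance of") and Q5 ("human") come from the thread;
# the item IDs Q100..Q102 are hypothetical.
labels = {"Q5": "human"}
items = {
    "Q100": {"P31": "Q5"},
    "Q101": {"P31": "Q5"},
    "Q102": {"P31": "Q5"},
}


def build_label_index(items, labels):
    """Index every statement by the *label* of its value (denormalized)."""
    return {
        item_id: {prop: labels[value] for prop, value in claims.items()}
        for item_id, claims in items.items()
    }


def stale_documents(index, items, labels):
    """Return the IDs of indexed documents that no longer match current labels."""
    return sorted(
        item_id
        for item_id, claims in items.items()
        if any(index[item_id][prop] != labels[value] for prop, value in claims.items())
    )


label_index = build_label_index(items, labels)
id_index = {item_id: dict(claims) for item_id, claims in items.items()}

# One label edit on Q5 (hypothetical new label)...
labels["Q5"] = "human being"

# ...makes every document that mentions Q5 stale in the label-based index,
stale = stale_documents(label_index, items, labels)
# while the ID-based index still holds P31:Q5 and needs no update.
id_index_unchanged = id_index == items
```

With three toy items all three become stale; with 10 million instances of Q5 the same single edit would touch 10 million documents.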


I haven't thought this through, but would it be possible to index just
Q5, and then, when someone searches on "human", to look up all the
items with the label "human", so that the search becomes "human OR Q5"?

That has the potential to explode pretty quickly. Consider a query like
"movie Bruce Willis" - where obviously you want all movies where Bruce
Willis starred. Now, if we search for "movie", we get tons of potential
matches. If we search for "Bruce" and "Willis" - even more. Now if we
stuff all the IDs we've received into our query, we'll get something very
far from what you intended, and the relevance will be pretty bad. Not
to mention that you actually have to run four queries instead of one (4x
load), and the last one is pretty fat, stuffed with all the IDs we've
gathered.
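
A rough sketch of that two-step expansion, just to show the shape of the problem. The lookup table, the IDs in it and the function name are all hypothetical, and the final query is built as a plain string rather than in any real search engine's query syntax.

```python
# Hypothetical label lookup: term -> IDs of items whose label matches the term.
# A real index would return far more IDs per term than this toy table does.
LABEL_TO_IDS = {
    "movie": ["Q1", "Q2", "Q3"],
    "bruce": ["Q10", "Q11"],
    "willis": ["Q11", "Q20", "Q21"],
}


def expand_query(query, label_to_ids):
    """Expand each term into matching item IDs, then build one big OR query.

    Returns the final query string and the total number of queries run
    (one label lookup per term, plus the final search itself).
    """
    terms = query.lower().split()
    ids = []
    lookups = 0
    for term in terms:
        lookups += 1  # one extra round trip per term
        for item_id in label_to_ids.get(term, []):
            if item_id not in ids:
                ids.append(item_id)
    final_query = " OR ".join(terms + ids)
    return final_query, lookups + 1


final_query, total_queries = expand_query("movie Bruce Willis", LABEL_TO_IDS)
print(total_queries)
print(final_query)
```

Even with this tiny table the final query has ten OR clauses and costs four round trips, and nothing in it says which ID was meant to play which role.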
But that's not the end of it - you don't just want any item that is
somehow related to movies - you want items that *are* movies. And you
don't want any item that is somehow related to somebody named "Bruce" or
"Willis". You want the ones in which the famous actor Bruce Willis acted
(or maybe directed). But there's no such information in the query.
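
The difference between "somehow related" and "is a movie starring him" can be sketched like this. The items, property names and values below are invented for the example, and the structured constraints are exactly the information that a plain keyword query does not carry.

```python
# Hypothetical items; property names and values are illustrative only.
ITEMS = {
    "film_a": {"instance_of": ["movie"], "cast": ["bruce_willis"]},
    "film_b": {"instance_of": ["movie"], "cast": ["someone_else"]},
    "book_c": {"instance_of": ["book"], "subject": ["bruce_willis"]},
    "studio_d": {"instance_of": ["company"], "product": ["movie"]},
}


def flat_match(items, wanted):
    """Match any item that mentions any wanted value under any property."""
    return sorted(
        item_id
        for item_id, claims in items.items()
        if any(value in wanted for values in claims.values() for value in values)
    )


def structured_match(items, constraints):
    """Match only items that satisfy every (property, value) constraint."""
    return sorted(
        item_id
        for item_id, claims in items.items()
        if all(value in claims.get(prop, []) for prop, value in constraints)
    )


# What a bag of expanded IDs gives you: everything related in any way.
related = flat_match(ITEMS, {"movie", "bruce_willis"})

# What the user meant, which requires knowing the role of each term.
intended = structured_match(
    ITEMS, [("instance_of", "movie"), ("cast", "bruce_willis")]
)
```

Here `related` contains all four items, while `intended` contains only `film_a`.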
-- 
Stas Malyshev
smalyshev@wikimedia.org
