On 20/10/2018 00:41, Stas Malyshev wrote:
Cparle wants to make sure that people searching
for "clarinet" also get
shown images of "piccolo clarinet" etc.
To make this possible, where an image has been tagged "basset horn" he
is looking to add "clarinet" as an additional keyword, so that if
somebody types "clarinet" into the search box, one of the images
retrieved by ElasticSearch will be the basset horn one.
Generally if the image is tagged with "basset horn" and the user query
is "clarinet", we can do one of the following:
1. Index the whole upstream hierarchy for "basset horn" (presumably we
would have to cut it off when it gets too deep or too abstract) and then
match directly when searching.
2. Expand the hierarchy downstream from "clarinet" at query time and
then match against the indexed tags.
3. Have some manual or automatic process that ensures that both
"clarinet" and "basset horn" are indexed (not necessarily at the same
time), and rely on it to discover the matches.
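To make the difference between (1) and (2) concrete, here is a toy sketch in Python. It is not the actual CirrusSearch/ElasticSearch pipeline; the hierarchy data is invented for illustration. Option (1) walks upstream once at index time and stores the ancestors with the image; option (2) walks downstream at query time, which is the part that can explode.

```python
# Invented subclass-of edges (child -> parent), standing in for the
# real Wikidata hierarchy. Illustration only.
SUBCLASS_OF = {
    "basset horn": "clarinet",
    "piccolo clarinet": "clarinet",
    "clarinet": "woodwind instrument",
    "woodwind instrument": "musical instrument",
}

def ancestors(tag):
    """Walk upstream through the hierarchy (what option 1 would do
    once, at index time)."""
    out = []
    while tag in SUBCLASS_OF:
        tag = SUBCLASS_OF[tag]
        out.append(tag)
    return out

def index_terms(tag):
    """Option 1: store the tag plus all its ancestors with the image,
    so a plain term match on "clarinet" finds the basset horn image."""
    return {tag, *ancestors(tag)}

def descendants(tag):
    """Option 2: expand downstream at query time. In a real hierarchy
    this set can contain thousands of classes."""
    kids = [c for c, p in SUBCLASS_OF.items() if p == tag]
    out = list(kids)
    for k in kids:
        out.extend(descendants(k))
    return out
```

With this toy data, `index_terms("basset horn")` contains "clarinet", so option (1) answers the clarinet query with a direct match, while option (2) has to search for every member of `descendants("clarinet")`.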
The problem with (1) is that if the hierarchy changes, we would have to
do a huge number of updates, which might overwhelm the system; and most
of these updates would not even be for things people search for, but we
have no way to know that in advance.
The problem with (2) is that downstream hierarchies explode very fast:
if you search for "clarinet" and there are 10,000 descendants in the
hierarchy, we can't search for all of them, so you may never get a
chance to find the basset horn. Querying big downstream hierarchies
also takes time, which means a performance hit.
Is this such a problem? It is what people now commonly do with P31/P279*
queries. For example, finding 10K instances of (some subclass of)
building takes 9 secs: http://tinyurl.com/y7e5j5sd
(I think this is one of the more complex hierarchies; maybe you know of
larger downstream hierarchies one could try?) If you omit the labels, it
takes 650ms. That's maybe not quite autocompletion speed yet, but it
seems acceptable for a media search.
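For reference, the query behind the shortened link is not reproduced here, but the kind of P31/P279* query described (instances of some subclass of building) can be sketched as follows. The QID Q41176 ("building") and the endpoint URL are my assumptions, not taken from the link above.

```python
# Sketch of the kind of P31/P279* query discussed above. Labels are
# omitted, which is the faster variant mentioned (~650ms vs ~9s).
# Q41176 is assumed to be Wikidata's item for "building".
QUERY = """
SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q41176 .  # instance of (a subclass of) building
}
LIMIT 10000
"""

# One way to run it (needs network access to query.wikidata.org):
# import requests
# r = requests.get("https://query.wikidata.org/sparql",
#                  params={"query": QUERY, "format": "json"})
# items = [b["item"]["value"] for b in r.json()["results"]["bindings"]]
```

The `wdt:P31/wdt:P279*` property path is exactly the downstream expansion of option (2): the query engine, not the search index, walks the subclass hierarchy at query time.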