I could definitely see a use case for 1) and maybe for 2). For example,
let's say I remember that one movie that Rutger Hauer played in. Just
searching for 'movie rutger hauer' gives back nothing:
https://www.wikidata.org/w/index.php?search=movie+rutger+hauer
While Wikipedia gives back quite a nice list of options:
https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer
If we indexed item properties as well, you could get back Blade
Runner (Q184843), which has Rutger Hauer as one of its 'cast member'
values.
Of course, you could also do this by using the Query Service, but
writing SPARQL is a really high barrier, and it's a lot easier to just
type in a few keywords if you're looking for something.
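To illustrate that barrier, here is a rough sketch of what the same lookup might look like through the Query Service. It is only an assumption of how one could phrase it: the helper name films_with_cast_member is made up for this example, it matches the actor by English label rather than by item ID, it does not escape quotes in the name, and a label match like this may well be slow or time out on the real endpoint. P161 is 'cast member', P31 is 'instance of' and Q11424 is 'film'.

```python
def films_with_cast_member(name, lang="en"):
    """Build a SPARQL query for films that list `name` as a cast member.

    Hypothetical helper for illustration only; it returns the query text
    and does not contact the Wikidata Query Service.
    """
    return f'''
SELECT ?film ?filmLabel WHERE {{
  ?actor rdfs:label "{name}"@{lang} .
  ?film wdt:P31 wd:Q11424 ;
        wdt:P161 ?actor .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}" . }}
}}
LIMIT 50
'''.strip()


query = films_with_cast_member("Rutger Hauer")
print(query)
```

Compare that with typing three keywords into a search box.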
I can understand that 2) might be a bit too hard on the system, though I
think there's some value in it too. Still, having only 1) would be a
big improvement over the current state of things.
Regards,
-- Hay
On Thu, Jul 26, 2018 at 11:48 PM Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
Hi!
Today we are indexing in ElasticSearch almost all string properties
(except a few) and select item properties (P31 and P279). We've been
asked to extend this set and index more item properties
(https://phabricator.wikimedia.org/T199884). We did not do it from the
start because we did not want to add too much data to the index at once,
and wanted to see how the index behaves. To evaluate what this change
would mean, some statistics:
All usage of item properties in statements is about 231 million uses
(according to sqid tool database). Of those, about 50M uses are
"instance of" which we are already indexing. Another 98M uses belong to
two properties - published in (P1433) and cites (P2860) - which leaves
about 86M for the rest of the properties.
So, if we index all the item properties except P2860 and P1433, we'll be
a little more than doubling the amount of data we're storing for this
field, which seems OK. But if we index those too, we'll be essentially
quadrupling it - which may be OK too, but is a bigger jump and one that
may cause some issues.
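As a rough check of those ratios, the arithmetic can be sketched as below. This is only an approximation under stated assumptions: the baseline is taken to be the "instance of" uses alone, because no count is given here for P279, so the real growth factors would be somewhat lower. The figures are the rounded ones quoted above, and the function name growth_factor is made up for this sketch.

```python
# Figures in millions of statement uses, as quoted in this message.
INSTANCE_OF = 50           # P31, already indexed
PUBLISHED_AND_CITES = 98   # P1433 + P2860 combined
REST = 86                  # all other item properties


def growth_factor(added, baseline=INSTANCE_OF):
    """Return the data size after adding `added`, relative to `baseline`.

    Assumption: the baseline ignores P279, which is also indexed today,
    so the result overestimates the true factor.
    """
    return (baseline + added) / baseline


without_big_two = growth_factor(REST)
with_big_two = growth_factor(REST + PUBLISHED_AND_CITES)

print(f"all except P2860 and P1433: about {without_big_two:.1f}x")
print(f"all item properties:        about {with_big_two:.1f}x")
```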
So, we have two questions:
1. Do we want to enable indexing for all item properties? Note that if
you just want to find items with certain statement values, Wikidata
Query Service matches this use case best. It's only in combination with
actual fulltext search where on-wiki search is better.
2. Do we need to index P2860 and P1433 at all, and if so, would it be OK
if we leave them out of the index for now?
Would be glad to hear thoughts on the matter.
Thanks,
--
Stas Malyshev
smalyshev(a)wikimedia.org
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata