Hi!
https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently
our query service is a very strong and complete service, but Wikidata
search is very poor. Let's take Blade Runner.
I don't think it's *very* poor anymore, but it certainly can be better.
In my ideal world, everything I see as a human gets indexed into the
search engine, preferably in a per-language index. For example, for Dutch
Err.... The problem is that what you see as a human and what the search
engine uses for lookups are very different things. While for text
articles the two are similar, for structured data they are quite
different, and treating structured data the same way as text is not going
to produce good results, partly because most search algorithms make
assumptions that come from the text world, partly because we'd be
ignoring useful clues present in structured data.
something like a text_nl field with the label, description, aliases,
statements and references in there. So index *everything* and never see
There are such fields, but it makes no sense to put references there,
because there's no such thing as a "Dutch reference". References do not
change with language.
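
A minimal sketch of what such a per-language field could look like, in
Python. The function name, the "text_<lang>" field naming, and the sample
entity (including its made-up ID and Dutch texts) are assumptions for
illustration only; this is not the actual indexing code. The entity shape
loosely follows the Wikibase JSON layout for labels, descriptions and
aliases. References are deliberately left out, since they do not vary by
language.

```python
def build_language_fields(entity, languages):
    """Build one text field per language from language-dependent terms.

    Hypothetical helper: only labels, descriptions and aliases are used,
    because statements and references are not language-specific text.
    """
    fields = {}
    for lang in languages:
        parts = []
        label = entity.get("labels", {}).get(lang)
        if label:
            parts.append(label["value"])
        description = entity.get("descriptions", {}).get(lang)
        if description:
            parts.append(description["value"])
        parts.extend(a["value"] for a in entity.get("aliases", {}).get(lang, []))
        if parts:  # skip languages that have no terms at all
            fields[f"text_{lang}"] = " ".join(parts)
    return fields


# Made-up sample entity; the ID and the texts are placeholders.
entity = {
    "id": "Q_EXAMPLE",
    "labels": {"nl": {"language": "nl", "value": "Blade Runner"}},
    "descriptions": {"nl": {"language": "nl", "value": "film uit 1982"}},
    "aliases": {"nl": [{"language": "nl", "value": "Bladerunner"}]},
}

fields = build_language_fields(entity, ["nl", "de"])
print(fields)
```

Languages without any terms (here "de") simply get no field, which is one
possible design choice rather than a requirement.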
a Qnumber or Pnumber in there (extra incentive for people to add labels
in their language). Probably also everything duplicated in the text
That presents a problem. While you see "instance of": "human", the data
is P31:Q5. We can, of course, put "instance of": "human" in the index.
But what if the label for Q5 changes? Now we have to re-index 10 million
records. And while we're doing it, what if the label of another such item
changes? We'd have to start another million-size reindex. In a
week, we'd have a backlog of hopeless size, or would require processing
power that we just don't have. Note also that ElasticSearch doesn't
really do document updates - it just writes a new document. So frequent
updates to the same document are not its optimal scenario, and we're
talking about propagating each label edit to each item that is linked to
that one. I'm afraid that would explode on us very quickly.
The problem is not indexing labels, the problem is keeping them
up-to-date on 50 million interlinked items.
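
To make the fan-out concrete, here is a rough Python sketch with toy
data. All names (build_index, build_reverse_links, relabel) and the three
sample items are hypothetical; it only illustrates why one label edit
turns into as many document rewrites as there are items embedding that
label, and says nothing about how the real indexer works.

```python
from collections import defaultdict

# Toy data: item -> list of (property, value) statements, stored as ids.
statements = {
    "Q100": [("P31", "Q5")],
    "Q101": [("P31", "Q5")],
    "Q102": [("P31", "Q11424")],
}
labels = {"P31": "instance of", "Q5": "human", "Q11424": "film"}


def document_text(item):
    """Denormalized text for one item: labels instead of P/Q ids."""
    return " ".join(f"{labels[p]} {labels[v]}" for p, v in statements[item])


def build_index():
    """Index every item with label text copied into the document."""
    return {item: document_text(item) for item in statements}


def build_reverse_links():
    """Map each referenced entity to the items whose documents embed it."""
    reverse = defaultdict(set)
    for item, claims in statements.items():
        for prop, value in claims:
            reverse[prop].add(item)
            reverse[value].add(item)
    return reverse


def relabel(entity, new_label, index, reverse):
    """Apply a label edit and rewrite every document that embeds it.

    Returns how many documents had to be re-indexed for this one edit.
    """
    labels[entity] = new_label
    touched = reverse.get(entity, set())
    for item in touched:
        index[item] = document_text(item)  # full rewrite of the document
    return len(touched)


index = build_index()
reverse = build_reverse_links()
rewritten = relabel("Q5", "human being", index, reverse)
print(rewritten)  # 2 of the 3 toy items embed the label of Q5
```

With three items the cost is trivial; the point is that the number of
rewrites grows with the number of items linking to the edited entity, not
with the size of the edit.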
When displaying, it's easy - you don't need to worry until you show it,
and most items are shown only rarely. Even then you see a label out of
date now and then. But with search, you can't update the label on use - when
you want to use it (i.e. look it up), it should already be up-to-date,
otherwise it's useless.
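
The difference between the two paths can be sketched in a few lines of
Python. This is a toy model with hypothetical names (render, search,
search_index), not the actual display or search code: display resolves
labels at the moment of rendering, while the search index only knows the
text that was copied in at indexing time.

```python
labels = {"P31": "instance of", "Q5": "human"}
stored = {"Q100": [("P31", "Q5")]}  # what is actually stored: ids only


def render(item):
    """Display path: labels are looked up when the item is shown."""
    return ", ".join(f"{labels[p]}: {labels[v]}" for p, v in stored[item])


# Search path: label text is copied into the index at indexing time.
search_index = {item: render(item) for item in stored}


def search(term):
    """Return the items whose indexed text contains the term."""
    return [item for item, text in search_index.items() if term in text]


labels["Q5"] = "person"  # the label is edited, nothing is re-indexed yet

print(render("Q100"))    # instance of: person  (fresh on display)
print(search("person"))  # []                   (index has not caught up)
print(search("human"))   # ['Q100']             (index still has old text)
```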
As for implementation: We already have the logic to serialize our json
to the RDF format. Maybe also add a serialization format for this that
is easy to ingest by search engines?
I don't know of any such special format, do you? We of course have JSON
updates to ElasticSearch, but as I noted before, updates are the problem
there, not the format. RDF of course also does not carry denormalized
data, so there we update only the entries that need updating, and fetch
labels on use. We cannot do that for the search index. I don't think the
format is the problem here.
. Making it easier to index not only for our own search would be a nice
added benefit.
Sure, but experience has shown that the strategy of "dump everything
into one huge text" works very poorly in Wikidata. That's why we
implemented specialized search that knows how the structured data
works. If the search sucks less now than it did before, that's the reason.
How feasible is this? Do we already have one or multiple tasks for this
on Phabricator? Phabricator has gotten a bit unclear when it comes to
Wikidata search, I think because of misunderstanding between people what
the goal of the task is. Might be worthwhile spending some time on
structuring that.
Wikidata search tasks would be under "Wikidata" + "Discovery-Search".
There are multiple tasks for it, but if you want to add any, please feel
welcome to browse and add.
--
Stas Malyshev
smalyshev(a)wikimedia.org