Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

15 Aug 2018

Hoi,
May I remind you all that, as things stand, the "descriptions" in
particular are really problematic. They are often created from Wikipedia
categories and are rarely updated afterwards. Compare this with the
"automated descriptions" that have been around for years.

When new properties are added to an item, the automated description may
change as a result, and this is reflected in every language. These
changed descriptions may be stored until the next update of the item,
generated when needed, and obviously cached. They may be used in building
up a search, and this would be a much bigger incentive for people to
update labels.
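
To make the idea concrete, here is a minimal sketch in Python of how such a
description could be generated from an item's statements. It is not how
Reasonator or any existing tool does it; the function names, the choice of
properties (P106, P31, P569, P570) and the toy labels are assumptions for
illustration only. Note how a missing label shows up as a bare Q-number.

```python
# Toy label table (illustrative data only): entity id -> language -> label.
LABELS = {
    "Q5": {"en": "human", "nl": "mens"},
    "Q36180": {"en": "writer", "nl": "schrijver"},
}

# Toy statements for one item: property id -> list of values.
STATEMENTS = {
    "P31": ["Q5"],        # instance of
    "P106": ["Q36180"],   # occupation
    "P569": ["1952"],     # date of birth (year only, for brevity)
    "P570": ["2001"],     # date of death
}


def label(entity_id, lang, labels):
    """Return the label in `lang`, falling back to the raw id when it is missing."""
    return labels.get(entity_id, {}).get(lang, entity_id)


def automated_description(statements, lang, labels):
    """Build a description from the statements, using current labels in `lang`."""
    # Occupation first, then instance of; the ordering is an arbitrary choice here.
    parts = [
        label(value, lang, labels)
        for prop in ("P106", "P31")
        for value in statements.get(prop, [])
    ]
    text = ", ".join(parts)
    born = (statements.get("P569") or [""])[0]
    died = (statements.get("P570") or [""])[0]
    if born or died:
        text += f" ({born}-{died})"
    return text


print(automated_description(STATEMENTS, "en", LABELS))
print(automated_description(STATEMENTS, "nl", LABELS))
print(automated_description(STATEMENTS, "de", LABELS))  # no German labels in the toy table
```

Because the text is rebuilt from the statements and the current labels,
adding a label or a property changes the description for every item that
uses it, in that language, without editing any stored description.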

Contrary to what some think, labels are updated based on a "need"; this
need is hardly there because Wikidata only appeals to geeks. That is why
the Reasonator approach to labelisation makes so much sense. You see the
missing labels, you add them, and the next item will show the new labels.
Given that people work in domains, it is a sound approach, and it will
also quite quickly improve the quality of "automated descriptions" in any
language.

Did I tell you that I disambiguate items by adding labels and properties in
Wikidata? In Reasonator, when you refresh a "search", you will see, for
instance, a date of birth or death added, making John Smith *that* John
Smith.

Obviously, search could be a lot better and using "automated descriptions"
will make a positive difference.
Thanks,
       GerardM

On 15 August 2018 at 07:20, Stas Malyshev <smalyshev(a)wikimedia.org> wrote:

...
  Hi!

  https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently
  our query service is a very strong and complete service, but Wikidata
  search is very poor. Let's take Blade Runner.

 I don't think it's *very* poor anymore, but it certainly can be better.

  In my ideal world, everything I see as a human gets indexed into the
  search engine preferably in a per language index. For example for Dutch

 Err.... The problem is that what you see as a human and what the search
 engine uses for lookups are very different things. While for text
 articles it is similar, for structured data it's quite different, and
 treating structured data the same way as text is not going to produce
 good results, partially because most search algorithms make assumptions
 that come from the text world, partially because we'd be ignoring useful
 clues present in structured data.

  something like a text_nl field with the label, description, aliases,
  statements and references in there. So index *everything* and never see

 There are such fields, but it makes no sense to put references there,
 because there's no such thing as a "Dutch reference". References do not
 change with language.

  a Qnumber or Pnumber in there (extra incentive for people to add labels
  in their language). Probably also everything duplicated in the text

 That presents a problem. While you see "instance of": "human", the data
 is P31:Q5. We can, of course, put "instance of": "human" in the index.
 But what if the label for Q5 changes? Now we have to re-index 10 million
 records. And while we're doing it, what if another label for such an item
 changes again? We'd have to start another million-size reindex. In a
 week, we'd have a backlog of hopeless size, or we'd require processing
 power that we just don't have. Note also that ElasticSearch doesn't
 really do document updates - it just writes a new document. So frequent
 updates to the same document are not its optimal scenario, and we're
 talking about propagating each label edit to each item that is linked to
 that one. I'm afraid that would explode on us very quickly.
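
 The fan-out can be illustrated with a minimal Python sketch. It is a toy
 model, not how the actual search indexing works; the data, function names
 and document shape are assumptions made up for the example. Each item's
 index document embeds the labels of the entities it links to, so one label
 edit touches every item that references that entity.

```python
from collections import defaultdict

# Toy data (illustrative only): item id -> list of (property, value) statements.
ITEMS = {
    "Q42": [("P31", "Q5")],
    "Q80": [("P31", "Q5")],
    "Q64": [("P31", "Q515")],
}
LABELS = {"P31": "instance of", "Q5": "human", "Q515": "city"}


def render_document(item_id, items, labels):
    """Denormalized index text: labels are copied into the item's document."""
    return [f"{labels[prop]}: {labels[value]}" for prop, value in items[item_id]]


def build_reverse_links(items):
    """Map each referenced entity to the items whose document embeds its label."""
    reverse = defaultdict(set)
    for item_id, statements in items.items():
        for prop, value in statements:
            reverse[prop].add(item_id)
            reverse[value].add(item_id)
    return reverse


def documents_to_reindex(changed_id, reverse):
    """Every item linking to the changed entity needs a freshly written document."""
    return sorted(reverse.get(changed_id, set()))


reverse = build_reverse_links(ITEMS)
print(render_document("Q42", ITEMS, LABELS))
print(documents_to_reindex("Q5", reverse))   # items to rewrite if the Q5 label changes
print(documents_to_reindex("P31", reverse))  # items to rewrite if the P31 label changes
```

 In the toy data a change to the P31 label rewrites all three documents; at
 Wikidata scale the same lookup returns millions of ids per label edit.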

 The problem is not indexing labels, the problem is keeping them
 up-to-date on 50 million interlinked items.

 When displaying, it's easy - you don't need to worry until you show it,
 and most items are shown only rarely. Even then you see a label out of
 date now and then. But with search, you can't update the label on use -
 when you want to use it (i.e. look it up), it should already be
 up-to-date, otherwise it's useless.
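
 A small Python sketch of that asymmetry, under the assumption of a toy
 inverted index built from label text (the names and data are hypothetical,
 and this is not the real indexing code): display resolves the label when
 asked, while the index only knows the label as it was at indexing time.

```python
def build_index(items, labels):
    """Inverted index: label text -> item ids, frozen at indexing time."""
    index = {}
    for item_id, values in items.items():
        for value in values:
            index.setdefault(labels[value], set()).add(item_id)
    return index


def display(item_id, items, labels):
    """Resolve labels at display time, so the current label is used."""
    return [labels[value] for value in items[item_id]]


# Toy data (illustrative only): one item linking to one labelled entity.
items = {"Q42": ["Q5"]}
labels = {"Q5": "human"}

index = build_index(items, labels)
labels["Q5"] = "human being"  # the label is edited after indexing

print(display("Q42", items, labels))      # display picks up the new label
print(index.get("human being", set()))    # search on the new label finds nothing
print(index.get("human", set()))          # the index still answers to the old label
```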

  As for implementation: We already have the logic to serialize our json
  to the RDF format. Maybe also add a serialization format for this that
  is easy to ingest by search engines?

 I don't know any such special format, do you? We of course have JSON
 updates to ElasticSearch, but as I noted before, updates are the problem
 there, not format. RDF of course also does not carry denormalized data,
 so we also update only entries that need updating, and fetch labels on
 use. We cannot do that for the search index. I don't think format here is
 the problem.

  . Making it easier to index not only for our own search would be a nice
  added benefit.

 Sure, but experience has shown that the strategy of "dump everything
 into one huge text" works very poorly in Wikidata. That's why we
 implemented specialized search that knows about how the structured data
 works. If the search sucks less now than it did before, that's the reason.

  How feasible is this? Do we already have one or multiple tasks for this
  on Phabricator? Phabricator has gotten a bit unclear when it comes to
  Wikidata search, I think because of misunderstanding between people what
  the goal of the task is. Might be worthwhile spending some time on
  structuring that.

 Wikidata search tasks would be under "Wikidata" + "Discovery-Search".
 There are multiple tasks for it, but if you want to add any, please feel
 welcome to browse and add.

 --
 Stas Malyshev
 smalyshev(a)wikimedia.org

 _______________________________________________
 Wikidata mailing list
 Wikidata(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata
