Re: [Wikidata-tech] [discovery-private] Wikidata fulltext search results output

25 Oct 2017

Interesting questions... my comments are inline.

On Tue, Oct 24, 2017 at 8:49 PM, Stas Malyshev <smalyshev(a)wikimedia.org>
wrote:

 Hi!

 As I am working on improving Wikidata fulltext search[1], I'd like to
 talk about search results page. Right now search results page for
 Wikidata is less than ideal, here are the issues I see with it:

 - No match highlighting

I think match highlighting would be nice, but I know it can be tricky in
the edge cases.

 - Meaningless data, like word count (anybody cares to guess what it is
 counting? Anybody ever used it?) and byte count (more useful than word
 count but not by much)

I don't know who is interested in that, so I don't have a strong opinion.

 - Obviously, search quality is not super high, but that should be
 improved with proper description indexing

 While working on improving the situation, I would like to solicit
 opinions on the set of questions about how the search results page
 should look like. Namely:

 1. If the match is made on label/description that does not match current
 display language, we could opt for:
 a) Displaying the description that matched, highlighted. Optionally
 maybe display the language of the match (in display language?)
 b) Displaying the description in display language, un-highlighted.
 Which option is preferable?

I would definitely like to see the label that matched. Even if you don't
know the language, seeing a partial match vs a full match is informative.
If I search for *Москва* and I get back "Moscow" and "Armenian Cemetery", I
don't know what's what. Seeing that Moscow is "Russian: *Москва*" and
Armenian Cemetery is "Russian: Армянское кладбище (*Москва*)" tells me
immediately that Moscow is probably a better match, even if I don't know
any Russian or Cyrillic.

There's a problem, though, which may be why this hasn't been done: *which*
label do you match? For Armenian Cemetery, both Russian and Ukrainian have
"Москва" in the label. For Moscow, there are 18 labels that are "Москва",
another one that is a partial match (Москва балһсн), another that's a folded
match (Мӧсква), and three more that have exact matches in their additional
labels (including English). Unless you can define a hierarchy of languages,
possibly including user languages and the "native" language of an entry,
it's going to be hard to pick one. If I'd searched for *Moskva* and didn't
have English as a user language, it'd be impossible to choose one of the 32
possible languages that are exact matches on the main label. *Moskwa* also
doesn't match any of my user languages, or Russian, but does match a bunch
of other languages. How to choose?
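
To make the "hierarchy of languages" idea concrete, here is a minimal Python
sketch, not anything that exists in the search code. The function names, the
ranking order (preferred language first, then match quality, then language
code), and the placeholder codes lang_a and lang_b are all assumptions for
illustration. The folding is a crude stand-in for whatever the real analysis
chain does, and only single-word queries are handled.

```python
import re
import unicodedata

MATCH_RANK = {"exact": 0, "folded": 1, "partial": 2}


def fold(text):
    """Casefold and strip combining marks (crude stand-in for real folding)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_kind(query, label):
    """Classify a label as an 'exact', 'folded', or 'partial' match, else None."""
    if label == query:
        return "exact"
    if fold(label) == fold(query):
        return "folded"
    # Partial match: the folded query is one of the label's folded words.
    if fold(query) in re.findall(r"\w+", fold(label)):
        return "partial"
    return None


def pick_label(query, labels, preferred_langs):
    """Pick one matching label from a {language code: label} mapping.

    Candidates are ordered by position in preferred_langs (assumed hierarchy:
    user languages, then the entry's "native" language), then by match
    quality, then alphabetically by language code as a last resort.
    """
    candidates = []
    for lang, label in labels.items():
        kind = match_kind(query, label)
        if kind is None:
            continue
        if lang in preferred_langs:
            lang_rank = preferred_langs.index(lang)
        else:
            lang_rank = len(preferred_langs)
        candidates.append((lang_rank, MATCH_RANK[kind], lang, label, kind))
    if not candidates:
        return None
    _, _, lang, label, kind = min(candidates)
    return lang, label, kind


# lang_a and lang_b are placeholder codes, not real language codes.
moscow_labels = {"ru": "Москва", "lang_a": "Мӧсква", "lang_b": "Москва балһсн"}
print(pick_label("Москва", moscow_labels, ["en", "ru"]))
```

When none of the matching languages is in the preferred list, the last
tie-break is alphabetical by language code, which is exactly the arbitrary
choice I'm worried about.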

Any names will have similar problems. "Jacek Moskwa" is the same in all 12
languages with a label. His descriptions say he's Polish, so I guess Polish
is the right answer, but I don't think there's any way to know that.

So, ideally, *I'd* like the name of the language that had a label match
in my display language, with a highlight of the matching bit in the
description from the matched language. But I'm not sure there's a way to get
there. Picking the first one alphabetically that matches will give weird
results.

 2. What we do if the match is on alias? Do we display matching alias,
 original label or both? The question above also applies if the match is
 on other language alias.

I'd want to see both, maybe as "West Germany (*FRG*)" if I search for
FRG. Hey, the autocompletion suggester does that already!

 3. It looks clear to me that words count is useless. Is byte count
 useful and does it need to be kept?

 4. Do we want to display any other parameters of the entity? E.g. we
 have in the index: statement_count, sitelink_count, label_count,
 incoming_links, etc. Do we want to display any?

Statement count is the one that is most interesting to me, but I wonder if
anyone really uses any of these stats. Someone must, but I don't know their
use cases.

 5. Display format for Wikidata and for other wikipedia sites is different:
 Wikipedia:

 Title
 Snippet

 Wikidata:

 Title: Description

 I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on
 the same line, separated by colon. Is there any reason for this
 difference? Do we want to go back to the common format?

I can see that "Title: Description" saves some vertical space, but I would
prefer the description to be on the next line.

 Also if you have any other things/ideas/comments about how fulltext
 search output for wikidata should be, please tell me.

Here is roughly what I'd picture for a search on *Москва*. Since Moscow has
Москва as an additional label in English, I'm not sure if I'd also want to
see a line with "Russian: Москва", so I left it out and used just the English
alias for the city. I also got tired of counting statements on the city, so
I just made something up.

Moscow (*Москва*) (Q649) <https://www.wikidata.org/wiki/Q649>
capital city and the largest city of Russia; separate federal subject of
Russia
386 KB (537 statements) - 08:33, 15 October 2017

Moskva River (Q175117) <https://www.wikidata.org/wiki/Q175117>
Russian: *Москва*
river in Moscow and Moscow region
40 KB (31 statements) - 14:21, 25 September 2017

FC Moscow (Q392115) <https://www.wikidata.org/wiki/Q392115>
Russian: *Москва*
association football club
18 KB (12 statements) - 15:35, 17 October 2017

Moscow 24 (Q1572348) <https://www.wikidata.org/wiki/Q1572348>
Russian: *Москва* 24
television channel
9 KB (14 statements) - 06:13, 11 June 2017

Armenian Cemetery (Q685338) <https://www.wikidata.org/wiki/Q685338>
Russian: Армянское кладбище (*Москва*)
cemetery
8 KB (7 statements) - 10:07, 2 September 2017

... although pulling out the Russian specifically is probably not possible.
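
The layout above could be produced by something like the following Python
sketch. The Result fields and the render and highlight helpers are
hypothetical names invented for this illustration, not part of any existing
Wikibase or CirrusSearch API, and the highlighting is a plain substring
replace rather than real highlighter output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """One search hit; every field name here is an assumption."""
    label: str
    entity_id: str
    description: str
    size_kb: int
    statement_count: int
    timestamp: str
    matched_alias: Optional[str] = None  # alias in the display language
    matched_lang: Optional[str] = None   # name of the other matching language
    matched_label: Optional[str] = None  # label in that other language


def highlight(text, query):
    """Wrap each literal occurrence of the query in asterisks."""
    return text.replace(query, f"*{query}*")


def render(result, query):
    """Render a hit as title line, optional match line, description, stats."""
    title = result.label
    if result.matched_alias:
        title += f" ({highlight(result.matched_alias, query)})"
    lines = [f"{title} ({result.entity_id})"]
    if result.matched_lang and result.matched_label:
        lines.append(f"{result.matched_lang}: {highlight(result.matched_label, query)}")
    lines.append(result.description)
    lines.append(
        f"{result.size_kb} KB ({result.statement_count} statements) - {result.timestamp}"
    )
    return "\n".join(lines)


cemetery = Result(
    label="Armenian Cemetery",
    entity_id="Q685338",
    description="cemetery",
    size_kb=8,
    statement_count=7,
    timestamp="10:07, 2 September 2017",
    matched_lang="Russian",
    matched_label="Армянское кладбище (Москва)",
)
print(render(cemetery, "Москва"))
```

Putting the description on its own line, as in the mock-up, is just the
preference I mentioned for question 5.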

You've set yourself a complicated task!!

 I am sending this to wikidata-tech and discovery team list only for now,
 since it's still work in progress and half-baked, we could open this for
 wider discussion later if necessary.

 [1] https://phabricator.wikimedia.org/T178851

 Thanks,
 --
 Stas Malyshev
 smalyshev(a)wikimedia.org

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
