Interesting questions... my comments are inline.

On Tue, Oct 24, 2017 at 8:49 PM, Stas Malyshev <smalyshev@wikimedia.org> wrote:
Hi!

As I am working on improving Wikidata fulltext search[1], I'd like to
talk about the search results page. Right now the search results page for
Wikidata is less than ideal. Here are the issues I see with it:

- No match highlighting

I think match highlighting would be nice, but I know it can be tricky in the edge cases.
 
- Meaningless data, like word count (anybody care to guess what it is
counting? Anybody ever used it?) and byte count (more useful than word
count but not by much)

I don't know who is interested in that, so I don't have a strong opinion.
 
- Obviously, search quality is not super high, but that should be
improved with proper description indexing

While working on improving the situation, I would like to solicit
opinions on a set of questions about how the search results page
should look. Namely:

1. If the match is made on label/description that does not match current
display language, we could opt for:
a) Displaying the description that matched, highlighted. Optionally
maybe display the language of the match (in display language?)
b) Displaying the description in display language, un-highlighted.
Which option is preferable?

I would definitely like to see the label that matched. Even if you don't know the language, seeing a partial match vs a full match is informative. If I search for Москва, and I get back "Moscow" and "Armenian Cemetery" I don't know what's what. Seeing that Moscow is "Russian: Москва" and Armenian Cemetery is "Russian: Армянское кладбище (Москва)" tells me immediately that Moscow is probably a better match, even if I don't know any Russian or Cyrillic.

There's a problem, though, which may be why this hasn't been done: which label do you match? For Armenian Cemetery, both Russian and Ukrainian have "Москва" in the label. For Moscow, there are 18 labels that are "Москва", another one that is a partial match (Москва балһсн), another that's a folded match (Мӧсква), and three more that have exact matches in their additional labels (including English). Unless you can define a hierarchy of languages (possibly including user languages and the "native" language of an entry), it's going to be hard to pick one. If I'd searched for Moskva and didn't have English as a user language, it'd be impossible to choose one of the 32 possible languages that are exact matches on the main label. Moskwa also doesn't match any of my user languages, or Russian, but does match a bunch of other languages, so how do you choose?

Any names will have similar problems. "Jacek Moskwa" is the same in all 12 languages with a label. His descriptions say he's Polish, so I guess Polish is the right answer, but I don't think there's any way to know that.

So, ideally, I'd like the name of the language that had a label match in my display language, with a highlight of the matching bit in the description from the matched language, but I'm not sure there's a way to get there. Picking the first one alphabetically that matches will give weird results.
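
To make the "hierarchy of languages" idea a bit more concrete, here's a rough sketch in Python. Everything in it is hypothetical: the function name, the shape of the input (a dict from language code to the label that matched), and the idea that a "native" language is known for the entry are all my assumptions, not anything the search backend provides today. It tries the user's languages first, then the entry's native language, and only then falls back to the first language code alphabetically, which is the option I said gives weird results.

```python
from typing import Dict, List, Optional, Tuple


def pick_display_label(
    matches: Dict[str, str],
    user_languages: List[str],
    native_language: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (language_code, label) for the matching label to show, or None.

    `matches` maps a language code to the label that matched in that language.
    All names here are illustrative assumptions, not an existing API.
    """
    # Preferred order: the user's languages, then the entry's "native" language.
    preferred = list(user_languages)
    if native_language is not None:
        preferred.append(native_language)

    for lang in preferred:
        if lang in matches:
            return lang, matches[lang]

    # Last resort: first matching language code alphabetically.
    # This is arbitrary and can produce odd-looking choices.
    if matches:
        lang = min(matches)
        return lang, matches[lang]

    return None


# Illustrative data only: three languages whose label for Moscow is "Москва".
moscow_matches = {"ru": "Москва", "uk": "Москва", "bg": "Москва"}
print(pick_display_label(moscow_matches, ["en", "uk"]))
print(pick_display_label(moscow_matches, ["en"], native_language="ru"))
print(pick_display_label(moscow_matches, ["en"]))
```

The last call shows the weak spot: with no user-language or native-language match, the choice depends on nothing but the sort order of the language codes.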
 

2. What do we do if the match is on an alias? Do we display the matching
alias, the original label, or both? The question above also applies if the
match is on an alias in another language.

I'd want to see both, maybe as "West Germany (FRG)" if I search for FRG. Hey, the autocompletion suggester does that already!
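
As a tiny illustration of the "label (alias)" format, here's a sketch in Python. The function name and its parameters are made up for this example, and I'm not claiming this is how the suggester actually builds its output; it just shows the display rule I have in mind, including skipping the parenthetical when the alias only differs from the label by case.

```python
from typing import Optional


def format_result_title(label: str, matched_alias: Optional[str] = None) -> str:
    """Build a result title, appending the matched alias in parentheses.

    Hypothetical helper: if the match was on an alias that differs from the
    label, show both; otherwise show the label alone.
    """
    if matched_alias and matched_alias.casefold() != label.casefold():
        return f"{label} ({matched_alias})"
    return label


print(format_result_title("West Germany", "FRG"))
print(format_result_title("West Germany"))
```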
 
3. It looks clear to me that word count is useless. Is byte count
useful and does it need to be kept?

4. Do we want to display any other parameters of the entity? E.g. we
have in the index: statement_count, sitelink_count, label_count,
incoming_links, etc. Do we want to display any?

Statement count is the one that is most interesting to me, but I wonder if anyone really uses any of these stats. Someone must, but I don't know their use cases.
 

5. Display format for Wikidata and for other Wikipedia sites is different:
Wikipedia:

Title
Snippet

Wikidata:

Title: Description

I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on
the same line, separated by colon. Is there any reason for this
difference? Do we want to go back to the common format?

I can see that "Title: Description" saves some vertical space, but I would prefer the description to be on the next line.
 

Also if you have any other things/ideas/comments about how fulltext
search output for wikidata should be, please tell me.

Since Moscow has Москва as an additional label in English, I'm not sure if I'd also want to see a line with "Russian: Москва", too, so I left it out and used just the English alias for the city. I also got tired of counting statements on the city, so I just made something up.

capital city and the largest city of Russia; separate federal subject of Russia
386 KB (537 statements) - 08:33, 15 October 2017

Russian: Москва
river in Moscow and Moscow region
40 KB (31 statements) - 14:21, 25 September 2017

Russian: Москва
association football club
18 KB (12 statements) - 15:35, 17 October 2017

Russian: Москва 24
television channel
9 KB (14 statements) - 06:13, 11 June 2017

Russian: Армянское кладбище (Москва)
cemetery
8 KB (7 statements) - 10:07, 2 September 2017
 
... although pulling out the Russian specifically is probably not possible.

You've set yourself a complicated task!!

 
I am sending this to wikidata-tech and discovery team list only for now,
since it's still work in progress and half-baked, we could open this for
wider discussion later if necessary.

[1] https://phabricator.wikimedia.org/T178851

Thanks,
--
Stas Malyshev
smalyshev@wikimedia.org



Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation