Interesting questions... my comments are inline.
On Tue, Oct 24, 2017 at 8:49 PM, Stas Malyshev <smalyshev(a)wikimedia.org>
wrote:
Hi!
As I am working on improving Wikidata fulltext search[1], I'd like to
talk about the search results page. Right now the search results page for
Wikidata is less than ideal. Here are the issues I see with it:
- No match highlighting
I think match highlighting would be nice, but I know it can be tricky in
the edge cases.
- Meaningless data, like word count (anybody care to guess what it is
counting? Anybody ever used it?) and byte count (more useful than word
count, but not by much)
I don't know who is interested in that, so I don't have a strong opinion.
- Obviously, search quality is not super high, but that should be
improved with proper description indexing
While working on improving the situation, I would like to solicit
opinions on a set of questions about how the search results page
should look. Namely:
1. If the match is made on label/description that does not match current
display language, we could opt for:
a) Displaying the description that matched, highlighted. Optionally
maybe display the language of the match (in display language?)
b) Displaying the description in display language, un-highlighted.
Which option is preferable?
I would definitely like to see the label that matched. Even if you don't
know the language, seeing a partial match vs a full match is informative.
If I search for *Москва* and I get back "Moscow" and "Armenian Cemetery", I
don't know what's what. Seeing that Moscow is "Russian: *Москва*" and
Armenian Cemetery is "Russian: Армянское кладбище (*Москва*)" tells me
immediately that Moscow is probably a better match, even if I don't know
any Russian or Cyrillic.
There's a problem, though, which may be why this hasn't been done: *which* label
do you match? For Armenian Cemetery, both Russian and Ukrainian have "Москва"
in the label. For Moscow, there are 18 labels that are "Москва", another
one that is a partial match (Москва балһсн), another that's a folded match
(Мӧсква), and three more that have exact matches in their additional labels
(including English). Unless you can define a hierarchy of languages, possibly
including user languages and the "native" language of an entry, it's going to
be hard to pick one. If I'd searched for *Moskva* and didn't have English as a
user language, it'd be impossible to choose one of the 32 possible languages
that are exact matches on the main label. *Moskwa* also doesn't match any of my
user languages, or Russian, but does match a bunch of other languages. How to
choose?
Any names will have similar problems. "Jacek Moskwa" is the same in all 12
languages with a label. His descriptions say he's Polish, so I guess Polish
is the right answer, but I don't think there's any way to know that.
So, ideally, *I'd* like the name of the language that had a label match,
shown in my display language, with a highlight of the matching bit in the
description from the matched language, but I'm not sure there's a way to get
there. Picking the first one alphabetically that matches will give weird
results.
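To make the hierarchy idea concrete, here is a minimal Python sketch of one way to pick the label to show. It is only an illustration under stated assumptions: the function name `pick_matched_label`, its parameters, and the idea that an entity's "native" language is available as input are all hypothetical, and nothing here reflects how the search code actually works. It also does plain case-insensitive matching only, so a folded match like Мӧсква would not be found.

```python
from typing import Dict, List, Optional, Tuple


def pick_matched_label(
    query: str,
    labels: Dict[str, str],
    user_languages: List[str],
    native_language: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (language_code, label) for the label to display, or None.

    Hypothetical helper: `labels` maps language codes to label text, and
    `native_language` is assumed to be known, which may not be realistic.
    """
    needle = query.casefold()
    # Exact matches are preferred over partial (substring) matches.
    exact = {lang: text for lang, text in labels.items()
             if text.casefold() == needle}
    partial = {lang: text for lang, text in labels.items()
               if needle in text.casefold() and lang not in exact}

    # Language hierarchy: user languages first, then the "native" language.
    preferred = list(user_languages)
    if native_language is not None:
        preferred.append(native_language)

    for candidates in (exact, partial):
        for lang in preferred:
            if lang in candidates:
                return lang, candidates[lang]
        if candidates:
            # Last resort: first language code alphabetically. This is the
            # arbitrary choice that tends to give weird results.
            lang = min(candidates)
            return lang, candidates[lang]
    return None
```

With no user language or native language among the matches, the function falls through to the alphabetical choice, which is exactly the case that is hard to get right.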
2. What do we do if the match is on an alias? Do we display the matching
alias, the original label, or both? The question above also applies if the
match is on an alias in another language.
I'd want to see both, maybe as "West Germany (*FRG*)" if I search for
FRG. Hey, the autocompletion suggester does that already!
3. It looks clear to me that word count is useless. Is byte count
useful and does it need to be kept?
4. Do we want to display any other parameters of the entity? E.g. we
have in the index: statement_count, sitelink_count, label_count,
incoming_links, etc. Do we want to display any?
Statement count is the one that is most interesting to me, but I wonder if
anyone really uses any of these stats. Someone must, but I don't know their
use cases.
5. The display format for Wikidata differs from that of the Wikipedia sites:
Wikipedia:
Title
Snippet
Wikidata:
Title: Description
I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on
the same line, separated by colon. Is there any reason for this
difference? Do we want to go back to the common format?
I can see that "Title: Description" saves some vertical space, but I would
prefer the description to be on the next line.
Also, if you have any other ideas or comments about how fulltext
search output for Wikidata should look, please tell me.
Here's a rough mock-up of what I have in mind. Since Moscow has Москва as an
additional label in English, I'm not sure I'd also want to see a line with
"Russian: Москва", so I left it out and used just the English alias for the
city. I also got tired of counting statements on the city, so I just made
something up.
Moscow (*Москва*) (Q649) <https://www.wikidata.org/wiki/Q649>
capital city and the largest city of Russia; separate federal subject of
Russia
386 KB (537 statements) - 08:33, 15 October 2017
Moskva River (Q175117) <https://www.wikidata.org/wiki/Q175117>
Russian: *Москва*
river in Moscow and Moscow region
40 KB (31 statements) - 14:21, 25 September 2017
FC Moscow (Q392115) <https://www.wikidata.org/wiki/Q392115>
Russian: *Москва*
association football club
18 KB (12 statements) - 15:35, 17 October 2017
Moscow 24 (Q1572348) <https://www.wikidata.org/wiki/Q1572348>
Russian: *Москва* 24
television channel
9 KB (14 statements) - 06:13, 11 June 2017
Armenian Cemetery (Q685338) <https://www.wikidata.org/wiki/Q685338>
Russian: Армянское кладбище (*Москва*)
cemetery
8 KB (7 statements) - 10:07, 2 September 2017
... although pulling out the Russian specifically is probably not possible.
You've set yourself a complicated task!!
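For what it's worth, the layout in my mock-up could be sketched roughly as below. This is only an illustration: the `SearchResult` fields and the `render_result` function are hypothetical names I made up, not anything from the actual search code, and it assumes the matched alias or matched foreign-language label has already been picked and highlighted upstream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    """Hypothetical container for one result; field names are assumptions."""
    label: str
    entity_id: str
    description: str
    size_kb: int
    statement_count: int
    timestamp: str
    # Alias in the display language that matched the query, if any.
    matched_alias: Optional[str] = None
    # Display name of the language of a matching foreign label, if any.
    matched_language: Optional[str] = None
    # The foreign label itself, with the hit already wrapped in *...*.
    matched_label: Optional[str] = None


def render_result(result: SearchResult) -> str:
    """Render one result as plain text: title, optional match line,
    description, then a stats line."""
    title = result.label
    if result.matched_alias:
        # Same shape as "West Germany (*FRG*)".
        title += f" (*{result.matched_alias}*)"
    title += f" ({result.entity_id})"

    lines = [title]
    if result.matched_language and result.matched_label:
        lines.append(f"{result.matched_language}: {result.matched_label}")
    lines.append(result.description)
    lines.append(
        f"{result.size_kb} KB ({result.statement_count} statements)"
        f" - {result.timestamp}"
    )
    return "\n".join(lines)
```

Keeping the description on its own line, rather than "Title: Description", leaves room for the optional matched-language line between the two.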
I am sending this to the wikidata-tech and discovery team lists only for now,
since it's still work in progress and half-baked; we could open this for
wider discussion later if necessary.
[1] https://phabricator.wikimedia.org/T178851
Thanks,
--
Stas Malyshev
smalyshev(a)wikimedia.org
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation