Hi!
As I am working on improving Wikidata fulltext search[1], I'd like to talk about the search results page. Right now the search results page for Wikidata is less than ideal; here are the issues I see with it:
- No match highlighting
- Meaningless data, like word count (anybody care to guess what it is counting? Anybody ever used it?) and byte count (more useful than word count but not by much)
- Obviously, search quality is not super high, but that should be improved with proper description indexing
While working on improving the situation, I would like to solicit opinions on a set of questions about what the search results page should look like. Namely:
1. If the match is made on a label/description that does not match the current display language, we could opt for:
   a) Displaying the description that matched, highlighted. Optionally maybe display the language of the match (in display language?)
   b) Displaying the description in display language, un-highlighted.
   Which option is preferable?
2. What do we do if the match is on an alias? Do we display the matching alias, the original label, or both? The question above also applies if the match is on an alias in another language.
3. It looks clear to me that word count is useless. Is byte count useful and does it need to be kept?
4. Do we want to display any other parameters of the entity? E.g. we have in the index: statement_count, sitelink_count, label_count, incoming_links, etc. Do we want to display any?
5. Display format for Wikidata and for Wikipedia sites is different:
Wikipedia:
Title
Snippet
Wikidata:
Title: Description
I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on the same line, separated by colon. Is there any reason for this difference? Do we want to go back to the common format?
Also if you have any other things/ideas/comments about how fulltext search output for wikidata should be, please tell me.
I am sending this to wikidata-tech and discovery team list only for now, since it's still work in progress and half-baked, we could open this for wider discussion later if necessary.
[1] https://phabricator.wikimedia.org/T178851
Thanks,
Hi Stas,
while you are at it, some things would be very useful to be searchable (maybe some are already by now):
* "primary" (not references/qualifiers) years, for birth/death/floruit etc.
* "primary" string/monolingual values (title, taxon name, etc.)
* "primary" IDs, e.g. VIAF (might cause confusion with years, so maybe only add numerical IDs if 5+ digits?)
Cheers, Magnus
On Wed, Oct 25, 2017 at 1:50 AM Stas Malyshev smalyshev@wikimedia.org wrote:
Stas Malyshev smalyshev@wikimedia.org
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Hi!
while you are at it, some things would be very useful to be search-able (maybe some are already by now):
- "primary" (not references/qualifiers) years, for birth/death/floruit etc.
- "primary" string/monolingual values (title, taxon name, etc.)
- "primary" IDs, e.g. VIAF (might cause confusion with years, so maybe only add numerical IDs if 5+ digits?)
We have the code to index statements already, and we're already indexing P31 and P279. We could index more properties. We don't have syntax or any other way though to actually use those in search - yet, except for boosting (see https://gerrit.wikimedia.org/r/#/c/384632/).
We're looking at which properties to add (nominations welcome, probably in the form of phab ticket?) - since adding them requires full reindex of wikidata (couple of days) we probably don't want to add them one by one but want to collect a set and then do it in one hit.
We also do not have syntax for searching (as in match, instead of boost) by statement values, but it should not be hard - we just need to design proper syntax and implement it (syntaxes are now pluggable, so should not be too big of a problem).
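To make the idea concrete, here is a rough sketch of how primary statement values might be flattened into indexable strings, including the "only numerical IDs with 5+ digits" rule suggested above. Everything in it is an assumption for illustration: the function names, the `property=value` string format and the input shape (a list of `(property, value)` pairs taken from primary statements only) are made up here and do not describe the existing indexing code.

```python
def indexable_value(value: str, min_digits: int = 5) -> bool:
    """Hypothetical filter: skip short all-digit values that could be confused with years."""
    if value.isascii() and value.isdigit():
        return len(value) >= min_digits
    return True


def statement_keywords(statements, properties):
    """Flatten primary statements into 'property=value' strings.

    statements: iterable of (property_id, value) pairs (assumed input shape)
    properties: set of property IDs selected for indexing
    """
    keywords = []
    for prop, value in statements:
        if prop in properties and indexable_value(value):
            keywords.append(f"{prop}={value}")
    return keywords
```

With this sketch, a value such as `Q5` on P31 is kept, a nine-digit identifier is kept, and a bare four-digit number is dropped. Whether years should get their own field rather than being filtered out is left open here.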
Interesting questions... my comments are inline.
On Tue, Oct 24, 2017 at 8:49 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
As I am working on improving Wikidata fulltext search[1], I'd like to talk about search results page. Right now search results page for Wikidata is less than ideal, here are the issues I see with it:
- No match highlighting
I think match highlighting would be nice, but I know it can be tricky in the edge cases.
- Meaningless data, like word count (anybody cares to guess what it is
counting? Anybody ever used it?) and byte count (more useful than word count but not by much)
I don't know who is interested in that, so I don't have a strong opinion.
- Obviously, search quality is not super high, but that should be
improved with proper description indexing
While working on improving the situation, I would like to solicit opinions on the set of questions about how the search results page should look like. Namely:
- If the match is made on label/description that does not match current
display language, we could opt for: a) Displaying the description that matched, highlighted. Optionally maybe display the language of the match (in display language?) b) Displaying the description in display language, un-highlighted. Which option is preferable?
I would definitely like to see the label that matched. Even if you don't know the language, seeing a partial match vs a full match is informative. If I search for *Москва,* and I get back "Moscow" and "Armenian Cemetery" I don't know what's what. Seeing that Moscow is "Russian: *Москва*" and Armenian Cemetery is "Russian: Армянское кладбище (*Москва*)" tells me immediately that Moscow is probably a better match, even if I don't know any Russian or Cyrillic.
There's a problem, though, which may be why this hasn't been done—*which* label do you match? For Armenian Cemetery, both Russian and Ukrainian have "Москва" in the label. For Moscow, there are 18 labels that are "Москва", another one that is a partial match (Москва балһсн), another that's a folded match (Мӧсква), and three more that have exact matches in their additional labels (including English). Unless you can define a hierarchy of languages—possibly including user languages and the "native" language of an entry—it's going to be hard to pick one. If I'd searched for *Moskva* and didn't have English as a user language, it'd be impossible to choose one of the 32 possible languages that are exact matches on the main label. *Moskwa* also doesn't match any of my user languages, or Russian, but does match a bunch of other languages—how to choose?
Any names will have similar problems. "Jacek Moskwa" is the same in all 12 languages with a label. His descriptions say he's Polish, so I guess Polish is the right answer, but I don't think there's any way to know that.
So, ideally, *I'd* like the name of the language that had a label match in my display language, with a highlight of the matching bit in the description from the matched language, but I'm not sure there's a way to get there. Picking the first one alphabetically that matches will give weird results.
- What we do if the match is on alias? Do we display matching alias,
original label or both? The question above also applies if the match is on other language alias.
I'd want to see both, maybe as "West Germany (*FRG*)" if I search for FRG. Hey, the autocompletion suggester does that already!
- It looks clear to me that words count is useless. Is byte count
useful and does it need to be kept?
- Do we want to display any other parameters of the entity? E.g. we
have in the index: statement_count, sitelink_count, label_count, incoming_links, etc. Do we want to display any?
Statement count is the one that is most interesting to me, but I wonder if anyone really uses any of these stats. Someone must, but I don't know their use cases.
- Display format for Wikidata and for other wikipedia sites is different:
Wikipedia:
Title
Snippet
Wikidata:
Title: Description
I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on the same line, separated by colon. Is there any reason for this difference? Do we want to go back to the common format?
I can see that "Title: Description" saves some vertical space, but I would prefer the description to be on the next line.
Also if you have any other things/ideas/comments about how fulltext search output for wikidata should be, please tell me.
Since Moscow has Москва as an additional label in English, I'm not sure if I'd also want to see a line with "Russian: Москва", too, so I left it out and used just the English alias for the city. I also got tired of counting statements on the city, so I just made something up.
Moscow (*Москва*) (Q649) https://www.wikidata.org/wiki/Q649
capital city and the largest city of Russia; separate federal subject of Russia
386 KB (537 statements) - 08:33, 15 October 2017

Moskva River (Q175117) https://www.wikidata.org/wiki/Q175117
Russian: *Москва*
river in Moscow and Moscow region
40 KB (31 statements) - 14:21, 25 September 2017

FC Moscow (Q392115) https://www.wikidata.org/wiki/Q392115
Russian: *Москва*
association football club
18 KB (12 statements) - 15:35, 17 October 2017

Moscow 24 (Q1572348) https://www.wikidata.org/wiki/Q1572348
Russian: *Москва* 24
television channel
9 KB (14 statements) - 06:13, 11 June 2017

Armenian Cemetery (Q685338) https://www.wikidata.org/wiki/Q685338
Russian: Армянское кладбище (*Москва*)
cemetery
8 KB (7 statements) - 10:07, 2 September 2017
... although pulling out the Russian specifically is probably not possible.
You've set yourself a complicated task!!
Trey Jones Sr. Software Engineer, Search Platform Wikimedia Foundation
Hi!
I would definitely like to see the label that matched. Even if you don't know the language, seeing a partial match vs a full match is informative. If I search for /Москва,/ and I get back "Moscow" and "Armenian Cemetery" I don't know what's what. Seeing that Moscow is "Russian: *Москва*" and Armenian Cemetery is "Russian: Армянское кладбище (*Москва*)" tells me immediately that Moscow is probably a better match, even if I don't know any Russian or Cyrillic.
Which kinda tells me maybe we should display both. How about this:
- display the main title in the title position (highlight if it matched)
- if the match was in alias, display under it: (alias: <HIGHLIGHTED MATCH>)
- if the match was in different language, display instead under it: (Russian: <HIGHLIGHTED RUSSIAN MATCH>)
- then display description in display language (highlight if it matched)
- then under it, if different language description matched, display it below: (Russian: <HIGHLIGHTED RUSSIAN DESCRIPTION>)
That'd be a bit more work, but I *think* it should be doable (famous last words :).
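The layout proposed above could be sketched roughly as follows. This is only an illustration: the `Hit` class, its field names and the `render` function are made up for this example and are not part of any existing code, and highlighting is shown with asterisks as in the examples earlier in the thread. It also assumes that a match in another language replaces the alias line, which is one reading of "display instead under it".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """Hypothetical container for one search result, already highlighted."""
    title: str                                           # label in the display language
    description: str                                     # description in the display language
    alias_match: Optional[str] = None                    # alias that matched, if any
    other_label: Optional[Tuple[str, str]] = None        # (language name, matching label)
    other_description: Optional[Tuple[str, str]] = None  # (language name, matching description)


def render(hit: Hit) -> List[str]:
    """Return the display lines for one result, top to bottom."""
    lines = [hit.title]
    if hit.other_label is not None:
        lang, text = hit.other_label
        lines.append(f"({lang}: {text})")
    elif hit.alias_match is not None:
        lines.append(f"(alias: {hit.alias_match})")
    lines.append(hit.description)
    if hit.other_description is not None:
        lang, text = hit.other_description
        lines.append(f"({lang}: {text})")
    return lines
```

For the Moskva River example, `render(Hit("Moskva River", "river in Moscow and Moscow region", other_label=("Russian", "*Москва*")))` gives the title, then `(Russian: *Москва*)`, then the description.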
Any names will have similar problems. "Jacek Moskwa" is the same in all 12 languages with a label. His descriptions say he's Polish, so I guess Polish is the right answer, but I don't think there's any way to know that.
The current display language will always take priority, I think. Then comes the fallback chain (we'll still use fallbacks for these purposes I think). If none of these match, I guess it'd be up to the source_text field, in which case we probably won't have any highlighting working (since highlighting is useless on that) but maybe we could try some tricks with the highlighter to get at least something out of it.
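As a sketch of that priority order: the function below picks which matching language to show, trying the display language first and then the fallback chain. The function name and the way the inputs are passed are assumptions for illustration, and the example fallback chain is hypothetical; returning `None` stands for the case where nothing in the chain matched and the caller has to fall back to the un-highlighted text.

```python
from typing import Iterable, Optional, Set


def pick_match_language(
    matched_langs: Set[str],
    display_lang: str,
    fallback_chain: Iterable[str],
) -> Optional[str]:
    """Choose the language whose match should be displayed.

    Order: display language, then each fallback in turn.
    Returns None when none of them matched.
    """
    for lang in [display_lang, *fallback_chain]:
        if lang in matched_langs:
            return lang
    return None
```

Note this deliberately does not pick among the remaining languages: as discussed in the thread, there is no obvious ordering for them.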
I'd want to see both, maybe as "West Germany (*FRG*)" if I search for FRG. Hey, the autocompletion suggester does that already!
Yep, but it has completely different GUI, so I can't really reuse it. I can take inspiration from it though :)
I can see that "Title: Description" saves some vertical space, but I would prefer the description to be on the next line.
Same here. Especially as - per above - we could have more than one of each.
Moscow (*Москва*) (Q649) https://www.wikidata.org/wiki/Q649
capital city and the largest city of Russia; separate federal subject of Russia
386 KB (537 statements) - 08:33, 15 October 2017

Moskva River (Q175117) https://www.wikidata.org/wiki/Q175117
Russian: *Москва*
river in Moscow and Moscow region
40 KB (31 statements) - 14:21, 25 September 2017

FC Moscow (Q392115) https://www.wikidata.org/wiki/Q392115
Russian: *Москва*
association football club
18 KB (12 statements) - 15:35, 17 October 2017

Moscow 24 (Q1572348) https://www.wikidata.org/wiki/Q1572348
Russian: *Москва* 24
television channel
9 KB (14 statements) - 06:13, 11 June 2017

Armenian Cemetery (Q685338) https://www.wikidata.org/wiki/Q685338
Russian: Армянское кладбище (*Москва*)
cemetery
8 KB (7 statements) - 10:07, 2 September 2017

... although pulling out the Russian specifically is probably not possible.
Looks good to me, I'll see which ones of these are actually possible :) And many thanks for detailed feedback, that's exactly what I was looking for.
All really great suggestions and thanks for seeing which ones we can do, Stas! :) Let me know if you want/need help with tickets, UI work, etc.
Cheers,
Deb
-- deb tankersley Product Manager, Discovery Wikimedia Foundation
On Wed, Oct 25, 2017 at 12:54 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
-- Stas Malyshev smalyshev@wikimedia.org
discovery-private mailing list discovery-private@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery-private
Hey :)
Thanks for getting this started.
On Wed, Oct 25, 2017 at 2:49 AM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
As I am working on improving Wikidata fulltext search[1], I'd like to talk about search results page. Right now search results page for Wikidata is less than ideal, here are the issues I see with it:
- No match highlighting
- Meaningless data, like word count (anybody cares to guess what it is
counting? Anybody ever used it?) and byte count (more useful than word count but not by much)
- Obviously, search quality is not super high, but that should be
improved with proper description indexing
While working on improving the situation, I would like to solicit opinions on the set of questions about how the search results page should look like. Namely:
- If the match is made on label/description that does not match current
display language, we could opt for: a) Displaying the description that matched, highlighted. Optionally maybe display the language of the match (in display language?) b) Displaying the description in display language, un-highlighted. Which option is preferable?
When showing labels from fallback languages we do have little language indicators in other places. I believe we should have this here as well. Otherwise I believe it is confusing where certain labels suddenly come from because you might not see them when going to the actual item.
- What we do if the match is on alias? Do we display matching alias,
original label or both? The question above also applies if the match is on other language alias.
I'm slightly leaning toward showing both.
- It looks clear to me that words count is useless. Is byte count
useful and does it need to be kept?
It helps in the cases where you want to get an understanding about how large an item is and if it is worth your attention. If people actually use it... Not sure. They definitely do in recent changes and history.
- Do we want to display any other parameters of the entity? E.g. we
have in the index: statement_count, sitelink_count, label_count, incoming_links, etc. Do we want to display any?
I'd say in this case we could get rid of the word/byte count. To get a good glimpse of the quality of the item I'd say we'd want to show count of statements (excluding identifier statements), identifiers and sitelinks.
- Display format for Wikidata and for other wikipedia sites is different:
Wikipedia:
Title
Snippet
Wikidata:
Title: Description
I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on the same line, separated by colon. Is there any reason for this difference? Do we want to go back to the common format?
Not sure if we had a reason tbh.
Hi!
When showing labels from fallback languages we do have little language indicators in other places. I believe we should have this here as well.
Makes sense. I'll look into how to get those. Is the language code OK or do we need the full language name (uk vs. Ukrainian)?
One thing to note here is that secondary languages have no order - i.e. if you look in German, and there's no matching German label, but there are 10 other language labels all the same (happens a lot for names & places), which language will be selected is anybody's guess. We could add a rule that says "look at English as secondary first", in theory, but not sure whether we should - after all, besides having most languages (and us speaking it :) there's not much special about it.
I'm slightly leaning toward showing both.
OK.
I'd say in this case we could get rid of the word/byte count. To get a good glimpse of the quality of the item I'd say we'd want to show count of statements (excluding identifier statements), identifiers and sitelinks.
OK, I'll try to make this.
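A rough sketch of what such a stats line could look like, replacing the word/byte count. It assumes that identifier statements can be recognised by their datatype being `external-id`, and that the caller passes one datatype string per statement; the function name and output wording are made up for illustration.

```python
from typing import Iterable


def stats_line(statement_datatypes: Iterable[str], sitelink_count: int) -> str:
    """Build a one-line summary: statements (without identifiers), identifiers, sitelinks."""
    datatypes = list(statement_datatypes)
    # Assumption: identifier statements are those with the "external-id" datatype.
    identifiers = sum(1 for dt in datatypes if dt == "external-id")
    statements = len(datatypes) - identifiers
    return f"{statements} statements, {identifiers} identifiers, {sitelink_count} sitelinks"
```

Singular/plural handling and localisation are left out to keep the sketch short.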
- Display format for Wikidata and for other wikipedia sites is different:
Wikipedia:
Title
Snippet
Wikidata:
Title: Description
I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on the same line, separated by colon. Is there any reason for this difference? Do we want to go back to the common format?
Not sure if we had a reason tbh.
OK then, I'll feel free to shuffle things around :) Having more freedom in the title line is good because we can then display both label & aliases.
Thanks!
On Nov 3, 2017 08:39, "Stas Malyshev" smalyshev@wikimedia.org wrote:
Hi!
When showing labels from fallback languages we do have little language indicators in other places. I believe we should have this here as well.
Makes sense. I'll look into how to get those. Is the language code OK or do we need the full language name (uk vs. Ukrainian)?
In the other places we show the language name. I'd say we should do the same here if possible.
One thing to note here is that secondary languages have no order - i.e. if you look in German, and there's no matching German label, but there are 10 other language labels all the same (happens a lot for names & places), which language will be selected is anybody's guess. We could add rule that says "look at English as secondary first", in theory, but not sure whether we should - after all, besides having most languages, (and us speaking it :) there's not much special about it.
Uhhh yeah. I don't have a better idea either TBH.