Re: [Wikidata-tech] [discovery-private] Wikidata full text search

31 May 2018

Nice write up, Stas!

I worry that the "best possible query" will be confusing or
undesirable because the ranking will be lousy, as you suggest. You can test
it and see—in RelForge you can compare the namespace-specific scoring
function and a proposed generic scoring function for a sample of queries in
a particular namespace and see how much it churns the top results.
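
As a rough illustration of what "churn" could mean here, the sketch below compares the top-k results from two scoring functions for a single query. This is not RelForge's actual metric or API: the function name `top_k_churn` and the choice of Jaccard distance are assumptions made purely for illustration.

```python
def top_k_churn(ranking_a, ranking_b, k=10):
    """Return a churn value in [0.0, 1.0] for the top-k of two rankings.

    Hypothetical helper: churn is taken to be the Jaccard distance between
    the two top-k sets, so 0.0 means identical sets and 1.0 means no overlap.
    Order within the top k is deliberately ignored.
    """
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    if not top_a and not top_b:
        return 0.0  # two empty result lists do not differ
    return 1.0 - len(top_a & top_b) / len(top_a | top_b)


# Made-up page IDs for one query under two scoring functions.
namespace_specific = ["Q42", "Q1", "Q5"]
generic = ["Q42", "Q1", "Q64"]
churn = top_k_churn(namespace_specific, generic, k=3)
```

Averaging this value over a sample of queries would give one number per namespace to compare, though a rank-sensitive measure may be more appropriate if order within the top results matters.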

For the "garden of forking queries" (most excellent poetic naming of
options!!), there is a straightforward though somewhat tedious way of
merging rankings. You can use an empirical distribution function
<https://en.wikipedia.org/wiki/Empirical_distribution_function> on a
largish sample of scores for each ranking function, and then normalize them
to a range of 0.0-1.0. (RelForge even has a tool to generate a more
tractable and reasonably accurate piecewise linear approximation
<https://en.wikipedia.org/wiki/Piecewise_linear_function> given empirical
distribution data.) Effectively, you turn all scores into percentiles, and
assume those are comparable. It's ugly, but if that's the only thing
holding you back from that option, it's doable. It is tedious and a bit
brittle, though, because if the scoring function for any component changes,
you need to regenerate its EDF function—though that task could be automated
to some degree if it comes up regularly.
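
The percentile idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RelForge tool: `make_edf` and `merge_rankings` are hypothetical names, the sample scores are invented, and the exact empirical distribution is used rather than a piecewise linear approximation.

```python
from bisect import bisect_right


def make_edf(sample_scores):
    """Build an empirical distribution function from a sample of raw scores.

    The returned function maps a raw score to the fraction of sampled
    scores that are <= it, i.e. a percentile in the range 0.0-1.0.
    """
    ordered = sorted(sample_scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample score")

    def edf(score):
        return bisect_right(ordered, score) / n

    return edf


def merge_rankings(results_by_ranker, edf_by_ranker):
    """Merge (doc_id, raw_score) lists from several scoring functions.

    Each raw score is converted to a percentile with that scoring
    function's own EDF, and the percentiles are assumed to be comparable.
    """
    merged = []
    for ranker, results in results_by_ranker.items():
        edf = edf_by_ranker[ranker]
        for doc_id, raw_score in results:
            merged.append((edf(raw_score), ranker, doc_id))
    merged.sort(key=lambda entry: entry[0], reverse=True)
    return merged


# Invented score samples on very different scales.
edfs = {
    "item": make_edf([1.0, 2.0, 3.0, 4.0]),
    "lexeme": make_edf([100.0, 200.0, 300.0, 400.0, 500.0]),
}
results = {
    "item": [("Q1", 3.5), ("Q2", 1.5)],
    "lexeme": [("L1", 450.0), ("L2", 150.0)],
}
merged = merge_rankings(results, edfs)
```

If a scoring function changes, only its entry in `edfs` needs to be rebuilt from a fresh sample of scores, which is the regeneration step described above.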

—Trey

Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation

On Wed, May 30, 2018 at 8:26 PM, Stas Malyshev <smalyshev(a)wikimedia.org>
wrote:

...
  Hi!

 While working on fulltext search for Lexemes, I have encountered a
 question which I think needs to be discussed and resolved. The question
 is how fulltext search should be working when dealing with different
 content models and what search should do by default and in specialized
 cases.

 The main challenge in Wikidata is that we are dealing with substantially
 different content models: articles, Items (including Properties,
 because while formally a different type, they are similar enough to
 Items for search to ignore the difference) and Lexemes each organize
 their data in a different way and should be searched using different
 specialized queries. This is currently unique to Wikidata, but SDC
 might eventually have the same challenge to deal with. I've described
 the challenges and questions in more detail here:

 https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Wikidata_search#Fulltext_search

 I'd like to first hear some feedback about what the expectations
 for the combined search are: what is expected to work, how it is
 expected to work, what the defaults are, and what the use cases for
 these are. I have outlined some solutions that were proposed on wiki; if
 you have any comments, please feel welcome to respond either here or on wiki.

 The TLDR version is that doing search on different data models is
 hard, and we would need to sacrifice something to make it work. We need
 to figure out and decide which of these sacrifices are acceptable and
 what is enabled/disabled by default.

 Thanks,
 --
 Stas Malyshev
 smalyshev(a)wikimedia.org

 _______________________________________________
 discovery-private mailing list
 discovery-private(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/discovery-private
