Wikidata full text search


Stas Malyshev

31 May 2018, 2:26 a.m.

Hi!

While working on fulltext search for Lexemes, I have encountered a question which I think needs to be discussed and resolved. The question is how fulltext search should be working when dealing with different content models and what search should do by default and in specialized cases.

The main challenge in Wikidata is that we are dealing with substantially different content models: articles, Items (including Properties, because while they are formally a different type, they are similar enough to Items for search to ignore the difference) and Lexemes. Each organizes its data in a different way and should be searched using different specialized queries. This is currently unique to Wikidata, but SDC might eventually have the same challenge to deal with. I've described the challenges and open questions in more detail here:

https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Wikidata_search#Fulltext_...

I'd like to first hear some feedback about what the expectations for the combined search are: what is expected to work, how it is expected to work, what the defaults are, and what the use cases for these are. I have outlined on wiki some solutions that were proposed. If you have any comments, please feel welcome to respond either here or on wiki.

The TL;DR version is that doing search on different data models is hard, and we will need to sacrifice something to make it work. We need to figure out and decide which of these sacrifices are acceptable and what is enabled or disabled by default.

Thanks,

-- Stas Malyshev smalyshev@wikimedia.org


Lydia Pintscher

31 May 2018, 11:25 a.m.

Hey Stas,

Thanks for digging into this and writing it down! Would there be any drawback with the following steps as way forward and possibility to learn more as we go?

1. We return results for the Lexeme namespace only when people explicitly select it
2. We get feedback
3. We go the "Best possible query" route when people select all namespaces
4. We get feedback
5. We go the "Best possible query" route for all searches if feedback indicates this is useful (I don't know at this point)

Cheers
Lydia

-- Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V., Tempelhofer Ufer 23-24, 10963 Berlin, www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 Nz. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.

Stas Malyshev

11:31 p.m.

Hi!

> Would there be any drawback with the following steps as way forward and possibility to learn more as we go?
> 1. We return results for the Lexeme namespace only when people explicitly select it

If you mean "it and only it" (as opposed to Lexemes + any other namespace), then yes, this is doable and this is probably what I am going to start with. However, a lot of people, as I observed with several community members, tend to use the "All" option and expect it to work.

> 2. We get feedback
> 3. We go the "Best possible query" route when people select all namespaces
> 4. We get feedback
> 5. We go the "Best possible query" route for all searches if feedback indicates this is useful (I don't know at this point)

I am not sure which mode is best for Wikidata now. There are at least several plausible ways to go by default for Special:Search:

1. Search in Items only
2. Search in Items + Properties
3. Search in Items + Properties + Lexemes
4. Search in Items + Lexemes
5. Any of the above plus some of the article spaces (i.e. Wikidata or Help)

Every option except 1 and 2 requires mixed search to be working, but choosing the default is a separate decision from implementing mixed search.

-- Stas Malyshev smalyshev@wikimedia.org

Trey Jones

10:56 p.m.

New subject: [discovery-private] Wikidata full text search

Nice write up, Stas!

I worry that the "best possible query" will be confusing or undesirable because the ranking will be lousy, as you suggest. You can test it and see: in RelForge you can compare the namespace-specific scoring function and a proposed generic scoring function for a sample of queries in a particular namespace and see how much it churns the top results.

For the "garden of forking queries" (most excellent poetic naming of options!!), there is a straightforward though somewhat tedious way of merging rankings. You can use an empirical distribution function (https://en.wikipedia.org/wiki/Empirical_distribution_function) on a largish sample of scores for each ranking function, and then normalize them to a range of 0.0-1.0. (RelForge even has a tool to generate a more tractable and reasonably accurate piecewise linear approximation (https://en.wikipedia.org/wiki/Piecewise_linear_function) given empirical distribution data.) Effectively, you turn all scores into percentiles, and assume those are comparable. It's ugly, but if that's the only thing holding you back from that option, it's doable. It is tedious and a bit brittle, though, because if the scoring function for any component changes, you need to regenerate its EDF. That task could be automated to some degree if it comes up regularly.
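To make the percentile idea concrete, here is a minimal sketch in Python. It is not RelForge code and not what CirrusSearch does; the function name build_edf and the sample scores are made up for illustration. It only shows how an empirical distribution function built from a sample of raw scores maps scores from two differently scaled ranking functions onto the same 0.0-1.0 range.

```python
from bisect import bisect_right


def build_edf(sample_scores):
    """Return a function mapping a raw score to its percentile in [0.0, 1.0].

    The percentile is the fraction of sampled scores that are less than
    or equal to the given score (the empirical distribution function).
    """
    ordered = sorted(sample_scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample score")

    def edf(score):
        # bisect_right counts how many sampled scores are <= score.
        return bisect_right(ordered, score) / n

    return edf


# Hypothetical samples: the two scoring functions use very different scales.
item_scores = [2.0, 5.0, 9.0, 14.0, 30.0]
lexeme_scores = [0.1, 0.2, 0.4, 0.7, 0.9]

item_edf = build_edf(item_scores)
lexeme_edf = build_edf(lexeme_scores)

# Both raw scores sit at the same rank within their own sample,
# so they normalize to the same percentile.
print(item_edf(9.0))    # 0.6
print(lexeme_edf(0.4))  # 0.6
```

In practice the sample would need to be much larger than five scores, and it would have to be regenerated whenever the underlying scoring function changes.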

—Trey

Trey Jones Sr. Software Engineer, Search Platform Wikimedia Foundation


Stas Malyshev

11:19 p.m.

New subject: [discovery-private] Wikidata full text search

Hi!

> I worry that with the "best possible query" will be confusing or undesirable because the ranking will be lousy, as you suggest. You can

True, but as the other options do not solve the ranking problem either, we at least get something stable and predictable there. I am not very clear on what ranking on mixed searches means anyway, so maybe a lousy one is ok as long as the user's request is "just find me something".

> For the "garden of forking queries" (most excellent poetic naming of options!!), there is a straightforward though somewhat tedious way of merging rankings. You can use an empirical distribution function

Thanks, it sounds like a good idea, but I assume that to implement it we'll need:

1. A distribution profile for each of the query types, which I assume will be highly specialized
2. Some code that actually does the score merging inside Elastic (since in order to do pagination we need Elastic to do all the ranking)

And the ranking would still suck initially until we collected a proper distribution. This would also make getting this system set up pretty non-trivial: after deploying the code, we'd need to collect the stats, calculate the distribution, and then feed it back to the code, which for an open-source component like Wikibase sounds a bit sub-optimal. Still, this is excellent information which at least gives us a theoretical way forward here, even though it requires a lot of work.
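For illustration only, the two pieces listed above could look roughly like the following Python sketch. Everything in it is an assumption: the breakpoints stand in for per-query-type distribution profiles that would have to be collected from real traffic, the titles are placeholders, and the merging is done in application code, whereas the real thing would have to happen inside Elastic for pagination to work.

```python
from bisect import bisect_right


def piecewise_linear(breakpoints):
    """Build a score normalizer from sorted (raw_score, percentile) pairs."""
    xs = [x for x, _ in breakpoints]
    ys = [y for _, y in breakpoints]

    def normalize(score):
        # Clamp scores that fall outside the profiled range.
        if score <= xs[0]:
            return ys[0]
        if score >= xs[-1]:
            return ys[-1]
        # Interpolate linearly within the segment containing the score.
        i = bisect_right(xs, score)
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (score - x0) / (x1 - x0)

    return normalize


# Hypothetical distribution profiles, one per query type.
PROFILES = {
    "item": piecewise_linear([(0.0, 0.0), (10.0, 0.5), (30.0, 1.0)]),
    "lexeme": piecewise_linear([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]),
}


def merge_results(results):
    """Merge (title, query_type, raw_score) hits by normalized score."""
    scored = [(PROFILES[qtype](raw), title) for title, qtype, raw in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# Placeholder hits from two specialized queries with incomparable raw scores.
hits = [
    ("item-A", "item", 20.0),
    ("lexeme-B", "lexeme", 0.9),
    ("item-C", "item", 5.0),
]
merged = merge_results(hits)
print(merged)  # ['lexeme-B', 'item-A', 'item-C']
```

The sketch also shows the deployment problem: the PROFILES table is data that only exists after stats have been collected, so a fresh Wikibase install would have nothing sensible to put there.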

-- Stas Malyshev smalyshev@wikimedia.org


wikidata-tech@lists.wikimedia.org
