To keep the message size down, I'm going to trim heavily.
=Results from other wikis=
I understand that the method used by Kolkus and Rehurek is
dictionary-based (word unigrams)? It should outperform cybozu (char
n-gram based) on small texts. I think that's true if the text is like
tweets, with short phrases, but it may not work properly for names. This
certainly deserves some tests on real data.
Yeah, names are a pain for lots of reasons. n-grams may help categorize
them ethnolinguistically, similarly to language identification, but that
doesn't tell you where to search. For example, Célia Šašić is a German
footballer with a French first name and Croatian last name (by
marriage)—and she's not in the Croatian wiki, though she is in English,
German, French, and others. Did I mention that names are a pain?
A couple more interesting ideas from a couple of papers (though there's
always a danger of falling down the literature rabbit hole):
Looking at tweets:
http://ceur-ws.org/Vol-1228/tweetlid-1-gamallo.pdf
- good results on tweets with Naive Bayes classifier built on words,
and decent results with a simple ranked list of the top N words
- in both cases they added simple suffix scoring to get what I think of
as the best bit of n-grams
Looking at "query-style" texts:
http://www.uni-weimar.de/medien/webis/publications/papers/lipka_2010.pdf
- claim good results with a Naive Bayes classifier built on
n-grams—though they use 4-grams and 5-grams
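For concreteness, here's a minimal sketch of the kind of character n-gram Naive Bayes classifier those papers describe. The two-sentence training sets, trigram size, and add-one smoothing are my own toy choices for illustration, not the papers' actual setup:

```python
from collections import Counter, defaultdict
import math

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, padding word boundaries."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NGramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams, add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # lang -> ngram -> count
        self.totals = Counter()             # lang -> total ngrams seen

    def train(self, lang, texts):
        for text in texts:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)

    def classify(self, text):
        vocab = {g for c in self.counts.values() for g in c}
        scores = {}
        for lang, c in self.counts.items():
            score = 0.0
            for g in char_ngrams(text, self.n):
                # Smoothed log-probability of this n-gram given the language.
                score += math.log((c[g] + 1) / (self.totals[lang] + len(vocab)))
            scores[lang] = score
        return max(scores, key=scores.get)

clf = NGramNaiveBayes(n=3)
clf.train("en", ["the quick brown fox", "search results were good"])
clf.train("fr", ["le renard brun rapide", "les résultats de recherche"])
print(clf.classify("the results"))  # -> en
```

In a real test we'd train on much larger corpora per language; the point is just that the whole mechanism is a few dozen lines, which bears on the "is it easy to implement" question below.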
But, yeah, everything comes down to whether it's fast and easy to
implement, and how it performs on real data.
This raises another question as we add more fall-back methods to decrease
the zero-result rate: how will we prioritize them? I mean, if I can
re-run a "Did you mean" query, and I know that running the original query
against another wiki has a good chance of giving results, which one
should I try first?
Yep, that was my point (G):
(G) Another interesting question: if we end up implementing several
options for improving search results, we will have to figure out how to
stage them and in what order to try/test for them.
I think it's worth running [the cross-wiki] test regularly and seeing how
results change.
I agree.
(B) Make multilingual results configurable—If we know, say, the top four
wikis likely to give good results for queries from the English wiki are
Spanish, French, German, and Japanese, we could have an expanding section
(excuse any UI ugliness—someone with UI smarts can help us figure out how
to make it pretty, right?) to enable multilingual searching, so on
English Wikipedia I could ask for "backup results" in Spanish and French,
but not German and Japanese. Store those settings in a cookie for later,
too, possibly with some UI indicator that multilingual backup results are
enabled. (Also, if the cookie is available at query time, we could skip
cross-wiki searches the user couldn't possibly use.)
Maybe there are sensible defaults per language?
I think we can look for defaults per language in terms of where it makes
sense to look based on the fact that we're likely to find something. No
point looking in language X—even if the user can read it—if we never find
anything in X.
But which languages make the most sense to search really depends on the
user, doesn't it? At least for ranking. I'd much rather have a mediocre result in a
language I can read than a perfect result in a language I can't read. We
could limit it by where we think we'll find something based on our tests,
but the user should be able to further limit results based on whether they
can use them.
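The way these two constraints could combine might look something like this sketch—the wiki codes, default order, and cookie-derived set are all hypothetical: the site supplies a fallback order based on where our tests found results, and the user's saved preferences prune it further.

```python
# Hypothetical per-wiki defaults: fallback wikis ordered by how often
# our cross-wiki tests found results for that home wiki's zero-result
# queries. Real data would come from the tests discussed above.
DEFAULT_FALLBACKS = {"en": ["es", "fr", "de", "ja"]}

def fallback_wikis(home_wiki, user_languages=None):
    """Order candidate wikis by expected usefulness, then drop any the
    user has said they can't read (e.g. from a preferences cookie)."""
    candidates = DEFAULT_FALLBACKS.get(home_wiki, [])
    if user_languages is None:
        return candidates  # no preference known: fall back to site defaults
    return [w for w in candidates if w in user_languages]

print(fallback_wikis("en", {"es", "fr"}))  # -> ['es', 'fr']
```

The filtering never *adds* a wiki the defaults excluded—no point searching where we never find anything—but the user can always narrow the list to languages they can actually use.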
(D) Another sneakier idea that came to mind—which may not be technically
plausible—would be to find good results in another language and then
check for links back to articles in the wiki the search came from.
I don't know if it's technically plausible, but AFAIK we have the
Wikibase ID in the index, so it should be pretty simple to extract it.
Interwiki links are stored in Wikidata; could we use WDQS for that
purpose? With the entity ID it should be easy to request the interwiki
link for a specific language. Is WDQS designed for this usage (a high
number of queries per second on rather simple queries)?
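For the record, the lookup itself is simple SPARQL: sitelinks are exposed via schema:about/schema:isPartOf, and WDQS predefines the wd: and schema: prefixes. A sketch of the query builder (the entity ID and language here are just examples):

```python
def sitelink_query(entity_id, lang):
    """Build a SPARQL query asking for the article on <lang>.wikipedia
    that is linked to the given Wikidata entity (e.g. "Q42")."""
    return f"""
SELECT ?article WHERE {{
  ?article schema:about wd:{entity_id} ;
           schema:isPartOf <https://{lang}.wikipedia.org/> .
}}
""".strip()

print(sitelink_query("Q42", "fr"))
```

That string would go to the https://query.wikidata.org/sparql endpoint; whether the service is sized for doing this at query-time rates is exactly the open question.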
I also thought of WDQS for this. We should ask Stas.
=Misspellings=
Reducing the prefix length to 1 character can hurt performance, and it's
certainly a good idea to do this in 2 passes as Erik suggested.
Yeah, I worry about performance with prefix=1—but we can test it at small
scale and see what it costs and how much it helps.
While working on prefixes I tried to analyze data from a simple wiki dump
and extracted the distribution of term frequency by prefix length. I
haven't managed to make good use of the data yet, but I'm sure you will :)
I described a way to analyze the content we have in the index here:
https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
It's still on a very small dataset, but if you find it useful maybe we
could try it on a larger one?
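As a small illustration of the kind of analysis described above (not the actual script from the wikitech page), here's a sketch that bins term frequencies by prefix over a toy index, to see how quickly longer prefixes narrow the candidate set:

```python
from collections import Counter

def prefix_stats(term_freqs, max_len=4):
    """For each prefix length 1..max_len, sum the term frequency behind
    each distinct prefix; fewer terms per prefix = cheaper prefix query."""
    stats = {}
    for n in range(1, max_len + 1):
        buckets = Counter()
        for term, freq in term_freqs.items():
            if len(term) >= n:
                buckets[term[:n]] += freq
        stats[n] = buckets
    return stats

# Toy index: term -> frequency
terms = {"search": 50, "seat": 10, "sea": 30, "tree": 5}
stats = prefix_stats(terms)
print(stats[1])  # frequency mass behind each 1-char prefix
```

On a real dump, the interesting number is how skewed stats[1] is versus stats[2]: if a handful of 1-char prefixes cover most of the mass, that's where prefix=1 would hurt.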
I will take a look!
(Two caveats: I don't really have superpowers, so maybe there's not much
there. I'll add it to my stack, which is getting bigger every day.)
My point here is (in the long term): maybe it's difficult to build good
suggestions from the data directly, so why not build a custom
dictionary/index to handle "Did you mean" suggestions? According to
https://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s they learn from
search queries to build these suggestions. Is this something worth trying?
Definitely worth trying. Erik's got a patch in for adding a session id (
https://gerrit.wikimedia.org/r/#/c/226466/ ). In addition to identifying
prefix searches that come right before the user finishes typing their full
text query, this would be good for looking for zero-results (or even
low-results) queries followed by a similarly-spelled successful query from
the same search session. "saerch" followed by "search" gives us hints
that the latter is a good suggestion for the former—especially if it
happens a lot.
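A sketch of what mining those sessions might look like, assuming we had ordered (query, result count) pairs per session. The similarity measure (stdlib difflib) and the 0.8 threshold are placeholders, not a worked-out design:

```python
import difflib

def suggestion_pairs(session_queries, threshold=0.8):
    """From an ordered list of (query, result_count) in one search
    session, pair each zero-result query with the next similarly spelled
    query that did return results: evidence that the latter is a good
    "did you mean" suggestion for the former."""
    pairs = []
    for i, (query, hits) in enumerate(session_queries):
        if hits > 0:
            continue  # only zero-result queries need a suggestion
        for later, later_hits in session_queries[i + 1:]:
            similar = difflib.SequenceMatcher(None, query, later).ratio()
            if later_hits > 0 and similar >= threshold:
                pairs.append((query, later))
                break
    return pairs

session = [("saerch", 0), ("search", 120), ("banana", 7)]
print(suggestion_pairs(session))  # -> [('saerch', 'search')]
```

Aggregating these pairs across many sessions—and weighting by how often each pair recurs—is where the "especially if it happens a lot" part comes in.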
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation