Re: [Wikimedia-search] testing the value of a reverse index

30 Jul 2015

On 30/07/2015 16:50, Trey Jones wrote:
...
 Thanks for all the technical details! So much going on... so much to
 learn!

 I didn't know/remember that suggester only works on titles and 
 redirects. Then, obviously, using just that would be great! That's 
 gotta be a 98%+ reduction in text. 
Yes, and Erik suggested that we could try to inject more content.
This option already exists and we could turn it on, but I suspect it was 
disabled in the WMF config for good reasons.

...

 I like your reasonable process—it's quite reasonable!

 You asked about which wikis to look at. Are en, fr, de, it and es the 
 ones we can best read? (I'm okay with that to start, by the way—it 
 optimizes developer time.) By number of zero-result queries from my 
 500K sample, the top five are en, de, pt, ja, and ru—though that 
 sample is small. By overall size, it's en, sv, de, nl, and fr. Clearly 
 enwiki dominates, and I'm guessing the performance will differ across 
 languages—so I don't have a clear suggestion here. But enwiki makes 
 sense because it's the biggest on every front, and itwiki, because it 
 does the most interesting crosswiki stuff.

 Hmm. Is enwiki big enough to drag everything else along if it's very 
 beneficial there? 
If we have a process that works for enwiki, it'd be "easy" to repeat it 
on other wikis. I'd say we could start with enwiki.

...

     We have some technical restrictions here: if we activate this
     setting on one wiki, we'll need to reindex most of the wikis
     because we have cross-wiki searches.

     wikiA can query wikiB's index; if wikiB's index is not updated with
     the correct settings, the query will fail.

 ...

     So it's hard to work with mixed settings with the current
     architecture :(

 I'm a bit confused. Will elasticsearch do really bad things if you ask 
 it to search in a way that isn't enabled on a particular index? Does 
 fail mean zero results, or does it waste lots of CPU and start 
 throwing errors? Is there a reasonable way to assess what features a 
 query needs and whether a given index supports those features? Sounds 
 terribly ugly, but I had to ask. 
"Fails" means a big red message displayed to the user :)
Elasticsearch can run a single query over multiple indexes. If you ask 
for a suggest field that's missing in one of the indexes you requested, 
the whole query will fail.
Today we have a config per wiki, not a config per index. Having a 
config per index would imply a big refactoring, and we would have to drop 
this convenient "multi-index" feature.

...

     Note that we will not be able to measure things like:
     search is a better suggestion than samech for the query saerch.

     This seems impossible to check without human review. We could do
     another run with queries where a suggestion was found and generate
     a diff that will be reviewed by hand: 

     user_query: saerch
     prod_suggestion: samech
     with_reverse: search

 Are you thinking of manual review of the suggestions, or of a diff of 
 the results of the suggestions? I'm assuming just looking at the 
 terms—I feel that a fluent speaker could easily tell that search is 
 better than samech just by looking at the words. (So I could help 
 review in English, at least.) 
Yes, the idea was to extract only the suggestions that differ from the 
ones we have in the search logs.
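
A minimal sketch of that extraction step, assuming each row is a
(user_query, prod_suggestion, with_reverse) triple already pulled from the
search logs and from a rerun with the reverse field enabled. The names
SuggestionDiff, differing_suggestions and format_diff are hypothetical, not
part of CirrusSearch:

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class SuggestionDiff(NamedTuple):
    """One query where the two suggesters disagree."""
    user_query: str
    prod_suggestion: str
    with_reverse: str


def differing_suggestions(
    rows: Iterable[Tuple[str, str, str]]
) -> Iterator[SuggestionDiff]:
    """Yield only the rows where production and the rerun disagree."""
    for user_query, prod_suggestion, with_reverse in rows:
        if prod_suggestion != with_reverse:
            yield SuggestionDiff(user_query, prod_suggestion, with_reverse)


def format_diff(diff: SuggestionDiff) -> str:
    """Render one entry in the 'field: value' layout used for hand review."""
    return "\n".join(f"{name}: {value}" for name, value in diff._asdict().items())


if __name__ == "__main__":
    # Illustrative rows only; the second one is identical in both runs
    # and is therefore dropped from the review file.
    rows = [
        ("saerch", "samech", "search"),
        ("wikipeda", "wikipedia", "wikipedia"),
    ]
    for diff in differing_suggestions(rows):
        print(format_diff(diff))
        print()
```

Dropping identical suggestions up front should keep the hand-review file
limited to the cases where the reverse field actually changes something.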

...

 That said, there are two things I can think of that would make for at 
 least a weak heuristic: edit distance and frequency.

 Since there are only going to be a small number of suggestions in each 
 case, running full edit distance on them offline wouldn't be too 
 costly. There are many versions of edit distance you could use. With 
 plain dumb E.D., these are both distance 2, but with reversals 
 counting less than a full insert + delete, "search" is better than 
 "samech". You can also do more generic weighted edit distance to allow 
 typos (x is more likely for z than p for z) or likely spelling errors 
 (mixing up vowels or double vs single letters) to count less.
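
The comparison above can be sketched as follows. This is only an
illustration: it uses plain Levenshtein distance for the "plain dumb" case
and the optimal string alignment variant (adjacent transpositions allowed)
for the reversal-aware case. The swap_cost parameter is an assumption, set
to 1 so that a reversal costs less than an insert plus a delete; the
weighted typo costs mentioned above are not modelled.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def osa_distance(a: str, b: str, swap_cost: float = 1.0) -> float:
    """Optimal string alignment: like Levenshtein, plus adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + sub)
            # Two adjacent characters reversed, e.g. "ae" typed for "ea".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[-1][-1]


if __name__ == "__main__":
    for candidate in ("search", "samech"):
        print(candidate,
              levenshtein("saerch", candidate),
              osa_distance("saerch", candidate))
```

With these costs both candidates are at plain distance 2 from "saerch",
while only "search" drops to 1 once the reversal is counted as one edit.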

 As for frequency, you could look at overall term frequency or document 
 frequency in the index, or if that's too expensive, get a generic 
 frequency list for the language in question. "search" is clearly 
 better than "samech" by any frequency metric. 
With an index in lab I can extract the frequencies; you'll get 
something like:

search:1345
search engine:122
google search:32
google search engine:2

If unigrams suit you better, you will have to filter out the entries 
that contain a space.
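
A rough sketch of that filtering, assuming the dump really is one
"term:count" entry per line as in the sample above (the exact export format
is not settled here). The function names are hypothetical:

```python
from typing import Dict, Iterable


def parse_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'term:count' lines into a dict, skipping blank lines."""
    freqs: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last colon so a term containing ':' stays intact.
        term, _, count = line.rpartition(":")
        freqs[term] = int(count)
    return freqs


def unigrams_only(freqs: Dict[str, int]) -> Dict[str, int]:
    """Keep only single-word entries, i.e. terms without a space."""
    return {term: count for term, count in freqs.items() if " " not in term}


def more_frequent(a: str, b: str, freqs: Dict[str, int]) -> str:
    """Return the candidate with the higher count; unseen terms count as 0."""
    return a if freqs.get(a, 0) >= freqs.get(b, 0) else b


if __name__ == "__main__":
    sample = [
        "search:1345",
        "search engine:122",
        "google search:32",
        "google search engine:2",
    ]
    unigrams = unigrams_only(parse_frequencies(sample))
    print(unigrams)
    print(more_frequent("search", "samech", unigrams))
```

Treating a term missing from the list as frequency 0 is a simplification;
it is enough to prefer "search" over "samech" with the sample counts.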

...

 We could take hand-reviewed results (seems like it'd be quick work—I'd 
 do a pile from enwiki) as training data to fit a model that would 
 allow us to predict which suggestions are likely to be better.

 If/when we do roll it out to production, we could obviously further 
 test by giving multiple suggestions and seeing which ones users like. 
This is another very good idea :)
