Thanks for all the technical details! So much going on... so much to learn!
I didn't know/remember that the suggester only works on titles and redirects.
Then, obviously, using just that would be great! That's gotta be a 98%+
reduction in text.
I like your reasonable process—it's quite reasonable!
You asked about which wikis to look at. Are en, fr, de, it and es the ones
we can best read? (I'm okay with that to start, by the way—it optimizes
developer time.) By number of zero-result queries from my 500K sample, the
top five are en, de, pt, ja, and ru—though that sample is small. By overall
size, it's en, sv, de, nl, and fr. Clearly enwiki dominates, and I'm
guessing the performance will differ across languages—so I don't have a
clear suggestion here. But enwiki makes sense because it's the biggest on
every front, and itwiki, because it does the most interesting crosswiki
stuff.
Hmm. Is enwiki big enough to drag everything else along if it's very
beneficial there?
We have some technical restrictions here: if we activate this setting on
one wiki, we'll need to reindex most of the wikis because we have
cross-wiki searches.
wikiA can query wikiB's index; if wikiB's index is not updated with the
correct settings, the query will fail.
...
So it's hard to work with mixed settings with the current architecture :(
I'm a bit confused. Will elasticsearch do really bad things if you ask it
to search in a way that isn't enabled on a particular index? Does fail mean
zero results, or does it waste lots of CPU and start throwing errors? Is
there a reasonable way to assess what features a query needs and whether a
given index supports those features? Sounds terribly ugly, but I had to ask.
Note that we will not be able to measure things like:
search is better than samech for the query saerch.
This seems impossible to check without human review. We could do another
run with queries where a suggestion was found and generate a diff that will
be reviewed by hand:
user_query: saerch
prod_suggestion: samech
with_reverse: search
Are you thinking of manual review of the suggestions, or of a diff of the
results of the suggestions? I'm assuming just looking at the terms—I feel
that a fluent speaker could easily tell that search is better than samech
just by looking at the words. (So I could help review in English, at least.)
That said, there are two things I can think of that would make for at least
a weak heuristic: edit distance and frequency.
Since there are only going to be a small number of suggestions in each
case, running full edit distance on them offline wouldn't be too costly.
There are many versions of edit distance you could use. With plain dumb
E.D. (Levenshtein), these are both distance 2, but with reversals
(swapping two adjacent letters) counting less than a full insert + delete,
"search" is better than "samech". You can also do more generic weighted
edit distance to allow typos (x is more likely for z than p for z) or
likely spelling errors (mixing up vowels or double vs single letters) to
count less.
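
To make the reversal idea concrete, here is a minimal sketch. It assumes
the "optimal string alignment" variant of edit distance, which is just one
of several ways to count reversals. The function name and the
transposition_cost parameter are mine, purely for illustration; this is
not what the suggester itself does.

```python
def edit_distance(a, b, transposition_cost=None):
    """Edit distance between strings a and b.

    With transposition_cost=None this is plain Levenshtein distance
    (insert, delete, substitute, each costing 1). If transposition_cost
    is given, swapping two adjacent letters costs that much instead
    (the optimal string alignment variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = float(j)  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0.0 if a[i - 1] == b[j - 1] else 1.0
            best = min(
                d[i - 1][j] + 1.0,      # deletion
                d[i][j - 1] + 1.0,      # insertion
                d[i - 1][j - 1] + sub,  # match or substitution
            )
            # Adjacent letters swapped, e.g. "ae" vs "ea".
            if (transposition_cost is not None and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                best = min(best, d[i - 2][j - 2] + transposition_cost)
            d[i][j] = best
    return d[-1][-1]


for candidate in ("search", "samech"):
    plain = edit_distance("saerch", candidate)
    with_reversal = edit_distance("saerch", candidate, transposition_cost=1.0)
    print(candidate, plain, with_reversal)
# search 2.0 1.0
# samech 2.0 2.0
```

The weighted version for typos or vowel mix-ups would replace the fixed
substitution cost of 1 with a lookup in a per-letter-pair cost table, which
I've left out here.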
As for frequency, you could look at overall term frequency or document
frequency in the index, or if that's too expensive, get a generic frequency
list for the language in question. "search" is clearly better than "samech"
by any frequency metric.
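
The frequency heuristic is about as simple as it sounds. A rough sketch,
assuming we already have some word-to-count mapping (from the index or a
generic frequency list); rank_by_frequency and the counts below are made up
for illustration and are not real numbers from any wiki:

```python
def rank_by_frequency(candidates, freq):
    """Order candidate suggestions from most to least frequent.

    freq maps a term to its count; terms missing from it count as 0.
    """
    return sorted(candidates, key=lambda term: freq.get(term, 0), reverse=True)


# Placeholder counts, NOT real data: only the relative order matters here.
toy_freq = {"search": 1000, "samech": 1}

print(rank_by_frequency(["samech", "search"], toy_freq))
# ['search', 'samech']
```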
We could take hand-reviewed results (seems like it'd be quick work—I'd do a
pile from enwiki) as training data to fit a model that would allow us to
predict which suggestions are likely to be better.
If/when we do roll it out to production, we could obviously further test by
giving multiple suggestions and seeing which ones users like.
—Trey