On 29/07/2015 19:26, Trey Jones wrote:
(Thoughts are cloudy with a chance of brainstorming)
Hey guys I saw part of your discussion on IRC about testing whether reverse indexes help. I couldn’t reply there at the time, so I started thinking about it. This unfortunately long email is the result. (Sorry.)
No problem, I like reading your mails :)
While it would be good to know how the reverse index helps on a wiki of more manageable size like frwiki, I wouldn’t necessarily expect the patterns of typos to be the same between enwiki and frwiki (or any other language wiki)—language phonotactics & orthography, keyboard layout, mobile use, and user demographics could all have an effect on the type and frequency of typos. So a reverse index could generally be useful in one language and not in another—in theory it wouldn’t hurt to test specifically on any large wiki where the cost of adding the reverse index is non-trivial.
We have some technical restrictions here: if we activate this setting on one wiki we'll need to reindex most of the wikis, because we have cross-wiki searches. wikiA can query wikiB's index, and if wikiB's index is not updated with the correct settings the query will fail. The cross-wiki queries I know of so far are:
- all wikis can query the commons.wikimedia.org index
- itwiki will query all its sister projects (itwiktionary, itwikivoyage, itwikibooks ...)
- maybe more
So it's hard to work with mixed settings with the current architecture :(
I’m trying to think of ways to extrapolate from a sample of some sort. I’m spit-balling and thinking through as I type—I don’t know if any of these are good ideas, but maybe one will lead to a better idea.
Do we know what percentage of searches (in enwiki or in general) match article titles? We could extract article titles and search against those with and without a reverse index as a test.
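One way to get a quick number here is to check queries for exact title matches after light normalization. A minimal sketch of that idea, with invented titles and queries (not real enwiki data), and whitespace splitting standing in for the real analyzer:

```python
# Rough sketch: estimate what fraction of queries exactly match an
# article title after simple normalization. The titles and queries
# below are invented examples.

def normalize(s):
    """Lowercase and collapse whitespace; a crude stand-in for the analyzer."""
    return " ".join(s.lower().split())

def title_match_rate(queries, titles):
    """Return the fraction of queries whose normalized form is a title."""
    title_set = {normalize(t) for t in titles}
    hits = sum(1 for q in queries if normalize(q) in title_set)
    return hits / len(queries)

titles = ["Search engine", "Paris", "Apollo 11"]
queries = ["search engine", "paris", "saerch engine", "apollo  11"]
print(title_match_rate(queries, titles))  # 0.75 (3 of 4 match)
```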
Or, is it possible to get a reasonably sized random subset of enwiki, say 10-20%? If so, you could run a sample of non-zero queries against it and determine that, say, 47% of queries that get results on the full wiki also get results on this partial wiki… and then run the zero queries with a reverse index and extrapolate.
We can dump a subset of enwiki; the dump tool we use has a --limit param. Unfortunately I have absolutely no idea if the subset will be representative. There is likely a phenomenon similar to db dumps: old docs will be dumped first, and for Lucene old docs generally means docs that have never been updated, in other words pages that are not very interesting.
Hmm… if none of the relevant search elements rely on anything other than the presence of terms in a document, then you could make a “compact” version of enwiki, where each document keeps only one instance of each word in it. A quick hacky test on a handful of medium to longish documents gives compression of 30-50% per document, if that’s enough to matter. Of course, term frequency, proximity, and other things would be wildly skewed—but “is it in the index?” would work.
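The "compact document" idea above can be sketched in a few lines: keep only the first occurrence of each word and compare sizes. This uses naive whitespace splitting, not a real analyzer, so the ratios are only indicative:

```python
# Keep only the first occurrence of each word (case-insensitive),
# then compare lengths to estimate the compression.

def compact(text):
    seen = set()
    kept = []
    for word in text.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)

doc = "the cat sat on the mat and the cat slept"
small = compact(doc)
print(small)  # "the cat sat on mat and slept"
print(f"compression: {1 - len(small) / len(doc):.0%}")  # 30%
```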
It's a good idea but I don't know how to dump this info; there's no easy way to dump the index lexicon in production. Another (similar) idea would be to dump only the fields needed for the suggester to work.
The suggester works with title and redirect only; in theory we could dump only these fields, which would result in something like 200Mb gzip files for enwiki. Unfortunately I don't have this option in the dump script :( I think it's the best way to go, but:
- we need to change the dump tool to filter a selected set of fields
- I never tested this tool in production, so I don't know if it'll hurt perf. I guess it's OK because it's somewhat the same process as an in-place reindex.
Actually, if all you need is “is it in the index?” you could just dump a list of words in the index and run searches against that.
That's a bit trickier: we need to run the phrase suggester query, and it'd be hard to simulate its behaviour. Hopefully we can run this "phrase suggester" by hand with an elasticsearch request.
Okay… here’s an idea: tokenize the zero-result queries and search individual tokens against a list of terms indexed in enwiki, with and without a reverse index.
The suggester works with shingles (word grams of size 1, 2 and 3). Maybe it makes sense to run the queries against the word unigrams... but this will definitely be harder than running the elasticsearch suggest query.
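The shingles mentioned above (word n-grams of size 1 to 3) are easy to illustrate. A minimal sketch, using plain whitespace tokens rather than the real analyzer chain:

```python
# Generate word shingles of size 1-3, the unit the phrase suggester
# indexes, from a pre-tokenized title.

def shingles(tokens, min_size=1, max_size=3):
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

print(shingles("google search engine".split()))
# ['google', 'search', 'engine', 'google search',
#  'search engine', 'google search engine']
```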
None of these will give exact results, but various incarnations would give upper and lower bounds on the usefulness of the reverse index. For example, if only 0.05% of query tokens, in 0.07% of queries, are found only by the reverse index, it probably isn’t going to help. If 75% of them are, then it probably is.
Agreed.
To sum up, here is a reasonable process to check if the reverse field is worth a try:
- Add an option to filter a subset of fields to dumpIndex
- Extract a subset of full text searches that returned zero results and no suggestions (en, fr, de, it and es would be a good start?)
- Dump the title and redirect fields from these wikis
- Import this data into an elasticsearch instance with the reverse field activated (on labs?)
- Write a small script that runs phrase suggester queries
- Run the phrase suggester query and count
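For the "run phrase suggester queries by hand" step, the request body can be built standalone and POSTed to a test index. This is only a sketch: the field name "suggest", the suggester name "simple_phrase", and the index URL are assumptions, since the real field depends on the CirrusSearch mapping:

```python
import json

# Build a phrase-suggest request body to exercise the suggester by
# hand against a test index. Field and suggester names are invented.

def build_phrase_suggest(text, field="suggest", size=1):
    return {
        "simple_phrase": {
            "text": text,
            "phrase": {
                "field": field,
                "size": size,
            },
        }
    }

body = build_phrase_suggest("saerch engine")
print(json.dumps(body, indent=2))
# e.g. POST this to http://localhost:9200/enwiki_test/_suggest
```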
Note that we will not be able to measure things like: search is better than samech for the query saerch.
This seems impossible to check without human review. We could do another run with queries where a suggestion was found and generate a diff that will be reviewed by hand:
user_query: saerch
prod_suggestion: samech
with_reverse: search
Thanks for all the technical details! So much going on... so much to learn!
I didn't know/remember that suggester only works on titles and redirects. Then, obviously, using just that would be great! That's gotta be a 98%+ reduction in text.
I like your reasonable process—it's quite reasonable!
You asked about which wikis to look at. Are en, fr, de, it and es the ones we can best read? (I'm okay with that to start, by the way—it optimizes developer time.) By number of zero-result queries from my 500K sample, the top five are en, de, pt, ja, and ru—though that sample is small. By overall size, it's en, sv, de, nl, and fr. Clearly enwiki dominates, and I'm guessing the performance will differ across languages—so I don't have a clear suggestion here. But enwiki makes sense because it's the biggest on every front, and itwiki, because it does the most interesting crosswiki stuff.
Hmm. Is enwiki big enough to drag everything else along if it's very beneficial there?
We have some technical restrictions here: if we activate this setting on one wiki we'll need to reindex most of the wikis, because we have cross-wiki searches. wikiA can query wikiB's index, and if wikiB's index is not updated with the correct settings the query will fail.
...
So it's hard to work with mixed settings with the current architecture :(
I'm a bit confused. Will elasticsearch do really bad things if you ask it to search in a way that isn't enabled on a particular index? Does fail mean zero results, or does it waste lots of CPU and start throwing errors? Is there a reasonable way to assess what features a query needs and whether a given index supports those features? Sounds terribly ugly, but I had to ask.
Note that we will not be able to measure things like: search is better than samech for the query saerch.
This seems impossible to check without human review. We could do another run with queries where a suggestion was found and generate a diff that will be reviewed by hand:
user_query: saerch
prod_suggestion: samech
with_reverse: search
Are you thinking of manual review of the suggestions, or of a diff of the results of the suggestions? I'm assuming just looking at the terms—I feel that a fluent speaker could easily tell that search is better than samech just by looking at the words. (So I could help review in English, at least.)
That said, there are two things I can think of that would make for at least a weak heuristic: edit distance and frequency.
Since there are only going to be a small number of suggestions in each case, running full edit distance on them offline wouldn't be too costly. There are many versions of edit distance you could use. With plain dumb E.D., these are both distance 2, but with reversals counting less than a full insert + delete, "search" is better than "samech". You can also do a more generic weighted edit distance so that likely typos (x is more likely for z than p is for z) or likely spelling errors (mixing up vowels, or double vs. single letters) count less.
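The "reversals count less" idea can be sketched with the optimal string alignment variant of Damerau-Levenshtein, where an adjacent transposition costs a single edit instead of an insert + delete. Under that metric "search" is distance 1 from "saerch" while "samech" stays at 2:

```python
# Optimal string alignment distance: substitutions, insertions,
# deletions, and adjacent transpositions each cost 1.

def osa_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("saerch", "search"))  # 1 (one transposition)
print(osa_distance("saerch", "samech"))  # 2 (two substitutions)
```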
As for frequency, you could look at overall term frequency or document frequency in the index, or if that's too expensive, get a generic frequency list for the language in question. "search" is clearly better than "samech" by any frequency metric.
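The frequency heuristic is even simpler to sketch: among candidate suggestions, prefer the more frequent term. The counts below are invented, not taken from any real index:

```python
# Rank candidate suggestions by term frequency (counts are invented).

def rank_suggestions(candidates, freq):
    """Order candidates by descending frequency; unknown terms go last."""
    return sorted(candidates, key=lambda t: freq.get(t, 0), reverse=True)

term_freq = {"search": 134500, "samech": 12}  # hypothetical frequencies
print(rank_suggestions(["samech", "search"], term_freq))
# ['search', 'samech']
```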
We could take hand-reviewed results (seems like it'd be quick work—I'd do a pile from enwiki) as training data to fit a model that would allow us to predict which suggestions are likely to be better.
If/when we do roll it out to production, we could obviously further test by giving multiple suggestions and seeing which ones users like.
—Trey
On 30/07/2015 16:50, Trey Jones wrote:
Thanks for all the technical details! So much going on... so much to learn!
I didn't know/remember that suggester only works on titles and redirects. Then, obviously, using just that would be great! That's gotta be a 98%+ reduction in text.
Yes, and Erik suggested that we could try to inject more content. This option already exists and we could turn it on, but I suspect it was disabled in the wmf config for good reasons.
I like your reasonable process—it's quite reasonable!
You asked about which wikis to look at. Are en, fr, de, it and es the ones we can best read? (I'm okay with that to start, by the way—it optimizes developer time.) By number of zero-result queries from my 500K sample, the top five are en, de, pt, ja, and ru—though that sample is small. By overall size, it's en, sv, de, nl, and fr. Clearly enwiki dominates, and I'm guessing the performance will differ across languages—so I don't have a clear suggestion here. But enwiki makes sense because it's the biggest on every front, and itwiki, because it does the most interesting crosswiki stuff.
Hmm. Is enwiki big enough to drag everything else along if it's very beneficial there?
If we have a process that works for enwiki it'd be "easy" to reiterate over other wikis. I'd say we could start with enwiki.
We have some technical restrictions here: if we activate this setting on one wiki we'll need to reindex most of the wikis, because we have cross-wiki searches. wikiA can query wikiB's index, and if wikiB's index is not updated with the correct settings the query will fail.
...
So it's hard to work with mixed settings with the current architecture :(
I'm a bit confused. Will elasticsearch do really bad things if you ask it to search in a way that isn't enabled on a particular index? Does fail mean zero results, or does it waste lots of CPU and start throwing errors? Is there a reasonable way to assess what features a query needs and whether a given index supports those features? Sounds terribly ugly, but I had to ask.
"Fails" means a big red message displayed to the user :) Elasticsearch can run a single query over multiple indexes. In the case you ask for a suggest field that's missing in one of the index you requested the whole query will fail. Today we have a config per wiki and not a config per index, having a config per index would imply a big refactoring and we would have to drop this convenient "multi-index" feature.
Note that we will not be able to measure things like: search is better than samech for the query saerch. This seems impossible to check without human review. We could do another run with queries where a suggestion was found and generate a diff that will be reviewed by hand:
user_query: saerch
prod_suggestion: samech
with_reverse: search
Are you thinking of manual review of the suggestions, or of a diff of the results of the suggestions? I'm assuming just looking at the terms—I feel that a fluent speaker could easily tell that search is better than samech just by looking at the words. (So I could help review in English, at least.)
Yes, the idea was to extract only the suggestions that differ from the ones we have in the search logs.
That said, there are two things I can think of that would make for at least a weak heuristic: edit distance and frequency.
Since there are only going to be a small number of suggestions in each case, running full edit distance on them offline wouldn't be too costly. There are many versions of edit distance you could use. With plain dumb E.D., these are both distance 2, but with reversals counting less than a full insert + delete, "search" is better than "samech". You can also do a more generic weighted edit distance so that likely typos (x is more likely for z than p is for z) or likely spelling errors (mixing up vowels, or double vs. single letters) count less.
As for frequency, you could look at overall term frequency or document frequency in the index, or if that's too expensive, get a generic frequency list for the language in question. "search" is clearly better than "samech" by any frequency metric.
With an index in labs I can extract the frequencies; you'll have something like:
search:1345
search engine:122
google search:32
google search engine:2
You will have to filter on spaces to keep only the unigrams, if that's better for you.
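Filtering such a dump down to unigrams is a one-pass job: any shingle containing a space is a word bigram or trigram and gets dropped. A minimal sketch, assuming the "term:count" line format shown above:

```python
# Parse a shingle-frequency dump and keep only single-word terms.

def unigram_freqs(lines):
    freqs = {}
    for line in lines:
        term, _, count = line.rpartition(":")
        if term and " " not in term:
            freqs[term] = int(count)
    return freqs

dump = ["search:1345", "search engine:122",
        "google search:32", "google search engine:2"]
print(unigram_freqs(dump))  # {'search': 1345}
```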
We could take hand-reviewed results (seems like it'd be quick work—I'd do a pile from enwiki) as training data to fit a model that would allow us to predict which suggestions are likely to be better.
If/when we do roll it out to production, we could obviously further test by giving multiple suggestions and seeing which ones users like.
This is another very good idea :)
On Thu, Jul 30, 2015 at 6:31 AM, David Causse dcausse@wikimedia.org wrote:
The cross wiki queries I know so far are :
- all wikis can query commons.wikimedia.org index
- itwiki will query all its sister projects (itwiktionary, itwikivoyage,
itwikibooks ...)
- maybe more
So it's hard to work with mixed settings with the current architecture :(
That's it. For now ;-)
And technically itwiki is dogfooding the IW search, so if we had to turn it off temporarily to support a migration it'd be ok.
Generally we've tried to avoid doing this though by making mapping/analysis changes in a back-compat way or hiding the alternative behavior behind a feature flag until we've completed reindexing.
-Chad
On 30/07/2015 17:13, Chad Horohoe wrote:
Generally we've tried to avoid doing this though by making mapping/analysis changes in a back-compat way or hiding the alternative behavior behind a feature flag until we've completed reindexing.
Yes, I think that's the best way to go; trying to optimize something with mixed settings/mappings seems quite dangerous and extremely hard to test.
wikimedia-search@lists.wikimedia.org