New subject: [Wikimedia-search] testing the value of a reverse index

30 Jul 2015

On 29/07/2015 19:26, Trey Jones wrote:
...
  (Thoughts are cloudy with a chance of brainstorming)

 Hey guys I saw part of your discussion on IRC about testing whether 
 reverse indexes help. I couldn’t reply there at the time, so I started 
 thinking about it. This unfortunately long email is the result. (Sorry.) 
No problem, I like reading your mails :)

...

 While it would be good to know how the reverse index helps on a wiki 
 of more manageable size like frwiki, I wouldn’t necessarily expect the 
 patterns of typos to be the same between enwiki and frwiki (or any 
 other language wiki)—language phonotactics & orthography, keyboard 
 layout, mobile use, and user demographics could all have an effect on 
 the type and frequency of typos. So a reverse index could generally be 
 useful in one language and not in another—in theory it wouldn’t hurt 
 to test specifically on any large wiki where the cost of adding the 
 reverse index is non-trivial. 
We have some technical restrictions here: if we activate this setting 
on one wiki, we'll need to reindex most of the wikis because we have 
cross-wiki searches.
wikiA can query wikiB's index; if wikiB's index is not updated with the 
correct settings, the query will fail.
The cross-wiki queries I know of so far are:
- all wikis can query commons.wikimedia.org index
- itwiki will query all its sister projects (itwiktionary, itwikivoyage, 
itwikibooks ...)
- maybe more

So it's hard to work with mixed settings with the current architecture :(

...

 I’m trying to think of ways to extrapolate from a sample of some sort. 
 I’m spit-balling and thinking through as I type—I don’t know if any of 
 these are good ideas, but maybe one will lead to a better idea.

 Do we know what percentage of searches (in enwiki or in general) match 
 article titles? We could extract article titles and search against 
 those with and without a reverse index as a test.

 Or, is it possible to get a reasonably sized random subset of enwiki, 
 say 10-20%? If so, you could run a sample of non-zero queries against 
 it and determine that, say, 47% of queries that get results on the 
 full wiki also get results on this partial wiki… and then run the zero 
 queries with a reverse index and extrapolate. 
We can dump a subset of enwiki: the dump tool we use has a --limit 
param. Unfortunately I have absolutely no idea whether the subset will be 
representative. There is likely a phenomenon similar to db dumps: old 
docs will be dumped first, and for Lucene "old docs" generally means docs 
that have never been updated. In other words, they will be pages that are 
not very interesting.

...

 Hmm… if none of the relevant search elements rely on anything other 
 than the presence of terms in a document, then you could make a 
 “compact” version of enwiki, where each document keeps only one 
 instance of each word in it. A quick hacky test on a handful of medium 
 to longish documents gives compression of 30-50% per document, if 
 that’s enough to matter. Of course, term frequency, proximity, and 
 other things would be wildly skewed—but “is it in the index?” would work. 
It's a good idea, but I don't know how to dump this info: there's no easy 
way to dump the index lexicon in production.
Another (similar) idea would be to dump only the fields needed for the 
suggester to work.

The suggester works with title and redirect only, so in theory we could 
dump only these fields, which would result in something like 200 MB of 
gzipped files for enwiki. Unfortunately I don't have this option in the 
dump script :(
I think it's the best way to go, but:
- we need to change the dump tool to filter a selected set of fields
- I have never tested this tool in production, so I don't know if it'll 
hurt performance. I guess it's OK because it's roughly the same process 
as an in-place reindex.
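
A minimal sketch of the field filtering, assuming the dump can be read as 
one JSON document per line (the real dump format may differ, e.g. bulk 
action lines between documents). The function names are hypothetical and 
not part of the existing dump tool:

```python
import json

def filter_fields(doc, keep=("title", "redirect")):
    """Return a copy of doc containing only the fields listed in keep."""
    return {name: value for name, value in doc.items() if name in keep}

def filter_dump(lines, keep=("title", "redirect")):
    """Yield filtered JSON lines; assumes one JSON document per input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield json.dumps(filter_fields(json.loads(line), keep), ensure_ascii=False)

if __name__ == "__main__":
    sample = ['{"title": "Search", "text": "a long article body", "redirect": []}']
    for out in filter_dump(sample):
        print(out)
```

Doing the filtering at dump time rather than afterwards would avoid 
writing the large text fields at all, which is the point of the change.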

...

 Actually, if all you need is “is it in the index?” you could just dump 
 a list of words in the index and run searches against that. 
That's a bit trickier: we need to run the phrase suggester query, and it'd 
be hard to simulate its behaviour. Hopefully we can run this "phrase 
suggester" by hand with an elasticsearch request.
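
As a rough sketch, the request body could be built like this. The field 
names "suggest" and "suggest.reverse", the suggestion name "did_you_mean" 
and the analyzer name "reverse" are assumptions for illustration (a 
"reverse" analyzer has to be defined in the index settings); the settings 
actually used in production may differ:

```python
import json

def build_phrase_suggest_request(query, field="suggest", reverse_field=None):
    """Build a phrase suggester request body for the _search endpoint.

    When reverse_field is given, a second direct generator is added that
    reverses tokens before and after candidate generation.
    """
    generators = [{"field": field, "suggest_mode": "always"}]
    if reverse_field is not None:
        generators.append({
            "field": reverse_field,
            "suggest_mode": "always",
            "pre_filter": "reverse",   # assumed analyzer name
            "post_filter": "reverse",  # assumed analyzer name
        })
    return {
        "suggest": {
            "text": query,
            "did_you_mean": {
                "phrase": {
                    "field": field,
                    "size": 1,
                    "direct_generator": generators,
                }
            },
        }
    }

if __name__ == "__main__":
    body = build_phrase_suggest_request("saerch", reverse_field="suggest.reverse")
    print(json.dumps(body, indent=2))
```

The same function gives both variants (with and without the reverse 
generator), so the two runs differ only in the request body.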

...

 Okay… here’s an idea: tokenize the zero-result queries and search 
 individual tokens against a list of terms indexed in enwiki, with and 
 without a reverse index. 
The suggester works with shingles (word grams of size 1, 2 and 3). Maybe 
it makes sense to run the queries against the word unigrams... but this 
will definitely be harder than running the elasticsearch suggest query.
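
For illustration, here is a small sketch of what shingles of size 1 to 3 
look like. It splits on whitespace only, which is a simplification: the 
real analysis chain also lowercases and normalizes tokens.

```python
def shingles(text, min_size=1, max_size=3):
    """Return word n-grams of every size from min_size to max_size."""
    words = text.split()
    result = []
    for size in range(min_size, max_size + 1):
        # No n-grams of this size exist if the text has fewer words.
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result

if __name__ == "__main__":
    print(shingles("reverse index test"))
    # ['reverse', 'index', 'test', 'reverse index', 'index test', 'reverse index test']
```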

...

 None of these will give exact results, but various incarnations would 
 give upper and lower bounds on the usefulness of the reverse index. 
 For example, if only 0.05% of query tokens, in 0.07% of queries, are 
 found only by the reverse index, it probably isn’t going to help. If 
 75% of them are, then it probably is. 
Agreed.

To sum up, here is a reasonable process to check if the reverse field is 
worth a try:

- Add an option to filter a subset of fields to dumpIndex
- Extract a subset of full text searches that returned zero results and 
no suggestions (en, fr, de, it and es would be a good start?)
- Dump title and redirect fields from these wikis
- Import this data into an elasticsearch instance with the reverse field 
activated (on labs?)
- Write a small script that runs phrase suggester queries
- Run the phrase suggester queries and count how many return a suggestion
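
The counting step could look like the sketch below. The suggest argument 
is a hypothetical callable that sends the phrase suggester query to the 
test instance and returns the top suggestion or None; here it is replaced 
by a toy dictionary lookup so the example runs on its own:

```python
def count_rescued(queries, suggest):
    """Count queries for which the suggester now returns a suggestion.

    queries: searches that had zero results and no suggestion in production.
    suggest: callable returning a suggestion string, or None if there is none.
    Returns (rescued, total, ratio).
    """
    rescued = sum(1 for query in queries if suggest(query) is not None)
    total = len(queries)
    ratio = rescued / total if total else 0.0
    return rescued, total, ratio

if __name__ == "__main__":
    # Toy stand-in for the real suggester call (made-up data).
    fake_index = {"saerch": "search"}
    rescued, total, ratio = count_rescued(["saerch", "qzxv"], fake_index.get)
    print(f"{rescued}/{total} queries got a suggestion ({ratio:.0%})")
```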

Note that we will not be able to measure things like:
"search" is a better suggestion than "samech" for the query "saerch".

This seems impossible to check without human review. We could do another 
run with queries where a suggestion was found and generate a diff that 
will be reviewed by hand:

user_query: saerch
prod_suggestion: samech
with_reverse: search
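
A sketch of how such a diff could be generated. Both suggester arguments 
are hypothetical callables (one per configuration) that return the top 
suggestion or None; toy dictionaries stand in for them here:

```python
def review_entries(queries, prod_suggest, reverse_suggest):
    """Return one text entry per query whose two suggestions differ."""
    entries = []
    for query in queries:
        prod = prod_suggest(query)
        with_reverse = reverse_suggest(query)
        if prod != with_reverse:  # identical suggestions need no review
            entries.append(
                f"user_query: {query}\n"
                f"prod_suggestion: {prod}\n"
                f"with_reverse: {with_reverse}"
            )
    return entries

if __name__ == "__main__":
    # Made-up data standing in for the two suggester configurations.
    prod = {"saerch": "samech", "wikipeda": "wikipedia"}
    reverse = {"saerch": "search", "wikipeda": "wikipedia"}
    print("\n\n".join(review_entries(["saerch", "wikipeda"], prod.get, reverse.get)))
```

Only the queries where the two configurations disagree are listed, which 
keeps the amount of manual review down.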
