On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones <tjones@wikimedia.org> wrote:
> There are several proposals for improving language detection in the etherpad, and we can work on them in parallel.
My worry here is that we would then need to productionize it. Several of the options I see are basically libraries that we would have to build a service (or ES plugin) around. I do think we should investigate this and decide whether the effort to productionize is worth the impact we can estimate in the relevance lab.
> We need training and evaluation data.
This is probably the biggest sticking point. Another random idea: we have speakers of several languages on the team and in the foundation (as in, under NDA and able to review queries that are PII). Would it be enough to grab example queries from wikis in the relevant language and have someone who knows the language filter through them, deleting nonsensical or wrong-language queries? I'm guessing this would go faster, but I'm not sure it's as valuable.
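As a rough sketch of that workflow (file names, sample size, and the one-query-per-line input format are all hypothetical, not our actual log layout), the per-wiki sampling plus a review sheet for the fluent speaker could be as simple as:

import csv
import random

def sample_queries_for_review(in_path, out_path, n=500, seed=42):
    """Draw a random sample of queries from a one-query-per-line file and
    write a TSV review sheet with an empty 'keep' column for a fluent
    speaker to mark y/n."""
    with open(in_path, encoding="utf-8") as f:
        queries = [line.strip() for line in f if line.strip()]
    random.seed(seed)
    sample = random.sample(queries, min(n, len(queries)))
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["query", "keep"])  # reviewer fills in keep = y/n
        for q in sample:
            writer.writerow([q, ""])

# Hypothetical usage, one file per source wiki:
# sample_queries_for_review("frwiki_queries.txt", "frwiki_review.tsv")

The surviving rows would then be the positive examples for that language, which keeps the reviewer's job to a quick keep/delete pass rather than labeling from scratch.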
> I'm somewhat worried about being able to reduce the targeted zero results rate by 10%. In my test, only 12% of non-DOI zero-results queries were "in a language", and only about a third got results when searched in the "correct" (human-determined) wiki. I didn't filter bots other than the DOI bot, and some non-language queries (e.g., names) might get results in another wiki, but there may not be enough wiggle room. There's a lot of junk in other languages, too, but maybe filtering bots will help more than I dare presume.
I'm also worried about that portion, but perhaps a nuanced reading could help us? If a 10% increase in satisfaction means 15% -> 16.5%, then a 10% reduction in ZRR means 30% -> 27%. We don't yet have the non-automata numbers, so it's hard to say exactly where we stand, but we finally have the data in Hadoop, which should make it possible to break out the non-automata (non-bot) portion.
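To spell out both bits of arithmetic (back-of-the-envelope only, using the figures quoted above):

# Relative (not absolute-point) changes, per the reading above.
satisfaction, zrr = 0.15, 0.30
print(f"10% relative increase in satisfaction: {satisfaction:.1%} -> {satisfaction * 1.10:.1%}")  # 15.0% -> 16.5%
print(f"10% relative reduction in ZRR:         {zrr:.1%} -> {zrr * 0.90:.1%}")                    # 30.0% -> 27.0%

# Headroom from the numbers in the quote: share of zero-results queries that
# are "in a language" times the share of those that get results on the right wiki.
in_a_language = 0.12
rescuable = 1 / 3
print(f"max relative ZRR reduction from language routing: {in_a_language * rescuable:.1%}")  # ~4%

So on those numbers language routing alone tops out around a 4% relative reduction, which is why how much the denominator shrinks once bots are filtered matters so much.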