Now that we have the feature deployed (behind a feature flag), and have an initial "does it do anything?" test going out today, along with an upcoming integration with our satisfaction metrics, we need to come up with how we will try to move the needle further.
For reference these are our Q2 goals:
- Run A/B test for a feature that:
  - Uses a library to detect the language of a user's search query.
  - Adjusts results to match that language.
- Determine from A/B test results whether this feature is fit to push to production, with the aim to:
  - Improve search user satisfaction by 10% (from 15% to 16.5%).
  - Reduce zero results rate for non-automata search queries by 10%.
We brainstormed a number of possibilities here:
https://etherpad.wikimedia.org/p/LanguageSupportBrainstorming
We now need to decide which of these ideas we should prioritize. We might want to take into consideration which of these can be pre-tested with our relevancy lab work, so that we can prefer the things we think will move the needle the most. I'm really not sure which of these to push forward on, so let us know which you think could have the most impact, or where the expected impact could be measured in the relevancy lab with minimal work.
Define this "does it do anything?" test?
It measures the zero results rate for 1 in 10 search requests via the CirrusSearchUserTesting log that we used last quarter.
On Mon, Nov 2, 2015 at 6:01 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Define this "does it do anything?" test?
Sorry I didn't respond to this sooner!
I really like the idea of trying to detect what languages the user can read, and searching in (a subset of) those. This wouldn't benefit from relevance lab testing, though. It'll need to be measured against the user satisfaction metric. (BTW, Do we have a sense of how many users have info we can detect for this?)
I think the biggest problem with language detection is the quality of the language detector. The Elastic Search plugin we tested has a Romanian fetish when run on our queries (Erik got about 38% Romanian on 100K enwiki searches, which is crazy, and I got 0% accuracy for Romanian on my much smaller tagged corpus of failed (zero results) queries to enwiki). Most of the time, I would expect queries sent to the wrong wiki to fail (though there are some exceptions)—but a query in English that does get hits in rowiki is going to just look wrong most of the time.
There are several proposals for improving language detection in the etherpad, and we can work on them in parallel, since any given one could be better than any other one. (We don't want to make 100 of them, but a few to test and compare would be nice—there may also be reasonable speed/accuracy tradeoffs to be made, e.g., 2% decrease in accuracy for 2x speed is a good deal.)
We need training and evaluation data. I see a few ways of getting it. The easy, lower-quality way is to just take queries from a given wiki and assume they are in the language in question (i.e., eswiki queries are in Spanish): easy, not 100% accurate, unlimited supply. The hard, higher-quality way is to hand-annotate a corpus of queries. This is slow, but doable. I can do on the order of 1000 queries in a day (more if I were less accurate and more willing to toss stuff into the junk pile), but I couldn't do it for a week straight without going crazy. A possible middle-of-the-road approach would be to create a feedback loop: run detectors on our training data, then review and remove items that are not in the desired language (we could also start by filtering out things that are not in the right character set, like removing all Arabic, Cyrillic, and Chinese from enwiki, frwiki, and eswiki queries). If we want thousands of hand-annotated queries, we need to get annotating!
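To make the character-set filtering idea concrete, here's a rough sketch in Python (the script lists and the wiki mapping below are placeholders I made up, not a vetted set) of dropping training queries whose characters belong to scripts we wouldn't expect on a given wiki:

import unicodedata

# Scripts we would not expect in queries used as training data for these
# wikis; purely illustrative, not an exhaustive or agreed-upon list.
UNEXPECTED_SCRIPTS = {
    "enwiki": ("CYRILLIC", "ARABIC", "CJK", "HANGUL", "HIRAGANA", "KATAKANA"),
    "frwiki": ("CYRILLIC", "ARABIC", "CJK", "HANGUL", "HIRAGANA", "KATAKANA"),
    "eswiki": ("CYRILLIC", "ARABIC", "CJK", "HANGUL", "HIRAGANA", "KATAKANA"),
}

def has_unexpected_script(query, wiki):
    """True if any character's Unicode name mentions a script we don't expect."""
    unexpected = UNEXPECTED_SCRIPTS.get(wiki, ())
    for ch in query:
        name = unicodedata.name(ch, "")
        if any(script in name for script in unexpected):
            return True
    return False

def filter_training_queries(queries, wiki):
    """Keep only queries that could plausibly be in the wiki's own language."""
    return [q for q in queries if not has_unexpected_script(q, wiki)]

# filter_training_queries(["unix time stamp", "Аркадий Гайдар", "数学"], "enwiki")
# -> ["unix time stamp"]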
I think we can use the relevance lab to help evaluate a language detector (at least with respect to the zero results rate). We could run the detector against a pile of zero-results queries, group the queries by detected language, and run them against the relevant wiki (if we have room in labs for the indexes, and if we update the relevance lab tools to support choosing a target wiki to search). We wouldn't be comparing "before" and "after", just measuring the zero results rate against the target wiki. As with any use of the zero-results rate, there's no guarantee that we'd be giving good results, just results (e.g., "unix time stamp" queries with English words fail on enwiki but sometimes work on zhwiki for some reason, and that's not really better).
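For illustration, the measurement itself could be a small script like this (a sketch against the public action API rather than the relevance lab; it assumes the zero-results queries have already been grouped by detected language, and the host name below is just an example):

import requests

def zero_results_rate(queries, target_wiki_host):
    """Fraction of queries that return zero full-text results on the target wiki."""
    api = "https://%s/w/api.php" % target_wiki_host
    zero = 0
    for q in queries:
        resp = requests.get(api, params={
            "action": "query",
            "list": "search",
            "srsearch": q,
            "srlimit": 1,
            "format": "json",
        }, timeout=10)
        # totalhits comes back in the searchinfo block of the standard search API.
        if resp.json()["query"]["searchinfo"]["totalhits"] == 0:
            zero += 1
    return zero / float(len(queries)) if queries else 0.0

# Hypothetical usage: queries the detector tagged as French, re-run against frwiki.
# print(zero_results_rate(["chanson de geste", "mise en abyme"], "fr.wikipedia.org"))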
I'm somewhat worried about being able to reduce the targeted zero results rate by 10%. In my test[1], only 12% of non-DOI zero-results queries were "in a language", and only about a third got results when searched in the "correct" (human-determined) wiki. I didn't filter bots other than the DOI bot, and some non-language queries (e.g., names) might get results in another wiki, but there may not be enough wiggle room. There's a lot of junk in other languages, too, but maybe filtering bots will help more than I dare presume.
[1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_S...
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Mon, Nov 2, 2015 at 9:03 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
It measures the zero results rate for 1 in 10 search requests via CirrusSearchUserTesting log that we used last quarter.
So do we think we should favor the "try to guess the user's language(s)" item over others that would benefit from the relevance lab? Are there steps we could/should take in advance, such as analyzing whatever user language data we have, or instrumenting to get more if we don't have enough?
Kevin Smith Agile Coach, Wikimedia Foundation
In terms of user language data, the webrequests table in Hive has the Accept-Language header and geolocation information. It also contains the query strings, so we can extract the exact search terms and feed them into the relevancy lab.
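For illustration, the extraction could look roughly like this (a sketch only; the table and column names, wmf.webrequest, uri_query, accept_language, geocoded_data, are from memory and would need checking against the real schema before anyone runs it):

from urllib.parse import parse_qs
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("search-language-signals").getOrCreate()

# Pull raw request query strings plus the language signals for one day of traffic.
rows = spark.sql("""
    SELECT uri_query, accept_language, geocoded_data['country_code'] AS country_code
    FROM wmf.webrequest
    WHERE year = 2015 AND month = 11 AND day = 3
      AND uri_query LIKE '%search=%'
""")

def extract_search_term(uri_query):
    # Pull the search= parameter out of the raw query string.
    params = parse_qs((uri_query or "").lstrip("?"))
    return params.get("search", [None])[0]

extract_udf = udf(extract_search_term, StringType())

# A 1% sample is plenty to feed into relevancy lab experiments.
(rows.withColumn("search_term", extract_udf(rows.uri_query))
     .select("search_term", "accept_language", "country_code")
     .sample(False, 0.01)
     .write.csv("/tmp/search_language_signals", header=True))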
Taking the above into consideration and reviewing what we have in the brainstorming session, the set of ideas seems to be the following:
1. Do language detection on more than just zero-result queries, e.g. queries that only return 1 or 2 results
- Seems useful and doable, but will only affect satisfaction and not the zero result rate. Still possibly worthwhile.
- This should be relatively easy to test with relevancy lab
2. Determine the language to search in via something other than language detection (headers, geolocation, etc)
- Working up a couple heuristics wouldn't be too hard (see the sketch after this list). The webrequests table in hive has the accept language header and geolocation info as well as the query string, so we could extract a set of queries to test with
3. Integrate wikidata search
- This looks to be https://en.wikipedia.org/wiki/MediaWiki:Wdsearch.js
- We could integrate that more directly; it can't be tested by relevancy lab. It is basically just an additional set of results below the existing results.
- Would need a significant cleanup to pass code review, but it's not particularly hard to do
4. Translate the query from the provided language into the language of the wiki being searched on
- This seems "very hard". Not only do we have to correctly detect the language the user input, but then we have to translate that into a second language
- The CX service might be able to provide us a translation endpoint that works with whatever they are currently using, but will likely have high latency. Our inability (currently) to do async requests in PHP makes it harder to hide that latency.
5. Build an index that contains the titles from all wikis, but not much else. This could be used to suggest the user search on other wikis (or to inform the code that does actual searches on other wikis)
- This could be somewhat tested in relevancy lab, but first we would have to build something to actually combine all the titles into the same index.
I think any of the top three could be worked on; the first and the second can be validated through relevancy lab. The third takes a completely different approach and is not easily testable outside of production, but may be useful. The fourth is "very hard" and I think we should leave it alone for now. The fifth and final idea was only put forth once, but is interesting. I'm not sure how valuable it would be, though.
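To make the item 2 heuristics a bit more concrete, here is a minimal sketch in Python (the precedence rules and the country-to-language table are strawman assumptions, not anything we've agreed on) of turning the Accept-Language header plus geolocation into a ranked list of candidate languages to search:

def parse_accept_language(header):
    """Return language codes from an Accept-Language header, highest q-value first.

    e.g. "fr-CH, fr;q=0.9, en;q=0.8" -> ["fr", "en"]
    """
    weighted = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        if ";q=" in piece:
            lang, q = piece.split(";q=", 1)
            try:
                weight = float(q)
            except ValueError:
                weight = 0.0
        else:
            lang, weight = piece, 1.0
        code = lang.strip().split("-")[0].lower()
        if code and code != "*":
            weighted.append((weight, code))
    seen, ordered = set(), []
    for _, code in sorted(weighted, key=lambda x: -x[0]):
        if code not in seen:
            seen.add(code)
            ordered.append(code)
    return ordered

# Placeholder lookup; a real table would have to be built from data.
COUNTRY_TO_LANG = {"FR": "fr", "DE": "de", "RU": "ru", "JP": "ja"}

def candidate_languages(accept_language, country_code=None, current_wiki_lang="en"):
    """Strawman heuristic: header languages first, then a geolocation guess,
    skipping the language of the wiki the user already searched."""
    ranked = parse_accept_language(accept_language or "")
    geo = COUNTRY_TO_LANG.get((country_code or "").upper())
    if geo and geo not in ranked:
        ranked.append(geo)
    return [lang for lang in ranked if lang != current_wiki_lang]

# candidate_languages("fr-CH, fr;q=0.9, en;q=0.8, *;q=0.5", "FR") -> ["fr"]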
I think the final idea would be very similar to wikidata search.
Concerning wikidata search: could we reuse the code for language search and trigger the search on the backend? The JS code will send a query to wikidata.org (using action=query) and will generate a new search request on wikidata. Analyzing these logs will be hard because we won't be able to associate the original query with the query sent to wikidata.
By importing wikidata into the relevancy lab we could get a rough idea of the impact on ZRR.
Hi!
Concerning wikidata search: could we reuse the code for language search and trigger the search on the backend?
We probably could, though I'm not sure if a little hack we did would work with Wikidata - Wikidata overrides a lot of stuff that regular wikis do not, and our config hack may not be enough. Needs checking.
For the shorter term, doing it client-side, as Wdsearch does, may be faster.
Also, what about performance considerations? Right now I imagine traffic for wikidata search and other APIs is less than what we get for en.wikipedia, but if we make searches go to both, will it be OK?
The JS code will send a query to wikidata.org (using action=query) and will generate a new search request on wikidata. Analyzing these logs will be hard because we won't be able to associate the original query and the query sent to wikidata.
Can't we mark them somehow, like with prov= or something similar? Maybe that mark plus temporal proximity plus a query term match would be enough to group them? Not sure how hard it is in practice; analytics people, please correct me.
I agree that the top 3 are straightforward and seem likely to be beneficial. Translation is definitely very hard, and finding license-compatible libraries that are effective enough and cover enough languages seems daunting. And speed might also be a big concern. Given the scope of non-English queries on enwiki, I don't think it's worth it right now.
The last one is new to me. I kinda like it in theory, though I'll have to mull it over for a while. If building the combined index is prohibitive, could we test it by running intitle: queries (or some such) against the top N likely useful indexes and seeing what hit rate we get?
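Something along these lines might be enough for a first probe (a sketch; the wiki list is a placeholder, and rate limiting and intitle: behavior per language would need checking):

import requests

# Hypothetical "top N likely useful" wikis; the real list would come from data.
TOP_WIKIS = ["en.wikipedia.org", "de.wikipedia.org", "fr.wikipedia.org",
             "es.wikipedia.org", "ru.wikipedia.org", "ja.wikipedia.org"]

def title_hit_rate(queries, wikis=TOP_WIKIS):
    """Fraction of queries matching at least one title on any of the wikis.

    Approximates "would a combined all-titles index have had something" by
    running intitle: searches against each wiki's existing index.
    """
    hits = 0
    for q in queries:
        for host in wikis:
            resp = requests.get("https://%s/w/api.php" % host, params={
                "action": "query",
                "list": "search",
                "srsearch": "intitle:%s" % q,
                "srlimit": 1,
                "format": "json",
            }, timeout=10)
            if resp.json()["query"]["searchinfo"]["totalhits"] > 0:
                hits += 1
                break
    return hits / float(len(queries)) if queries else 0.0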
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Wed, Nov 4, 2015 at 9:07 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
Integrate wikidata search
This looks to be https://en.wikipedia.org/wiki/MediaWiki:Wdsearch.js
We could integrate that more directly, can't be tested by relevancy lab. It is basically just an additional set of results below the existing results.
Would need a significant cleanup to pass code review, but it's not particularly hard to do
I don't know what exactly you want to do but please keep in mind the article placeholder we are working on. I showed this to Dan, Stas and Wes already.
Cheers Lydia
Replies inline
On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones tjones@wikimedia.org wrote:
Sorry I didn't respond to this sooner!
I really like the idea of trying to detect what languages the user can read, and searching in (a subset of) those. This wouldn't benefit from relevance lab testing, though. It'll need to be measured against the user satisfaction metric. (BTW, Do we have a sense of how many users have info we can detect for this?)
I think the biggest problem with language detection is the quality of the language detector. The Elastic Search plugin we tested has a Romanian fetish when run on our queries (Erik got about 38% Romanian on 100K enwiki searches, which is crazy, and I got 0% accuracy for Romanian on my much smaller tagged corpus of failed (zero results) queries to enwiki). Most of the time, I would expect queries sent to the wrong wiki to fail (though there are some exceptions)—but a query in English that does get hits in rowiki is going to just look wrong most of the time.
There are several proposals for improving language detection in the etherpad, and we can work on them in parallel, since any given one could be better than any other one. (We don't want to make 100 of them, but a few to test and compare would be nice—there may also be reasonable speed/accuracy tradeoffs to be made, e.g., 2% decrease in accuracy for 2x speed is a good deal.)
My worry here is that we would then need to productionize it. Several of the options I see are basically libraries that we would have to build a service (or ES plugin) around. I do think we should investigate this and decide if the effort to productionize is worth the impact we are able to estimate in the relevance lab.
We need training and evaluation data. I see a few ways of getting it. The easy, lower-quality way is just take queries from a given wiki and assume they are in the language in question (i.e., eswiki queries are in Spanish). Easy, not 100% accurate, unlimited supply. The hard, higher-quality way is to hand annotate a corpus of queries. This is slow, but doable. I can do on the order of 1000 queries in a day—more if I were less accurate and more willing to toss stuff into the junk pile. I couldn't do it for a week straight, though, without going crazy. A possible middle of the road approach would be to create a feedback loop and run detectors on our training data and review and remove items that are not in the desired language (we could also start by filtering things that are not in the right character set, like removing all Arabic, Cyrillic, and Chinese from enwiki, frwiki, and eswiki queries). If we want thousands of hand-annotated queries, we need to get annotating!
This is probably the biggest sticking point. Another random idea: we have speakers of several languages on the team and in the foundation (as in, under NDA and able to review queries that are PII). Would it be enough to grab example queries from wikis in the correct language and then have someone who knows the language filter through them and delete nonsensical / wrong-language queries? I'm guessing this would go faster, but I'm not sure it's as valuable.
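For the character-set pre-filter mentioned above, a minimal sketch (assuming Python, and covering only the basic Unicode blocks for Arabic, Cyrillic, and CJK; a real filter would need more ranges):

import re

# Basic Arabic, Cyrillic, and CJK blocks only; extend as needed.
NON_LATIN = re.compile(r"[\u0400-\u04FF\u0600-\u06FF\u4E00-\u9FFF]")

def looks_latin(query):
    return not NON_LATIN.search(query)

# filtered = [q for q in raw_enwiki_queries if looks_latin(q)]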
I think we can use the relevance lab to help evaluate a language detector (at least with respect to zero results rate). We could run the detector against a pile of zero-results queries, then group the queries by detected language, and run them against the relevant wiki (if we have room in labs for the indexes, and we update the relevance lab tools to support choosing a target wiki to search). We wouldn't be comparing "before" and "after", but just measuring the zero results rate against the target wiki. As always when we're using the zero-results rate, there's no guarantee that we'll be giving good results, just results (e.g., "unix time stamp" queries with English words fail on enwiki but sometimes work on zhwiki for some reason, but that's not really better).
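As an illustration of that measurement (this is not relevance-lab code; it just hits the public search API, and the grouping-by-detected-language step is assumed to happen elsewhere):

import requests

def total_hits(query, host):
    # host is e.g. "es.wikipedia.org"; uses the public search API.
    r = requests.get(
        "https://" + host + "/w/api.php",
        params={"action": "query", "list": "search", "srsearch": query,
                "srlimit": 1, "format": "json"},
        timeout=10,
    )
    return r.json()["query"]["searchinfo"]["totalhits"]

def zero_results_rate(queries, host):
    zero = sum(1 for q in queries if total_hits(q, host) == 0)
    return zero / len(queries) if queries else 0.0

# zero_results_rate(queries_detected_as_spanish, "es.wikipedia.org")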
I'm somewhat worried about being able to reduce the targeted zero results rate by 10%. In my test[1], only 12% of non-DOI zero-results queries were "in a language", and only about a third got results when searched in the "correct" (human-determined) wiki. I didn't filter bots other than the DOI bot, and some non-language queries (e.g., names) might get results in another wiki, but there may not be enough wiggle room. There's a lot of junk in other languages, too, but maybe filtering bots will help more than I dare presume.
I'm also worried about that portion, but perhaps a nuanced reading could help us? If a 10% increase in satisfaction is 15% -> 16.5%, then a 10% reduction in ZRR is 30% -> 27%. We don't yet have the numbers for non-automata, so it's harder to say exactly what it is, but we finally have the data in Hadoop, which should make it possible to determine non-automata-related issues.
[1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_S...
Trey Jones Software Engineer, Discovery Wikimedia Foundation
Trimming down my reply to certain topics...
On Wed, Nov 4, 2015 at 3:20 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones tjones@wikimedia.org wrote:
There are several proposals for improving language detection in the etherpad, and we can work on them in parallel
My worry here is that we would then need to productionize it. Several of the options I see are basically libraries that we would have to build a service (or ES plugin) around. I do think we should investigate this and decide whether the effort to productionize is worth the impact we are able to estimate in the relevance lab.
Yep—I always had language detection and translation as use cases in mind when thinking about the relevance lab. We can test a lot of stuff without productionizing it, which means it's less work to try stuff out and we don't have to commit early.
We need training and evaluation data.
This is probably the biggest sticking point. Another random idea: we have speakers of several languages on the team and in the foundation (as in, under NDA and able to review queries that are PII). Would it be enough to grab example queries from wikis in the correct language and then have someone who knows the language filter through them and delete nonsensical / wrong-language queries? I'm guessing this would go faster, but I'm not sure it's as valuable.
This is a good idea if people are willing to do it, and it's faster and easier if you have only two buckets ("this language" and "not this language") because anything you don't recognize automatically goes into "not this language". You don't have to be a great speaker of the language to do a good job, either.
We also need to think about whether we want general language identification, or if we want to tailor it per wiki for better results. At the most coarse-grained, think "Romanian" on enwiki vs rowiki. But there is also the matter of what languages actually appear in queries on each wiki. So, should we limit to the 10 most common non-English query languages on enwiki? (So we can correctly say "your query is in X but didn't get results on X wiki"?) Or the 10 most likely to get results on the right wiki? (So we can give more results.) Limiting the scope limits the data we need to collect, and increases precision (and probably recall) for enwiki, but the resulting detector can't be used on other wikis (and probably can't be used without modification on other wikis that are in English!), though the training data can be reused.
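A sketch of what per-wiki tailoring could look like; the language sets below are placeholders, not the actual top-10 lists for any wiki:

# Placeholder language sets; the real lists would come from query analysis.
ALLOWED_LANGS = {
    "enwiki": {"es", "fr", "de", "pt", "ru", "ja", "ar", "zh", "it", "pl"},
    "eswiki": {"en", "pt", "fr", "de", "it"},
}

def tailored_detect(detect, query, wiki):
    guess = detect(query)  # detect() is whatever general detector we pick
    return guess if guess in ALLOWED_LANGS.get(wiki, set()) else "unknown"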
We should probably talk through this a bit more.
I'm somewhat worried about being able to reduce the targeted zero results rate by 10%. In my test, only 12% of non-DOI zero-results queries were "in a language", and only about a third got results when searched in the "correct" (human-determined) wiki. I didn't filter bots other than the DOI bot, and some non-language queries (e.g., names) might get results in another wiki, but there may not be enough wiggle room. There's a lot of junk in other languages, too, but maybe filtering bots will help more than I dare presume.
I'm also worried about that portion, but perhaps a nuanced reading could help us? If a 10% increase in satisfaction is 15% -> 16.5%, then a 10% reduction in ZRR is 30% -> 27%. We don't yet have the numbers for non-automata, so it's harder to say exactly what it is, but we finally have the data in Hadoop, which should make it possible to determine non-automata-related issues.
Yeah, we need to be able to effectively sample what we want to affect so we can gauge how well anything we try actually works.
Trey Jones Software Engineer, Discovery Wikimedia Foundation
Summarising this discussion, it seems like the path forward which would reap the most rewards is as follows:
1. Finish the MVP of the relevance lab; right now we can only test the zero results rate for any given experiment, and the lab will help us also test result relevance.
2. Start writing tests to switch out the language detector used in the first test with alternative ones, to see if they're better.
   - This should affect the zero results rate, so lack of the relevance lab does not block this.
   - This should also affect relevance (at least conceptually), so it can be tested using the relevance lab as well.
3. Write a test to use the accept-language header as a heuristic to do language switching (rather than language detection); see the sketch after this list.
   - This should affect the zero results rate, so lack of the relevance lab does not block this.
   - This should also affect relevance (at least conceptually), so it can be tested using the relevance lab as well.
4. Expand the original language switching test to also switch if there are "few" results (let's say "few" = 3 or fewer).
   - Does not really affect the zero results rate; this is dependent on the relevance lab.
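A rough sketch of the accept-language heuristic in item 3 (illustrative only; the production logic would live in CirrusSearch, and the header parsing here is simplified):

def preferred_fallback_lang(accept_language, wiki_lang):
    # Return the highest-weighted language in the header that isn't the
    # wiki's own content language, or None.
    candidates = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        lang, _, q = piece.partition(";q=")
        lang = lang.split("-")[0].lower()
        try:
            weight = float(q) if q else 1.0
        except ValueError:
            weight = 1.0
        candidates.append((weight, lang))
    for _, lang in sorted(candidates, reverse=True):
        if lang != wiki_lang:
            return lang
    return None

# preferred_fallback_lang("de-DE,de;q=0.9,en;q=0.7", "en") -> "de"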
Any objections to this course of action? I plan to file tasks for these mid-Monday morning.
Thanks, Dan
Seems reasonable to me. I'm not sure what to do with 1 & 2 yet, so I've started pulling queries out of Hive for 3 (the accept-language stuff).
Erik B.
I thought the MVP of the relevance lab could only test zero-results. Isn't there a fair bit more effort required for it to also be able to test some measure of "relevance"?
Hopefully I'm mistaken.
Kevin Smith Agile Coach, Wikimedia Foundation
The current relevance lab stuff can kinda sorta do more than zero results, but it does nothing fancy with it. It basically pretty-prints the JSON results and outputs HTML diffs of the old vs. new results. Currently the summary surfaces a couple of the diffs that had the largest change.
I'm not sure how far things would have to be taken to do full relevance comparisons (essentially the annotated corpus, if I understand correctly). I think there is some middle ground, but I'm not sure where it is.
On Mon, Nov 9, 2015 at 12:59 PM, Kevin Smith ksmith@wikimedia.org wrote:
I thought the MVP of the relevance lab could only test zero-results. Isn't there a fair bit more effort required for it to also be able to test some measure of "relevance"?
Hopefully I'm mistaken.
Just including zero results rate was the baseline plan for the MVP, but I tried to make it maximally general, so it's a special case of an abstract metrics class, and I included 4 metrics, none of which are very complex: counting queries (so an empty string doesn't count as a "query"), zero results, top-5 diff ordered (i.e., the top 5 moved around or were replaced), and top-5 unordered (i.e., any of the top 5 were kicked out of the top 5, but shuffling doesn't matter).
There's a top-N ordered or unordered class, so changing or adding a different number is trivial. The main work for new metrics is writing a function that determines whether the metric applies, given the two JSON blobs; there's also some easy stuff like stating whether it's symmetrical and setting some output parameters for examples.
So, metrics that don't require human thought (like, total results, or changes in order) are easy to add. Metrics that require human thought and external annotations (like, is this particular desired pageId included) are harder.
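For illustration only (these are not the actual relevance-lab classes), here are the two top-N comparisons described above, applied to lists of page IDs extracted from the before/after JSON blobs:

def top_n_changed_ordered(before_ids, after_ids, n=5):
    # True if the top n results differ at all, including pure reordering.
    return before_ids[:n] != after_ids[:n]

def top_n_changed_unordered(before_ids, after_ids, n=5):
    # True only if something entered or left the top n; reshuffling is ignored.
    return set(before_ids[:n]) != set(after_ids[:n])

# top_n_changed_ordered([1, 2, 3], [2, 1, 3])   -> True
# top_n_changed_unordered([1, 2, 3], [2, 1, 3]) -> False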
I'm thinking about language detection and preferred result annotations and how to include them and diff them.
I filed epics for each of these:
- T118278 https://phabricator.wikimedia.org/T118278 - EPIC: Run additional A/B tests for the language switching functionality, with different libraries to detect the query's language, and evaluate if the other libraries are better
- T118280 https://phabricator.wikimedia.org/T118280 - EPIC: Run A/B test to determine whether using the accept-language header of the user to switch query languages is good or not
- T118281 https://phabricator.wikimedia.org/T118281 - EPIC: Run original language switching A/B test, but switch languages if the user has fewer than n results (for some n), and determine if that's better or worse
We'll be breaking these epics down into specific actionables in the sprint planning meeting in 25 minutes.
Thanks, Dan
There are now a whole bunch of tasks in the Cirrus board (and also some accompanying tasks in the Analysis board). They were thrown together quite quickly, so if there's any uncertainty about their scope, or it's unclear what you need to do to move on a task, please reach out or comment on the task and we can add extra definition.
Thanks, Dan
The short version is that one epic (showing results from more than one wiki) seemed to be quick and easy to implement, so we'll take that directly to A/B testing. The other two will be run through the relevance lab to see if they are promising enough to take to A/B testing.
Kevin Smith Agile Coach, Wikimedia Foundation