Wikimedia-search August 2015

wikimedia-search@lists.wikimedia.org

20 participants
21 discussions

On frequency of A/B tests and peeking at the data early

by Mikhail Popov

Hi all,

Last week we discussed our approach to A/B testing and we've decided to have a week (at least) between tests. A two-week-minimum cadence will give the analysis team enough time to thoroughly think about the experimental design of each test, as well as give the engineers enough time to implement it. Which is great because some of the changes we are planning to test are not trivial and we don't want to rush a test out and realize halfway through that we should have been tracking something we're not.

We are also going to move away from doing initial analyses (analysis of the data from the morning of a launch) for practical and scientific reasons. Practical in the sense that we've been putting time and effort into getting preliminary results that are not representative of final results whatsoever while putting other work on the backburner. Scientific in the sense that peeking at the data mid-experiment is bad science:

*Repeated significance testing always increases the rate of false positives, that is, you’ll think many insignificant results are significant (but not the other way around). The problem will be present if you ever find yourself “peeking” at the data and stopping an experiment that seems to be giving a significant result. The more you peek, the more your significance levels will be off. For example, if you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5% significance.* – Evan Miller, How Not To Run An A/B Test <http://www.evanmiller.org/how-not-to-run-an-ab-test.html>

In science, it's a problem called multiple comparisons. The more tests you perform, the more likely you are to see something where there is nothing. Going forward, we are going to wait until we have collected all the data before analyzing it.

Cheers,
Mikhail, Junior Swifty
Discovery // The Swifties
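The effect of peeking described in the message above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the team's analysis code: it simulates A/A tests (no real difference between groups), treats each equal-sized batch of data as adding a standard-normal increment to the test statistic (a normal approximation), and uses the hypothetical function name false_positive_rates. It compares how often a "significant" result shows up if you check at every peek versus only once at the end.

```python
import math
import random
from statistics import NormalDist


def false_positive_rates(n_experiments=20000, n_peeks=10, alpha=0.05, seed=0):
    """Simulate A/A tests (no true effect) under two analysis policies.

    Returns (rate_with_peeking, rate_final_look_only).
    """
    rng = random.Random(seed)
    # Two-sided critical value for the nominal significance level.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        crossed_any = False
        z = 0.0
        for k in range(1, n_peeks + 1):
            # Under the null, each equal-sized batch adds a standard-normal
            # increment, so the z-statistic after k batches is sum / sqrt(k).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(k)
            if abs(z) > z_crit:
                # Someone peeking here would stop and declare a winner.
                crossed_any = True
        peeking_hits += crossed_any
        final_hits += abs(z) > z_crit
    return peeking_hits / n_experiments, final_hits / n_experiments


if __name__ == "__main__":
    peeking, final_only = false_positive_rates()
    print(f"false positive rate with 10 peeks:    {peeking:.3f}")
    print(f"false positive rate, final look only: {final_only:.3f}")
```

With the defaults, the final-look-only rate should stay close to the nominal 5%, while the peeking rate should come out several times higher. Lowering alpha to 0.01 gives a rough check on the figure quoted from Evan Miller.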

8 years, 7 months

Fwd: [Engineering] Save the date: Gerrit Cleanup Day: Wed, Sep 23

by Kevin Smith

As a reminder, the entire Discovery department will be participating in this. As it gets closer, we can make sure everyone has a plan for how to use the day most productively.

Kevin Smith
Agile Coach, Wikimedia Foundation

---------- Forwarded message ----------
From: Andre Klapper <aklapper(a)wikimedia.org>
Date: Mon, Aug 31, 2015 at 3:24 PM
Subject: [Engineering] Save the date: Gerrit Cleanup Day: Wed, Sep 23
To: Development and Operations Engineers <engineering(a)lists.wikimedia.org>

I'm happy to announce a Gerrit Cleanup Day on Wed, September 23. It's an experiment to reduce Wikimedia's code review backlog which hurts growing our long-term code contributor base.

All dev/eng teams are supposed to join and use the day to primarily review recently submitted open Gerrit changesets without a review, focussing on volunteer contributions.

Please save the date! The event is also in the "WMF Engineering" Google Calendar so you could copy it to your personal calendar.

https://phabricator.wikimedia.org/T88531 provides more information, steps, links. Note it's still work in progress. Your questions and feedback are welcome.

Thanks,
andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/

8 years, 7 months

Tasks outside of Phabricator aren't real

by Mikhail Popov

Hi all, This is a gentle reminder that if your task doesn't exist as a Phabricator ticket, then the person/team who can get it done doesn't know there's an actual need for it. Filing a ticket lets us avoid situations like "Person A said they think Person B is working on Thing X" while B has no idea X was even a thing. Thanks, Mikhail

8 years, 8 months

Completion suggestion API demo

by Erik Bernhardson

We have been working on a replacement autocompletion API that is more forgiving than a strict prefix search. The scoring algorithms have a long way to go, but we have the first run-through of building the completion index for enwiki, so I thought I would share. Here are a couple of examples; feel free to change the text= around to other things. http://cirrus-browser-bot.wmflabs.org/w/api.php?action=cirrus-suggest&limit… http://cirrus-browser-bot.wmflabs.org/w/api.php?action=cirrus-suggest&limit… http://cirrus-browser-bot.wmflabs.org/w/api.php?action=cirrus-suggest&limit… Don't stress out too much over the scoring yet; we know it needs some work and have plans to integrate page view information in here to help more popular articles rise to the top. Erik B
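To make "more forgiving than a strict prefix search" concrete, here is a toy comparison. It is purely illustrative and makes no claim about how the cirrus-suggest demo is implemented: the title list, the function names, and the use of Python's difflib similarity ratio with a 0.8 cutoff are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a title index.
TITLES = ["Einstein", "Einsteinium", "Eisenhower", "Elephant"]


def strict_prefix(query, titles):
    """Return titles that start with the query exactly (case-insensitive)."""
    q = query.lower()
    return [t for t in titles if t.lower().startswith(q)]


def forgiving_prefix(query, titles, cutoff=0.8):
    """Return titles whose leading characters are similar to the query.

    Each title is truncated to the query's length and compared using
    difflib's similarity ratio, so small typos still produce matches.
    """
    q = query.lower()
    scored = []
    for title in titles:
        prefix = title.lower()[: len(q)]
        score = SequenceMatcher(None, q, prefix).ratio()
        if score >= cutoff:
            scored.append((score, title))
    # Best score first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


if __name__ == "__main__":
    print(strict_prefix("einstien", TITLES))
    print(forgiving_prefix("einstien", TITLES))
```

A misspelled query such as "einstien" finds nothing with the strict matcher but still surfaces the Einstein titles with the forgiving one. Ranking among the matches is the separate scoring problem the message says still needs work.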

8 years, 8 months

Measuring user satisfaction while reducing it at the same time?

by Max Semenik

While doing CR for https://gerrit.wikimedia.org/r/#/c/232896/3/modules/ext.wikimediaEvents.sea… I came to have serious doubts about this approach. In brief, it attempts to track user satisfaction with search results by measuring how long people stay on pages. It does that by appending fromsearch=1 to links for 0.5% of users. However, this results in page views being uncached, which increases HTML load time by a factor of 4-5 and, consequently, pushes even short pages' first paint outside the comfort zone of 1 second - and that's measured from the office, with a ping of 2-3 ms to ulsfo. My concern here is that we're trying to measure the very metric we're screwing with, which makes the experiment inaccurate. Can we come up with a way of measuring that's less intrusive, or alter the requirements of the experiment? -- Best regards, Max Semenik ([[User:MaxSem]])
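For readers unfamiliar with the mechanism under discussion, the following sketch shows one way a 0.5% sample of users could have fromsearch=1 appended to their result links. It is not the code from the changeset under review (that is JavaScript in the WikimediaEvents extension). The hash-based bucketing, the user_token argument, and the function names are assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SAMPLE_RATE = 0.005  # 0.5% of users, the rate mentioned in the message


def in_sample(user_token, rate=SAMPLE_RATE):
    """Deterministically place a user in or out of the sample.

    Hashing a (hypothetical) per-user token gives a stable pseudo-random
    number in [0, 1), so the same user always gets the same decision.
    """
    digest = hashlib.sha256(user_token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def tag_link(url, user_token, rate=SAMPLE_RATE):
    """Append fromsearch=1 to a link if the user is in the sample."""
    if not in_sample(user_token, rate):
        return url
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("fromsearch", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(tag_link("https://en.wikipedia.org/wiki/Foo", "user-123", rate=1.0))
```

The tagged URL differs from the canonical page URL, and according to the message that is what makes those page views miss the cache. An approach that records the referring search without changing the page URL would avoid that side effect.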

8 years, 8 months

Zero Results Rate—One Month Followup

by Trey Jones

Hey everyone, I've re-run my "big" wiki zero result rate numbers to see what has changed in the last month. The results are here: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul… Since I was only looking at the big 52 wikis (100K+ articles), the zero results rate is under 20% (good news), but it hasn't gone down in a month (bad news). I looked very briefly at the full text zero results rate for the rest of the wikis for yesterday. That zero results rate was 56.6%! Lots of hits from wikidata and itwiki-things! There were some DOI queries, but none of the other usual suspects. —Trey Trey Jones Software Engineer, Discovery Wikimedia Foundation

8 years, 8 months

Academic paper comparing Wikipedia's search engine with natural language question search engines

by Tilman Bayer

FYI just in case it's of interest and hasn't shown up on the team's radar yet: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7194368 - paywalled, unfortunately. Quote from the abstract: "This paper discusses expressivity and accuracy of the By-Example Structured (BESt) Query paradigm implemented on the SWiPE system through the Wikipedia interface. We define an experimental setting based on the natural language questions made available by the QALD-4 challenge, in which we compare SWiPE against Xser, a state-of-the-art Question Answering system, and plain keyword search provided by the Wikipedia Search Engine. The experiments show that SWiPE outperforms the results provided by Wikipedia, and it also performs sensibly better than Xser, obtaining an overall 85% of totally correct answers vs. 68% of Xser." (For context, there's an earlier paper where they describe an earlier version of that SWiPE - "Search Wikipedia by example" - project: http://web.cs.ucla.edu/~zaniolo/papers/AtzoriZ12 ) -- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB

8 years, 8 months

Final results of the first A/B test

by Oliver Keyes

Hey all,

Several weeks ago we ran an A/B test to try and decrease the number of searches on Wikipedia returning zero results. This consisted of a small config change that reduced the confidence needed for our systems to provide search results, along with a change to the smoothing algorithm used to bump the quality of the results now provided. Our intent was not only to reduce the zero results rate but also to prototype the actual process of A/B testing and identify issues we could fix to make future tests easier and more reliable - this was the first A/B test we had run.

5% of searches were registered as a control group, and an additional 5% subject to the reduced confidence and smoothing algorithm change. An initial analysis over the first day's 7m events was run on 7 August, and a final analysis looking at an entire week of data was completed yesterday. You can read the full results at https://github.com/wikimedia-research/SuggestItOff/blob/master/initial_anal… and https://github.com/wikimedia-research/SuggestItOff/blob/master/final_analys…

Based on what we've seen we conclude that there is, at best, a negligible effect from this change - and it's hard to tell if there even is an effect at all. Accordingly, we recommend the default behaviour be used for all users, and the experiment disabled.

This may sound like a failure, but it's actually not. For one thing, we've learned that the config variables here probably aren't our venue for dramatic changes - the defaults are pretty sensible. If we're looking for dramatic changes in the rate, we should be applying dramatic changes in the system's behaviour. In addition, we identified a lot of process issues we can fix for the next round of A/B tests, making it easier to analyse the data that comes in and making the results easier to rely on. These include:

1. Who the gods would destroy, they first bless with real-time analytics. The dramatic difference between the outcome of the initial and final analysis speaks partly to the small size of the effect seen and the power analysis issues mentioned below, but it's also a good reminder that a single day of data, however many datapoints it contains, is rarely the answer. User behaviour varies dramatically depending on the day of the week or the month of the year - a week should be seen as the minimum testing period for us to have any confidence in what we see.

2. Power analysis is a must-have. Our hypothesis for the negligible and totally varying size of the effect is simply the amount of data we had; when you're looking at millions upon millions of events for a pair of options, you're going to see patterns - because with enough data stared at for long enough, pretty much anything can happen. That doesn't mean it's /real/. In the future we need to be setting our sample size using proper, a-priori power analysis - or switching to Bayesian methods where this sort of problem doesn't appear.

These are not new lessons to the org as a whole (at least, I hope not) but they are nice reminders, and I hope that sharing them allows us to start building up an org-wide understanding of how we A/B test and the costs of not doing things in a very deliberate way.

Thanks,

--
Oliver Keyes
Count Logula
Wikimedia Foundation
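The a-priori power analysis called for in the message above can be sketched as follows. This is a minimal illustration, not the analysis the team ran: it uses the standard normal-approximation formula for a two-sided test comparing two proportions, and the function name and example rates (a 20% zero results rate in the control group against a hoped-for 19%) are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.8):
    """Observations needed per group for a two-sided two-proportion z-test.

    Uses the usual normal-approximation formula with a pooled variance
    under the null hypothesis and separate variances under the alternative.
    """
    if p_control == p_test:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_test) / 2
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_control * (1 - p_control) + p_test * (1 - p_test))
    n = ((z_alpha * null_sd + z_power * alt_sd) / (p_control - p_test)) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical rates: detect a drop in zero results rate from 20% to 19%.
    print(sample_size_per_group(0.20, 0.19))
```

Fixing the sample size this way before launch ties the amount of data to the smallest effect worth detecting, instead of collecting millions of events and then searching them for a pattern.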

8 years, 8 months

Re: [Wikimedia-search] Academic paper comparing Wikipedia's search engine with natural language question search engines

by Tilman Bayer

On Tue, Aug 25, 2015 at 7:58 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> So it's a comparison of two search systems, neither of which we use?

It's a comparison of three search systems, one of which is our in-built search, which they claim is inferior to their system in that setting ("The experiments show that SWiPE outperforms the results provided by Wikipedia").

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

8 years, 8 months

Heads-up about a new work categorization pilot

by Kevin Smith

Hello Discovery, Management is requesting that foundation engineering teams consider tracking how much time we are spending on "maintenance" vs. "new work". As a first step in that direction, we have been asked to categorize our work into those two buckets. This should not require a lot of effort, and for most of you (non leads) won't require any effort at all, at least for now. But I wanted you to be aware of it. You'll start to see phab tasks tagged with one of two new tags, which will probably be #WorkTypeMaintenance and #WorkTypeNewFunctionality. You can ignore them in your day-to-day work. The idea is to experiment with this approach for a while, and then evaluate whether it should be continued, modified, or dropped. Kevin Smith Agile Coach, Wikimedia Foundation

8 years, 8 months
