Dear all,
Last week we ran an A/B test over changes to how our search system
provides results (and we look forward to sharing the results of that
with you shortly). Today, we're launching a second A/B test.
This test looks at the "phrase slop" setting within our search
infrastructure, a thoroughly disgusting term that simply refers to the
"distance" in words between a search query and a match. For example,
the search term "Ben Folds Five" has a "phrase slop" of 0 to the match
"Ben Folds Five". The search term "Ben Five" has a "phrase slop" of 1:
there is one word ("Folds") of distance between the search term and the
match.
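As a rough illustration of the "distance" being measured (a simplification - Elasticsearch's actual slop semantics are richer and also allow transpositions), the missing-words case described above can be sketched as:

```python
def phrase_slop(query, match):
    """Return the number of extra words separating the query terms
    in the match, or None if the query terms don't appear in order.
    Illustrative only -- not the real Elasticsearch implementation."""
    q = query.lower().split()
    m = match.lower().split()
    positions = []
    start = 0
    for token in q:
        try:
            idx = m.index(token, start)  # find each query token, in order
        except ValueError:
            return None  # query terms don't all appear in order
        positions.append(idx)
        start = idx + 1
    # slop = words skipped between consecutive matched positions
    return sum(b - a - 1 for a, b in zip(positions, positions[1:]))

print(phrase_slop("Ben Folds Five", "Ben Folds Five"))  # 0
print(phrase_slop("Ben Five", "Ben Folds Five"))        # 1
```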
We are going to experiment with altering the phrase slop setting
for 3% of users: 1% will have a slop of 0 (the current
default), 1% a slop of 1, and 1% a slop of 2. The hope is that by
tweaking this setting we can measurably reduce the number of search
queries that return 0 results by broadening the conditions under which
something is considered a match.
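This mail doesn't describe how users are sampled into the three buckets; a common approach, shown here purely as a hypothetical sketch, is deterministic hashing of a user token:

```python
import hashlib

def assign_bucket(user_token):
    """Deterministically assign a user to a test arm.
    Hypothetical sketch -- the actual sampling mechanism used by
    the search team isn't described in this mail."""
    h = int(hashlib.md5(user_token.encode()).hexdigest(), 16) % 100
    if h == 0:
        return "slop_0"   # 1% of users: slop 0 (current default)
    elif h == 1:
        return "slop_1"   # 1% of users: slop 1
    elif h == 2:
        return "slop_2"   # 1% of users: slop 2
    return "control"      # remaining 97% unchanged
```

Hashing (rather than random assignment per request) keeps each user in the same arm for the duration of the test.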
This test will run for a week, and kick off at 4pm PST. Once it's
completed we'll share the results publicly, as is the norm for our A/B
tests. Huge thanks go particularly to Trey Jones and David Causse for
their feedback on what settings we could alter, Mikhail Popov for his
work on experimental design, and Erik Bernhardson for both his
feedback and turning around necessary changes to the search
infrastructure on such short notice. For context, we began planning
this test last Thursday morning - a one-week turnaround on design and
implementation is phenomenal.
Thanks,
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Hey all,
Until recently there was only one analyst working with Discovery - me!
Accordingly it was very easy to work out who was responsible for our
various projects (KPIs, dashboarding, ad-hoc support): get Dan to tell
Oliver to...etc, etc, etc.
As people may have noticed we recently hired Mikhail Popov, who has
been doing a fantastic job not only backstopping me on our work but
also taking the lead on some chunks. This offers an opportunity to
reduce the points of failure, here, and reward good work with
responsibility (and therefore more work. Sorry, Mikhail. You should've
thought of that before you demonstrated competence).
Accordingly, we're mixing things up a bit.
If you have questions relating to the KPIs (particularly the User
Satisfaction stuff), our data-gathering backends, defining new
metrics, or ad-hoc support, I remain your primary POC for questions.
If you have questions regarding the existing search dashboards,
setting up new ones within Discovery, or setting up new ones
elsewhere, Mikhail is the person to talk to.
It almost goes without saying, but to say it anyway: this does not
change the existing process for asking for substantive help. If work
needs to be done, put it in as a Phabricator task or talk to Dan about
it and have /him/ do that, and he'll then apportion work broadly
along these lines. The change here is for quick questions,
transparency around who is doing what, and for the horrible edge case
where something breaks in a hideous way and you need to talk to an
expert about it Right Now.
Happy Sunday,
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Assalamualikum,
I am Hussein Ali from Syria, presently now with the United Nations
on asylum. I got your contact from a web business directory on
investment. Please I seek your assistance in the following ways:
1.To assist me look for a profitable business in your country (where I can
invest to sustain my living until the political crisis in my country is
over).
2. To assist me purchase a living home. I have a huge sum of fifteen
million US dollars in a financial institution. Should there be a need for
evidence, or proof of my seriousness and genuineness, I have a
Certificate of Deposit as proof of funds.
Please assist me to come over to your country for resettlement and
investment. I will compensate you greatly for this help. I am also ready to
associate with a local partner, provided
Your Government will give me a Residence Permit.
Could you please send me an email on ( syriaoil.aleppo(a)gmail.com ) to enable me know you
have received my email.
Regards,
Hussein Ali.
Hi,
I'm working on a score function to rank suggestions with the new API
we're building.
I'd like to share with you some data. The initial score function uses
the variables available today.
If anyone wants to have a look, you can find a dataset (from simplewiki)
here[1], and a small R script to play with the score here[2].
The score correctly discounts pages that have a high number of
incoming links (like small villages that all link to each other), but fails
to discount "List" articles and date/year articles. I'm not sure how to
deal with that. In the data set you'll find two columns named "good" and
"very good"; they are set to true when the article is flagged with
Template:Good or Template:Very Good.
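I haven't seen score.R's exact formula, so the following Python sketch only illustrates the two behaviours discussed above; the function names and constants are hypothetical:

```python
import math

def score(incoming_links, is_good=False):
    """Hypothetical sketch, NOT the actual formula in score.R.
    Log-damping the incoming-link count keeps tightly interlinked
    clusters (e.g. small villages linking to each other) from
    dominating; articles flagged Template:Good get a boost."""
    return math.log1p(incoming_links) * (1.5 if is_good else 1.0)

def title_penalty(title):
    """A naive heuristic for the open problem above: demote "List"
    articles and bare year titles. Purely illustrative."""
    if title.startswith("List of") or title.isdigit():
        return 0.1
    return 1.0
```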
Thanks!
[1] https://people.wikimedia.org/~dcausse/simplewiki_score_vars.csv
[2] https://github.com/nomoa/suggester-prototype/blob/master/score.R
*Background:*
In the team leads sync today it was suggested that we consider using Google
Calendar to track releases.
In addition to this, we have several Phabricator boards, plus pages on
Meta, Office, and Tech, plus some other stuff I'm probably forgetting.
*Problem:*
It's hard to keep track of everything that's going on, and it's hard to
keep individual work in context of both short-term and long-term plans.
*Proposal:*
Ideally (in the spirit of lessons learned from the book Getting Things
Done), there would be a single resource that we trust to encapsulate
everything we need to know about Discovery from an engineering standpoint.
Could we feasibly create such a dashboard?
Alternatively (much more realistically), where might we create a root-level
landing page to organize links to all the various tools that we use?
*Related:*
https://xkcd.com/927/
Cross-posting from wikitech-l. Please discuss there.
Dan
---------- Forwarded message ----------
From: Dan Garry <dgarry(a)wikimedia.org>
Date: 10 August 2015 at 14:36
Subject: Maximum search query length coming soon
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hello!
The Search Team in the Discovery Department is implementing a maximum
search query length <https://phabricator.wikimedia.org/T107947>. There are
two main reasons to do this:
1. Extremely long queries are almost always gibberish from things like
malfunctioning scrapers. These queries skew our statistics about the
usefulness of our search. Implementing a limit will reduce the magnitude of
skew.
2. Extremely long queries have a disproportionate impact on performance.
On its own this isn't enough, but considering point 1 above, limiting them
is unlikely to impact any actual users. Implementing a limit will improve
performance.
We've chosen a hard limit of 300 characters. If your query exceeds this,
you will be told that your query exceeds the maximum length. Based on our
analysis of typical query lengths
<https://phabricator.wikimedia.org/T107947#1515387>, this change should
impact almost nobody. If you think you'll be adversely affected, please
reach out to us and we'll work with you to figure something out.
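As a sketch of the kind of guard being added (the real change lives in CirrusSearch; the function and message here are hypothetical):

```python
MAX_QUERY_LENGTH = 300  # hard limit described above

def validate_query(query):
    """Reject queries over the limit with a user-facing message.
    Hypothetical sketch of the check, not the CirrusSearch code."""
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            "Query exceeds the maximum length of %d characters."
            % MAX_QUERY_LENGTH
        )
    return query
```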
Thanks!
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Cross-posting from wikitech-l. If you have any questions or comments,
please post them there.
Thanks,
Dan
---------- Forwarded message ----------
From: Dan Garry <dgarry(a)wikimedia.org>
Date: 7 August 2015 at 13:19
Subject: Discovery Department running A/B tests for search suggestions
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hello!
As part of our goal to reduce the zero results rate
<https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q1_Goals#Search>,
the Discovery Department is currently running an A/B test to try different
parameters for the search suggester. We're hoping that our new parameters
will give users more suggestions without decreasing their quality.
We chose to tweak the suggestions because of our recent
work <https://phabricator.wikimedia.org/T105202> to automatically run
queries for the user if they get zero results but have a suggestion. The
purpose of this A/B test is to determine whether this has a significant
impact on achieving our goal.
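The zero-results fallback behaviour can be sketched as follows (`search` and `suggest` are hypothetical stand-ins for the real backend calls):

```python
def search_with_fallback(query, search, suggest):
    """If a query returns zero results but the engine has a
    'did you mean' suggestion, run the suggestion automatically.
    Returns (results, query_actually_used)."""
    results = search(query)
    if results:
        return results, query
    suggestion = suggest(query)
    if suggestion and suggestion != query:
        return search(suggestion), suggestion
    return results, query

# Example with fake backends:
fake_search = lambda q: ["hit"] if q == "ben folds five" else []
fake_suggest = lambda q: "ben folds five"
print(search_with_fallback("ben flods five", fake_search, fake_suggest))
```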
This is the first A/B test that the Discovery Department has run, so we're
still ironing out the process. We hope to run many more A/B tests in the
future.
For further information on this, please review the associated Phabricator
task <https://phabricator.wikimedia.org/T108103>.
If you have any questions, I'd be happy to answer them.
Thanks,
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Pleased to announce that the dashboard now has a KPI module that should be
the first thing y'all see when you go to
http://searchdata.wmflabs.org/metrics/
- The currently-functional widgets (load time, zero results rate, api
usage) adjust their visual style to reflect good or bad changes since
yesterday.
- The bar showing breakdown of API usage is staying here for now until
we find a better place for it.
One more thing! The individual dashboard tabs can now be linked to. So if
you need to show somebody the zero results summary page, you can navigate
to it and find a link at the bottom that you can copy and paste like this:
http://searchdata.wmflabs.org/metrics/#failure_rate
Cheers~
Mikhail
--
*Mikhail Popov* // Data Scientist, Discovery
<https://www.mediawiki.org/wiki/Wikimedia_Discovery>
https://wikimediafoundation.org/
*Imagine a world in which every single human being can freely share in
the sum of all knowledge. That's our commitment.* Donate
<https://donate.wikimedia.org/>.
moving to mobile-l, and cc Search & Discovery.
---------- Forwarded message ----------
From: Dmitry Brant <dbrant(a)wikimedia.org>
Date: Wed, Jul 29, 2015 at 3:38 PM
Subject: "Morelike" suggestions - the results are in!
To: Internal communication for WMF Reading team <
reading-wmf(a)lists.wikimedia.org>
Hi all,
For the last few weeks, we've had an A/B test in the Android app where we
measure user engagement with the "read more" suggestions that we show at
the bottom of each article. We display three suggestions for further
reading, based on either (A) a plain full-text search query based on the
title of the current article, or (B) a query using the "morelike" feature
in CirrusSearch.
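For the curious, the two arms differ only in the search string sent to the API. A sketch of the request shape, assuming the standard MediaWiki search API (the app's actual code may differ):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # any wiki's api.php works

def read_more_query(title, variant):
    """Build the search request behind each arm of the test.
    Arm "A": plain full-text search on the article title.
    Arm "B": CirrusSearch's "morelike:" keyword."""
    srsearch = title if variant == "A" else "morelike:" + title
    params = {
        "action": "query",
        "list": "search",
        "srsearch": srsearch,
        "srlimit": 3,  # three suggestions, as in the app
        "format": "json",
    }
    return API + "?" + urlencode(params)
```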
And the winner is... (perhaps not entirely surprisingly) "morelike"! Users
who saw suggestions based on "morelike" were over 20% more likely to click
on one of the suggestions.
Here's a quick analysis and chart of the data from the last 10 days:
https://docs.google.com/spreadsheets/d/1BFsrAcPgexQyNVemmJ3k3IX5rtPvJ_5vdYOyGgS5R6Y/edit?usp=sharing
-Dmitry
Heyo, Discovery team!
(Analytics CCd)
This is just a quick writeup of the Scalable Event Systems meeting
that Erik, Dan, Stas and I went to (although just from my
perspective).
For people not in the initial thread, this is a proposal to replace
the internal architecture of EventLogging and similar services with
Apache Kafka brokers
(http://www.confluent.io/blog/stream-data-platform-1/ ). What that
means in practice is that the current 1-2k events/second limit on
EventLogging will disappear and we can stop worrying about sampling
and accidentally bringing down the system. We can be a lot less
cautious about our schemas and a lot less cautious about our sampling
rate!
It also opens up a lot of opportunities around streaming data and
making it available in a layered fashion - while we don't want to
explore that right now, it's nice to have as an option for
when we better understand our search data and how we can safely
distribute it.
I'd like to thank the Analytics team, particularly Andrew, for putting
this together; it was a super-helpful discussion to be in and this
sort of product is precisely what I, at least, have been hoping for
out of the AnEng brain trust. Full speed ahead!
--
Oliver Keyes
Count Logula
Wikimedia Foundation