Totally unrelated to my previous email, I promise. This is just me
writing down my thinking on how A/B testing works, and how it applies
to the portal (www.wikipedia.org) experiments and the schema we have
deployed there.
A/B testing is a common way of identifying whether a proposed change to a
piece of software is actually an improvement: it consists of
taking a sample of users and dividing them into two groups, the "A"
and "B" groups (hence the name). One group is consistently given the
experimental change (the "test" group). One group is consistently
given the default experience (the "control" group). Users are
pseudorandomly sorted into each group, so that both groups are even.
The end outcome for both groups is compared, and the change is
successful if users in the test group are statistically significantly
more likely to experience a better outcome than the users in the
control group.
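To make that last point concrete, here is a minimal sketch (Python; the
counts are made up purely for illustration) of the kind of comparison
involved - a two-proportion z-test on the share of users in each group who
hit the "good" outcome:

    from math import sqrt
    from statistics import NormalDist

    def two_proportion_z_test(good_control, n_control, good_test, n_test):
        """One-sided test of whether the test group's outcome rate is higher
        than the control group's."""
        p_control = good_control / n_control
        p_test = good_test / n_test
        # Pooled rate under the null hypothesis that the two groups are the same.
        pooled = (good_control + good_test) / (n_control + n_test)
        se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_test))
        z = (p_test - p_control) / se
        p_value = 1 - NormalDist().cdf(z)
        return z, p_value

    # Hypothetical counts: 500 good outcomes from 10,000 control users versus
    # 560 from 10,000 test users.
    z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
    print(f"z = {z:.2f}, one-sided p = {p:.3f}")  # "significant" if p < 0.05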
We put together the schema for the Portal after months of experimenting
with the Cirrus A/B tests, so we tried to structure it to take into account
the lessons we learned there. We found that things were simpler the more
fields you had, and that maintaining a base population that was not
participating in any tests was ideal for dashboarding. Accordingly, the
schema tracks every KPI we care about for the portal and contains a "cohort"
field that indicates whether someone is in the "A" group, the "B" group, or
no group whatsoever - with the idea that most users at any one time would be
in /no/ group, and we could rely on that population for dashboarding! That
way we can handle everything with one schema.
So the things to remember when setting up Portal tests:
1. The test and control groups should be even;
2. The test and control groups should (together) make up a very small
chunk of the total population receiving the logging - say, 10% combined;
3. The test and control groups should both be represented with "cohort"
values, with nothing (to produce a MySQL NULL) for the rest of the
population.
That way we can both test and dashboard simultaneously.
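To make the bucketing concrete, here is a minimal sketch (Python; the
hashing scheme, the 10% rate, and the cohort names are illustrative
assumptions, not the actual portal code):

    import hashlib
    from typing import Optional

    def assign_cohort(session_id: str, sampling_rate: float = 0.10) -> Optional[str]:
        """Pseudorandomly assign a session to cohort "a", cohort "b", or no
        cohort at all.  Hashing the session ID gives a stable value in [0, 1),
        so the same session always lands in the same bucket."""
        digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) / 16 ** len(digest)
        if bucket < sampling_rate / 2:
            return "a"   # test group
        if bucket < sampling_rate:
            return "b"   # control group
        return None      # everyone else: no cohort, i.e. NULL in the table

    # Roughly 5% of sessions end up in "a", 5% in "b", and 90% in neither -
    # that last population is what the dashboards run off.
    print(assign_cohort("f3a1c9d2-example-session"))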
--
Oliver Keyes
Count Logula
Wikimedia Foundation
The title is mostly to get your attention; you know I like A/B
testing. With that being said:
For a quarter and a bit we've been running A/B tests. Doing so has
been intensely time-consuming for both engineering and analysis, and
at times it's felt like we're pushing changes out just to test them,
rather than because we have reason to believe there will be dramatic
improvements.
These tests have produced, at best, mixed results. Many of the tests
have not shown a substantial improvement in the metric we have been
testing - the zero results rate (ZRR). Those that have shown an improvement
have not been deployed further, because we cannot, from the ZRR alone, test
the _utility_ of the produced results: for that we need to A/B test against
clickthroughs, or a satisfaction metric.
So where do we go from here?
In my mind, the ideal is that we stop A/B testing against the zero results rate.
This doesn't mean we stop testing improvements: this means we build
the relevance lab up and out and test the zero results rate against
/that/. ZRR does not need user participation, it needs the
participation of user *queries*: with the relevance lab we can consume
user queries and test ideas against them at a fraction of the cost of
a full A/B test.
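As a rough sketch of what that looks like (Python; run_query here stands in
for whatever replay mechanism the relevance lab ends up exposing, which is an
assumption on my part):

    def zero_results_rate(queries, run_query):
        """Fraction of queries for which the search backend returns nothing.

        run_query(q) is assumed to return the list of results for query q
        under whichever configuration is being evaluated."""
        zero = sum(1 for q in queries if len(run_query(q)) == 0)
        return zero / len(queries)

    # Replay the same sample of user queries against the current config and a
    # candidate config, and compare the two rates offline - no live users needed:
    #   baseline_zrr  = zero_results_rate(sampled_queries, run_query_baseline)
    #   candidate_zrr = zero_results_rate(sampled_queries, run_query_candidate)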
Instead, we use the A/B tests for the other component: the utility
component. If something passes the Relevance Lab ring of fire, we A/B
test it against clickthroughs: this will be rarer than "every two
weeks" and so we can afford to spend some time making sure the test is
A+ scientifically, and all our ducks are in a row.
The result will be not only better tests, but a better impact for
users, because we will actually be able to deploy the improvements we
have worked on - something that has thus far escaped us due to
attention being focused on deploying More Tests rather than completely
validating the ones we have already deployed.
Thoughts?
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Hi,
I was playing with the SPARQL query examples. The query 'Whose birthday is
today
<https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Whos…>'
always ends up in a QueryTimeoutException (Query deadline is expired).
If I change the birthdate property P569 to the death date property P570,
the query runs without throwing an exception.
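For reference, here is a minimal sketch of reproducing this from a script
(Python; it assumes the public endpoint at https://query.wikidata.org/sparql
and uses a deliberately simplified query, not the exact example from the wiki
page):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"  # assumed public WDQS endpoint

    # Deliberately simplified: just fetch a few birth dates (P569).  The full
    # "whose birthday is today" example filters month and day across every
    # P569 value, which appears to be what runs into the query deadline.
    QUERY = """
    SELECT ?person ?birth WHERE {
      ?person wdt:P569 ?birth .
    }
    LIMIT 10
    """

    response = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                            timeout=60)
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        print(row["person"]["value"], row["birth"]["value"])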
Is there something I can do to get the birthday query to execute successfully?
Thank you!
Mathieu
Hey everyone,
I originally started this thread on the internal mailing list, but I
should've shared it on the public list. There have been a few replies, so
I'll try to summarize everything so far and we can continue discussion
here. For those following the old thread, new stuff starts at "*New Stuff!*"
below. Sorry for the mixup.
I wanted to share some good news on the language detection front. I've been
messing around with TextCat, an n-gram-based language ID tool that's been
around for a good while. (The version I've been using is in Really Old
Perl, but as David pointed out, it's been ported to other languages (see
below) and it would be easy for us to port to Python or PHP.)
I trained fresh models on 59 languages based on queries (not random
non-query text, but fairly messy data), and made some improvements to the
code (making it Unicode-aware, making it run much faster, and improving the
"unknown n-gram" penalty logic) and ran lots of tests with different model
sizes. I got decent improvements over baseline TextCat.
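For anyone who hasn't looked inside TextCat, here is a rough sketch of the
profile-building step (Python; a simplification of the classic Cavnar &
Trenkle approach, not the exact code I've been running):

    from collections import Counter

    def build_profile(text, max_ngrams=5000):
        """Build a TextCat-style language profile: the most frequent character
        n-grams (lengths 1-5), ranked by frequency.  Words are padded with
        underscores so the n-grams capture word boundaries."""
        counts = Counter()
        for word in text.lower().split():
            padded = "_" + word + "_"
            for n in range(1, 6):
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
        return [ngram for ngram, _ in counts.most_common(max_ngrams)]

    # One profile per language, trained from the (messy) query text:
    #   models = {lang: build_profile(training_text[lang]) for lang in languages}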
After doing some analysis on my English Wikipedia sample, I limited the
languages to those that are really good (Thai pretty much only looks like
Thai) and really necessary (English, of course, and Spanish). I dropped
languages that performed poorly and wouldn't help much anyway (Igbo doesn't
come up a lot on enwiki), and dropped some nice-to-have languages that
performed too poorly (French, alas). I got down to 15 relevant languages
for enwiki: Arabic, Bulgarian, Bengali, Greek, English, Spanish, Farsi,
Hindi, Japanese, Korean, Portuguese, Russian, Tamil, Thai, and Chinese.
The improvement over the ES plugin baseline (with spaces) is phenomenal.
Recall doubled, and precision went up by a third. F0.5 is my preferred
measure, but all these are waaaay better:
         f0.5    f1      f2      recall  prec
ES-sp    54.4%   47.4%   41.9%   39.0%   60.4%
ES-thr   69.5%   51.6%   41.1%   36.1%   90.3%
TC-lim   83.3%   83.3%   83.3%   83.4%   83.2%
• ES-sp is the ES-plugin with spaces added at the beginning and ending of
each string.
• ES-thr is the ES-plugin (with spaces) with optimized thresholds for each
language (more fine grained than choosing languages, and over trained
because the optimization data and test data is the same, and brittle
because the sample for many languages is very small). Also, this method
targets precision, and in theory could improve recall a tiny bit, but in
practice took it down a bit.
• TC-lim is TextCat, limiting the languages to the relevant 15 (instead of
all 59).
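For reference, the scores above are the usual F-beta combination of precision
and recall (beta < 1 weights precision more heavily, beta > 1 weights recall
more heavily); a quick sketch (Python) that reproduces a table row, up to
rounding of the reported percentages:

    def f_beta(precision, recall, beta):
        """Weighted harmonic mean of precision and recall."""
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # ES-sp row, computed from its recall and precision columns:
    p, r = 0.604, 0.390
    print(f"f0.5={f_beta(p, r, 0.5):.1%}  f1={f_beta(p, r, 1):.1%}  f2={f_beta(p, r, 2):.1%}")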
I still have some other things to try (equalizing training set sizes,
manually selected training sets, trying wiki-text training sets), and a
more formal write up is coming, so this is just a preview for now.
Generally, I'm just really glad to get some improved results!
If we want to try an A/B test with this, we'll also need to turn it into
something ES can use. Stas noted that it doesn't have to be hosted by ES -
since we have a separate call to language detection (which goes to ES now,
but can be anything), it can be implemented separately if we wanted to.
David found several other implementations in Java and PHP:
> There are many Textcat implementations:
> 2 JAVA: http://textcat.sourceforge.net/ and
> http://www.jedi.be/pages/JTextCat/
> 1 PHP extension: https://pecl.php.net/package/TextCat
> and certainly others
>
> I'm wondering why the PHP implementation is done via a C extension... Maybe
> the initialisation steps (parsing ngrams stat files) are not suited for PHP
> scripts?
David also pointed out that TextCat uses somewhat the same technique used
by Cybozu (ES plugin), so maybe we could try to train it with the same
dataset used for TextCat and generate custom profiles. We could reuse the
same PHP code in Cirrus. There's a question as to whether it will be possible
to implement the new "unknown n-gram" penalty logic. And unfortunately the
ES plugin does not support loading multiple profiles, making it hard to run
A/B tests. So we will have to write some code anyway...
And David noted that solutions based on scripts will suffer from loading
the ngram files for each request, and asked about overhead for the current
models.
Erik noted that if we are going to rewrite it in something, a PHP library
would make the most sense for reuse in our application (being PHP and all).
Otherwise, we could tie the existing Perl together with a Perl based HTTP
server and set it up somewhere in the cluster, perhaps the services misc
cluster would make sense. The content translation team was also interested
in the work we are doing with language detection and may find a
microservice more accessible than a PHP library.
As Erik pointed out, the proof is in the pudding, so we'll have to see how
this works out when we start sending live user data to it.
*New Stuff!*
• Stas pointed out that it's surprising that Bulgarian is on the short list
because it's so similar to Russian. Actually Bulgarian, Spanish, and
Portuguese aren't great (40%-55% F0.5 score) but they weren't obviously
actively causing problems, like French and Igbo were.
• I should have gone looking for newer implementations of TextCat, like
David did. It is pretty simple code. But that also means that using and
modifying another implementation or porting our own should be easy. The
unknown n-gram penalty fix was pretty small—using the model size instead of
the incoming sample size as the penalty; there's a rough sketch of the
scoring logic after this list. (More detail on that in my write-up.)
• I'm not 100% sure whether using actual queries as training helped (I
thought it would, which is why I did it), but the training data is still
pretty messy, so depending on the exact training method, retraining Cybozu
could be doable. The current ES plugin was a black box to me—I didn't even
know it was Cybozu. Anyone know where the code lives, or want to volunteer
to figure out how to retrain it? (Or, additionally, turning off the
not-so-useful models within it for testing on enwiki.)
• I modified TextCat to load all the models at once and hold them in memory
(the previous implementation was noted in the code to be terrible for
line-by-line processing because it would load each model individually for
each query processed—it was the simplest hack possible to make it run in
line-by-line mode). The overhead isn't much. 3.2MB for 59 models with 5000
n-grams. Models seem to be around 70KB, but range from 45K to 80K. The 15
languages I used are less than 1MB. Right now, a model with 3000 n-grams
seems to be the best, so the overhead would go down by ~40% (not all
n-grams are the same size, so it might be more). In short, not too much
overhead.
• I think it would make sense to set this up as something that can keep the
models in memory. I don't know enough about our PHP architecture to know if
you can init a plugin and then keep it in memory for the duration. Seems
plausible though. A service of some sort (doesn't have to be Perl-based)
would also work. We need to think through the architectural bits.
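Here is the rough sketch of the scoring logic I mentioned above (Python; a
simplification of the classic out-of-place rank distance with the penalty fix
applied, not the exact code):

    def rank_distance(sample_ngrams, model_ngrams):
        """Out-of-place distance between a sample profile and a language model.

        For each n-gram in the sample, add the difference between its rank in
        the sample and its rank in the model.  If the n-gram isn't in the
        model at all, add a fixed penalty: the model size (the fix described
        above) rather than the incoming sample size."""
        model_rank = {ng: i for i, ng in enumerate(model_ngrams)}
        penalty = len(model_ngrams)
        distance = 0
        for i, ng in enumerate(sample_ngrams):
            distance += abs(i - model_rank[ng]) if ng in model_rank else penalty
        return distance

    def detect_language(sample_ngrams, models):
        """Pick the language whose model is closest to the query's n-gram
        profile (built with the same profile builder as in the earlier sketch)."""
        return min(models, key=lambda lang: rank_distance(sample_ngrams, models[lang]))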
More questions and comments are welcome! I'm going to spend a few more days
trying out other variations on models and training data now that I have an
evaluation pipeline set up.
There's also some work involved in using this on other wikis. The best
results come from using the detectors that are most useful for the given
wiki. For example, there really is no point in detecting Igbo on enwiki
because Igbo queries are very rare, and Igbo incorrectly grabs a percentage
of English queries, and it's extra overhead to use it. On the Igbo
wiki—which made it onto my radar for getting 40K+ queries in a day—an Igbo
model would obviously be useful. (A quick check just now also uncovered the
biggest problem—many if not most queries to the Igbo wiki are in English.
Cleaner training data will definitely be needed for some languages.)
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Hey all,
Very excited to share that our first A/B test is deployed on the beta wiki; if
everything goes alright, it will be up on production on Monday.
Go try it out for yourself: http://www.wikipedia.beta.wmflabs.org/#pab1
Special thanks to the team! This is a big step and I'm so proud of everyone
for coming together and getting all the blockers out of the way. Now let's
push this out to production on Monday!
Thank you!
Moiz