I've just finished my write-up on optimizing the set of languages that
could eventually be used for language detection on French Wikipedia.
(Spanish, Italian, and German are still to come.)
The full write-up covers corpus creation and clean-up, performance
stats, and more.
Briefly, about 15% of "low-performing" queries (those with fewer than 3
results) are easily filtered junk, and 65% of the remainder are not in any
identifiable language (names, acronyms, more junk, etc.).
Based on a sample of 682 poor-performing queries on frwiki that are in some
language, about 70% are in French, 10-15% are in English, about 7-12% are
in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there
are a handful of other languages present.
Because of the relatively low percentage of low-performing queries that are
relevant, we will still need to run an A/B test before discussing deploying
this to frwiki. An A/B test on enwiki
<https://phabricator.wikimedia.org/T121542> is in the works at the moment.
The optimal settings for frwiki, based on these experiments, would be to
use the TextCat query-based models for French, English, Arabic, Russian,
Chinese, Thai, Greek, Armenian, Hebrew, and Korean (fr, en, ar, ru, zh, th,
el, hy, he, ko), using the default 3000-ngram models.
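For anyone curious how the n-gram models above are actually applied, here is a minimal sketch of the rank-order n-gram approach TextCat is based on (Cavnar & Trenkle's "out-of-place" measure). This is a toy illustration, not the Perl or PHP implementation; the function names and the training sentences used to build profiles are invented for the example.

```python
from collections import Counter

def ngram_profile(text, max_len=3, top=3000):
    """Build a ranked character n-gram profile (TextCat-style)."""
    counts = Counter()
    for word in text.lower().split():
        token = f"_{word}_"  # pad word boundaries
        for n in range(1, max_len + 1):
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    # keep only the `top` most frequent n-grams, in rank order
    return [g for g, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Distance = sum of rank differences; unseen n-grams get a max penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank.get(g, penalty)) for i, g in enumerate(doc_profile))

def detect(text, profiles):
    """Return the language whose profile is nearest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice the language profiles are trained on much larger corpora (the query-based models mentioned above are trained on query strings rather than wiki text), and only the top 3000 n-grams per language are kept.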
Software Engineer, Discovery
A few updates on the work of the Discovery team this week. Thanks to those
who sent something in, and thank you for reading.
* David, Erik, and Trey had a discussion with JustinO about improving
recall and improving search in general. Semi-readable notes are in an
etherpad <https://etherpad.wikimedia.org/p/Recall>. Additional thoughts and
comments are welcome.
* Deb and Moiz had discussions with Abbey, Daisy and Edward about upcoming
surveys planned for the Wikipedia portal; awaiting Legal approval.
* The updated Perl version of TextCat is now available on GitHub
<https://github.com/Trey314159/TextCat>. Reminder: the PHP version
<https://github.com/wikimedia/wikimedia-textcat> has been available for a
while.
* Analysis has been completed
on low-performing [search] queries (< 3 results) on French, Spanish, and
Italian Wikipedias, to optimize performance on language identification with
TextCat on those wikis. German is coming up next.
* Analysis is complete on the recent Wikipedia.org portal page A/B test:
Wikipedia Portal Test of Language Detection and Primary Link Resorting.
* Maps now has 16 varnish servers (up from 2), spread across 4 different
data centers (up from 1).
* A new Wikipedia portal A/B test will be released next week - it will add
in descriptive text to the sister project links at the bottom of the page.
* The search for a new analyst (Oliver's replacement) is going well. We
received 169 applications, and some applicants have completed or are
scheduled to complete the take-home analysis task.
Feedback and suggestions on this weekly update are welcome.
The full update, and archive of past updates, can be found on Mediawiki.org:
Community Liaison - Discovery
Forwarding this email, which accidentally went to the internal Discovery
list. We decided on "Discernatron". :-)
---------- Forwarded message ----------
From: Erik Bernhardson <ebernhardson(a)wikimedia.org>
Date: 21 April 2016 at 14:18
Subject: Re: [discovery-private] Fwd: [discovery] Lets play Name That Thing!
To: Internal communications for WMF search and discovery team <
Seems like Discernatron is the winner! I've created the repo at
https://gerrit.wikimedia.org/r/wikimedia/discovery/discernatron and pushed
the current state of the code there. I'll update any references in the code
base and get a new version (with lots of other small updates I made
yesterday as well) up sometime today.
We also have a filtered set of queries that Trey and I agreed on, and
OAuth credentials to use with meta.mediawiki.org for logins. One of the
last sticking points for getting this pushed out is coming up with good
instructions for users so they give us good information. Still debating :S
I'll initially put this up today or tomorrow with a dozen queries we use
for testing and ask people to try it out and let me know what can be fixed
for a roll out + small announcement next week.
Lead Product Manager, Discovery
I'll be in the "Writing a Self-Review" workshop today. You said so
many good things about it that I can't miss it. It conflicts with our
standup, so here is my status:
* The codfw elasticsearch cluster behaved erratically yesterday.
This is probably related to a copy/paste error (mine) in the
unicast configuration (I mixed up eqiad and codfw). It does not
completely explain why we started seeing issues only after the change
had been deployed to all but one server for > 12 hours. In the end, a full
cluster restart fixed the issue, but we still don't really understand
the root cause.
* Started the restart of the eqiad elasticsearch cluster (please all cross
your fingers).
* WDQS data reload: in progress. I did not check properly that the
latest version was deployed before starting the data import, so we
actually loaded it with the wrong version (so geo indexing is not yet
enabled). Another data reload will be required (already in progress on
wdqs1002; will do it afterward on wdqs1001).
* Spent quite some time going through all the Phabricator issues I am
subscribed to and my Gerrit changes, and did some cleaning.
* new elasticsearch servers are almost ready to be installed, I spent
some time understanding how installing them actually works. Will
probably start that tonight or tomorrow.
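For reference, the unicast setting behind the mixup mentioned above lives in elasticsearch.yml. A sketch of what the codfw side should look like, with hypothetical hostnames (the error was effectively listing eqiad hosts here instead):

```yaml
# elasticsearch.yml on a codfw node (hostnames are hypothetical)
# Each cluster should seed discovery from hosts in its *own* data center;
# pasting the eqiad host list here is the kind of mixup described above.
discovery.zen.ping.unicast.hosts:
  - elastic2001.codfw.wmnet
  - elastic2002.codfw.wmnet
  - elastic2003.codfw.wmnet
```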
Operations Engineer, Discovery
I finally spent the night and most of the day at the hospital with
Oscar. He is feeling much better now, but I'm mostly useless for any
kind of work. I'll be back on Monday...
Operations Engineer, Discovery
As some of you may be aware, I've been working on a judgement platform to
collect human judgements of search results, which we will feed into the
relevance forge. This will live at https://relevance.wmflabs.org. Right
now it's called 'WikiMedia Search Result Scorer', which feels pretty blah,
so I'm opening it up for suggestions! Hoping to push this out next week.
In yesterday's sprint planning meeting for Search, with a lot of guidance
from Erik, I organised a bunch of tasks related to our quarterly goal to
upgrade to Elasticsearch 2.3. This email summarises these tasks.
The overarching epic is T133120, "Wikimedia search cluster to use
Elasticsearch 2.3". There are a lot of
subtasks of this epic; take a look at the task description and blocking
tasks for more information. This epic blocks T133119
<https://phabricator.wikimedia.org/T133119> "Upgrade completion suggester
so that its index updates in real time", which is the second part of the
quarterly goal.
Hopefully, everything should be clear and organised. I've laid all the
tasks out in the "This Quarter" column of the new discovery-search-backlog
project <https://phabricator.wikimedia.org/tag/discovery-search-backlog/>.
Let me
know if there are any questions, and I'd be happy to answer them.
Lead Product Manager, Discovery