I've just finished my write-up on optimizing the set of languages that
could eventually be used for language detection on French Wikipedia.
(Spanish, Italian, and German are still to come.)
The full write-up covers corpus creation and clean-up, performance
stats, and more.
Briefly, about 15% of "low-performing" queries (those with fewer than 3
results) are easily filtered junk, and 65% of the remainder are not in any
identifiable language (names, acronyms, more junk, etc.).
Based on a sample of 682 poor-performing queries on frwiki that are in some
language, about 70% are in French, 10-15% are in English, about 7-12% are
in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there
are a handful of other languages present.
Because of the relatively low percentage of low-performing queries that are
relevant, we will still need to run an A/B test before discussing deploying
this to frwiki. An A/B test on enwiki
<https://phabricator.wikimedia.org/T121542> is in the works at the moment.
The optimal settings for frwiki, based on these experiments, would be to
use the TextCat query-based models for French, English, Arabic, Russian,
Chinese, Thai, Greek, Armenian, Hebrew, and Korean (fr, en, ar, ru, zh, th,
el, hy, he, ko), using the default 3000-ngram models.
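For anyone curious how the n-gram models above are actually applied, here is a minimal sketch of the rank-order n-gram approach TextCat is based on (Cavnar & Trenkle's "out-of-place" measure). This is a toy illustration, not the Perl or PHP implementation; the function names and the training sentences used to build profiles are invented for the example.

```python
from collections import Counter

def ngram_profile(text, max_len=3, top=3000):
    """Build a ranked character n-gram profile (TextCat-style)."""
    counts = Counter()
    for word in text.lower().split():
        token = f"_{word}_"  # pad word boundaries
        for n in range(1, max_len + 1):
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    # keep only the `top` most frequent n-grams, in rank order
    return [g for g, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Distance = sum of rank differences; unseen n-grams get a max penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank.get(g, penalty)) for i, g in enumerate(doc_profile))

def detect(text, profiles):
    """Return the language whose profile is nearest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice the language profiles are trained on much larger corpora (the query-based models mentioned above are trained on query strings rather than wiki text), and only the top 3000 n-grams per language are kept.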
Software Engineer, Discovery
A few updates on the work of the Discovery team this week. Thanks to those
who sent something in, and thank you for reading.
* David, Erik, and Trey had a discussion with JustinO about improving
recall and improving search in general. Semi-readable notes are in an
etherpad <https://etherpad.wikimedia.org/p/Recall>. Additional thoughts and
comments are welcome.
* Deb and Moiz had discussions with Abbey, Daisy and Edward about upcoming
surveys planned for the Wikipedia portal; awaiting Legal approval.
* The updated Perl version of TextCat is now available on GitHub
<https://github.com/Trey314159/TextCat>. Reminder: the PHP version
<https://github.com/wikimedia/wikimedia-textcat> has been available for a
while.
* Analysis has been completed
on low-performing [search] queries (< 3 results) on French, Spanish, and
Italian Wikipedias, to optimize performance on language identification with
TextCat on those wikis. German is coming up next.
* Analysis is complete on the recent Wikipedia.org portal page A/B test:
Wikipedia Portal Test of Language Detection and Primary Link Resorting.
* Maps now has 16 varnish servers (up from 2), spread across 4 different
data centers (up from 1).
* A new Wikipedia portal A/B test will be released next week - it will add
in descriptive text to the sister project links at the bottom of the page.
* The search for a new analyst (Oliver's replacement) is going well. We
received 169 applications, and some applicants have completed or are
scheduled to complete the take-home analysis task.
Feedback and suggestions on this weekly update are welcome.
The full update, and archive of past updates, can be found on Mediawiki.org:
Community Liaison - Discovery
Forwarding this email, which accidentally went to the internal Discovery
list. We decided on "Discernatron". :-)
---------- Forwarded message ----------
From: Erik Bernhardson <ebernhardson(a)wikimedia.org>
Date: 21 April 2016 at 14:18
Subject: Re: [discovery-private] Fwd: [discovery] Lets play Name That Thing!
To: Internal communications for WMF search and discovery team <
Seems like Discernatron is the winner! I've created the repo at
https://gerrit.wikimedia.org/r/wikimedia/discovery/discernatron and pushed
the current state of the code there. I'll update any references in the code
base and get a new version (with lots of other small updates I made
yesterday as well) up sometime today.
We also have a filtered set of queries that Trey and I agreed on, and
OAuth credentials to use with meta.mediawiki.org for logins. One of the
last sticking points for getting this pushed out is coming up with good
instructions for users so they give us good information. Still debating :S
I'll initially put this up today or tomorrow with a dozen queries we use
for testing and ask people to try it out and let me know what can be fixed
for a roll out + small announcement next week.
Lead Product Manager, Discovery
I'll be in the "Writing a Self-Review" workshop today. You said so
many good things about it that I can't miss it. It conflicts with our
standup, so here is my status:
* The codfw elasticsearch cluster behaved erratically yesterday.
This is probably related to a copy/paste error (mine) in the
unicast configuration (I mixed up eqiad and codfw). It does not
completely explain why we started seeing issues only after the change
had been deployed to all but one server for > 12 hours. In the end, a full
cluster restart fixed the issue, but we still don't really understand
the root cause.
* Started the restart of the eqiad elasticsearch cluster (please all cross
your fingers).
* WDQS data reload: in progress. I did not check properly that the
latest version was deployed before starting the data import, so we
actually loaded it with the wrong version (so geo indexing is not yet
enabled). Another data reload will be required (already in progress on
wdqs1002; will do it afterward on wdqs1001).
* Spent quite some time going through all the Phabricator issues I am
subscribed to and my Gerrit changes, and did some cleaning.
* new elasticsearch servers are almost ready to be installed, I spent
some time understanding how installing them actually works. Will
probably start that tonight or tomorrow.
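For reference, the unicast setting behind the mixup mentioned above lives in elasticsearch.yml. A sketch of what the codfw side should look like, with hypothetical hostnames (the error was effectively listing eqiad hosts here instead):

```yaml
# elasticsearch.yml on a codfw node (hostnames are hypothetical)
# Each cluster should seed discovery from hosts in its *own* data center;
# pasting the eqiad host list here is the kind of mixup described above.
discovery.zen.ping.unicast.hosts:
  - elastic2001.codfw.wmnet
  - elastic2002.codfw.wmnet
  - elastic2003.codfw.wmnet
```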
Operations Engineer, Discovery
I finally spent the night and most of the day at the hospital with
Oscar. He is feeling much better now, but I'm mostly useless for any
kind of work. I'll be back on Monday...
Operations Engineer, Discovery
As some of you may be aware, I've been working on a judgement platform to
collect human judgements of search results, which we will feed into the
relevance forge. This will live at https://relevance.wmflabs.org. Right
now it's called 'WikiMedia Search Result Scorer', which feels pretty blah,
so I'm opening it up for suggestions! Hoping to push this out next week.
In yesterday's sprint planning meeting for Search, with a lot of guidance
from Erik, I organised a bunch of tasks related to our quarterly goal to
upgrade to Elasticsearch 2.3. This email summarises these tasks.
The overarching epic is T133120, "Wikimedia search cluster to use
Elasticsearch 2.3". There are a lot of
subtasks of this epic; take a look at the task description and blocking
tasks for more information. This epic blocks T133119
<https://phabricator.wikimedia.org/T133119> "Upgrade completion suggester
so that its index updates in real time", which is the second part of the
quarterly goal.
Hopefully, everything should be clear and organised. I've laid all the
tasks out in the "This Quarter" column of the new discovery-search-backlog
project <https://phabricator.wikimedia.org/tag/discovery-search-backlog/>.
Let me
know if there are any questions, and I'd be happy to answer them.
Lead Product Manager, Discovery