New subject: This quarter: researching new language analysers for search

4 Jan 2017

Hello!

As part of our goals for Q3 FY 2016-17
<https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q3_Goals#Discovery>
(Jan - Mar 2017), the Search Team will be researching, testing, and
deploying new language analysers.

Language analysers are features in Elasticsearch that analyse and alter
queries to give users better results. Language analysers perform important
functions such as tokenisation
<https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)>, and can
also alter queries in language-specific ways. For example:

   - The English analyser would make the query "john's" also search for
   "john".
   - The German analyser would make the query "äußerst" also search for
   "ausserst".
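
To make the two examples above concrete, here is a toy sketch in Python of
what "also search for" means. It is not how Elasticsearch implements its
analysers; the function names (tokenise, expand_english, expand_german,
analyse) and the folding table are assumptions made up for illustration.

```python
import re

# Hypothetical folding table for illustration: maps German umlauts and
# eszett to plain ASCII forms.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def tokenise(text):
    """Lowercase the text and split it into word tokens, keeping apostrophes."""
    return re.findall(r"\w+(?:'\w+)?", text.lower())


def expand_english(token):
    """Return the token plus a variant with a trailing possessive 's removed."""
    variants = {token}
    if token.endswith("'s"):
        variants.add(token[:-2])
    return variants


def expand_german(token):
    """Return the token plus a variant with umlauts and eszett folded."""
    return {token, token.translate(GERMAN_FOLD)}


def analyse(query, expand):
    """Tokenise the query and collect every variant of every token."""
    terms = []
    for token in tokenise(query):
        terms.extend(sorted(expand(token)))
    return terms


print(analyse("John's", expand_english))
print(analyse("äußerst", expand_german))
```

A real analyser typically normalises both the indexed text and the query
to the same form rather than expanding the query, but the effect for the
user is similar: "john's" and "john" end up matching each other.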

These alterations to users' queries improve the relevance of the results
compared to not analysing the queries, because they can bring extra
documents that may be relevant into the results. Elastic has a bunch of
documentation
<https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html>
if you want to read more about what the language analysers do.

Some of the criteria we'll be using to evaluate the new analysers are:

   - how much better we expect the analyser to be than the one we have
   - the maturity and maintainability of the code of the analyser
   - how flexibly the plugin can be customised

We'll be testing using our standard search metrics, such as zero results
rate, PaulScore
<https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary#PaulScore>,
and others.
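
As a rough illustration of the simplest of these metrics, the zero results
rate is the fraction of queries that return no results at all. The sketch
below assumes we already have a list of result counts, one per query; the
function name and the sample numbers are made up for illustration and are
not real data.

```python
def zero_results_rate(result_counts):
    """Return the fraction of queries that returned zero results."""
    if not result_counts:
        raise ValueError("need at least one query to compute a rate")
    zero = sum(1 for count in result_counts if count == 0)
    return zero / len(result_counts)


# Hypothetical sample: four queries, two of which returned nothing.
print(zero_results_rate([12, 0, 3, 0]))  # 0.5
```

A better analyser should push this number down, since more queries will
match at least one document.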

We'll be starting with Polish, since we already have some ideas for
possible new plugins, and that'll help us figure out more precisely
which criteria we want to use when evaluating plugins.

As always, if there are any questions, please let me know!

Thanks,
Dan

-- 
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation