Hello!
As part of our goals for Q3 FY 2016-17 https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q3_Goals#Discovery (Jan - Mar 2017), the Search Team will be researching, testing, and deploying new language analysers.
Language analysers are features in Elasticsearch that analyse and alter queries to give users better results. Language analysers perform important functions such as tokenisation https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis), and can also alter queries with language-specific features, such as:
- The English analyser would make the query "john's" also search for "john".
- The German analyser would make the query "äußerst" also search for "ausserst".
These alterations to users' queries improve the relevance of the results, compared to not analysing the queries at all, because they can bring extra documents that may be relevant into the results. Elastic has plenty of documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html if you want to read more about what the language analysers do.
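To make the examples above concrete, here is a toy Python sketch of the two stages an analyser performs: tokenisation, then token filtering. This is only an illustration of the idea, not Elasticsearch's actual implementation — the regex tokeniser and the folding rules below are deliberate simplifications covering just the two examples given.

```python
import re

def simple_analyse(text):
    # Toy analyser: real Elasticsearch analysers chain a tokeniser
    # with configurable token filters; this just mimics the two
    # examples above.
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    out = []
    for token in tokens:
        token = token.lower()
        # English-style possessive stripping: "john's" -> "john".
        token = re.sub(r"'s$", "", token)
        # German-style folding: "äußerst" -> "ausserst".
        token = token.replace("ß", "ss")
        token = token.translate(str.maketrans("äöü", "aou"))
        out.append(token)
    return out
```

With this sketch, `simple_analyse("John's")` yields `["john"]` and `simple_analyse("äußerst")` yields `["ausserst"]`, matching the two bullet points above.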
Some of the criteria we'll be using to evaluate the new analysers are:
- how much better we expect the analyser to be than the one we have
- the maturity and maintainability of the code of the analyser
- flexibility of customisation of the plugin
We'll be testing using our standard search metrics, such as zero results rate, PaulScore https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Glossary#PaulScore, and others.
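For illustration, here is a rough Python sketch of two of these metrics. The zero results rate is simply the fraction of queries that return nothing; for PaulScore this follows one reading of the linked glossary (each clicked result at zero-based position k contributes F^k, averaged over queries), so the exact position indexing and the choice of F here are assumptions, not the team's canonical implementation.

```python
def zero_results_rate(result_counts):
    # result_counts: number of results returned for each query.
    if not result_counts:
        return 0.0
    return sum(1 for n in result_counts if n == 0) / len(result_counts)

def paulscore(clicked_positions_per_query, f=0.5):
    # clicked_positions_per_query: for each query, the (assumed
    # zero-based) result positions the user clicked. Each click at
    # position k contributes f**k; per-query scores are then
    # averaged over all queries.
    if not clicked_positions_per_query:
        return 0.0
    per_query = [sum(f ** k for k in positions)
                 for positions in clicked_positions_per_query]
    return sum(per_query) / len(per_query)
```

A click on the very first result contributes 1.0 regardless of f, so higher scores mean users are clicking nearer the top.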
We'll be starting with Polish, since we already have some ideas for possible new plugins, and that'll allow us to more precisely figure out what criteria we want to use when evaluating the plugin.
As always, if there are any questions, please let me know!
Thanks, Dan
Did we ever look into whether we managed to address all that the custom Lucene code used to do, especially for Russian? https://wikitech.wikimedia.org/wiki/Search/2013#Search_details_.28Java.29
While we're at it, perhaps Hebrew's tokenization can be improved: https://phabricator.wikimedia.org/T154348#2912086
Starting with Polish makes sense, however.
Nemo
Indeed, from time to time I have to read the lsearch2 code to understand what was done before Cirrus was deployed. Concerning Russian, I think we do: apparently lsearchd used a simple wrapper around the Lucene Russian stemmer [1]. If there is other custom code, or if you are aware of any regressions, I'd appreciate some links so we can track them. I remember having seen some code (JS gadgets?) that does some custom Russian stemming...
Concerning Hebrew, I hope we can find a good analyzer; according to the comments in the code, the Hebrew analyzer that was tested appeared to be unstable and was disabled. I hope that things are in better shape now. That is the whole purpose of this new goal: allocating some "official" bandwidth to fixing/improving language analyzers.
One of the problems we will have to address is the maintainability of all these language analyzers. We decided to start with Polish because one of its analyzers is supported by Elastic itself, which is a guarantee for us that the code will always be up to date. There are many analyzers, but too frequently the code is not maintained, or is too custom to be properly integrated into our stack.
[1] https://github.com/wikimedia/operations-debs-lucene-search-2/blob/master/src...
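For readers following along: the Elastic-supported Polish analyzer referred to above ships in the analysis-stempel plugin, installed with `bin/elasticsearch-plugin install analysis-stempel`. A minimal sketch of index settings using it might look like the following — the index layout and the "text" field name are hypothetical, only the `"polish"` analyzer name comes from the plugin.

```python
# Hypothetical mapping enabling the "polish" analyzer that the
# Elastic-maintained analysis-stempel plugin provides. The field
# name is a placeholder; the plugin must already be installed on
# every node in the cluster.
polish_index_settings = {
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "polish",  # provided by analysis-stempel
            }
        }
    }
}
```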
On Wed, Jan 4, 2017 at 9:28 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
David Causse, 05/01/2017 09:36:
Indeed, from time to time I have to read the lsearch2 code to understand what was done before Cirrus was deployed.
:)
Concerning Russian, I think we do: apparently lsearchd used a simple wrapper around the Lucene Russian stemmer [1]. If there is other custom code, or if you are aware of any regressions, I'd appreciate some links so we can track them. I remember having seen some code (JS gadgets?) that does some custom Russian stemming...
I remember seeing some file with long lists of rules for Cyrillic, but maybe it was SerbianFilter.java .
Concerning Hebrew, I hope we can find a good analyzer; according to the comments in the code, the Hebrew analyzer that was tested appeared to be unstable and was disabled.
Ah, makes sense. Thanks!
Nemo