Indeed, from time to time I have to read the lsearch2 code to understand what was done before Cirrus was deployed. Concerning Russian, I think we do; apparently lsearchd used a simple wrapper around the Lucene Russian stemmer [1]. If there is other custom code, or if you are aware of any regressions, I'd appreciate some links so we can track them. I remember having seen some code (JS gadgets?) that does custom Russian stemming...
Concerning Hebrew, I hope we can find a good analyzer. According to the comments in the code, the Hebrew analyzer that was tested appeared to be unstable and was disabled. I hope things are in better shape now; that is the whole purpose of this new goal: allocating some "official" bandwidth to fixing and improving language analyzers.
One of the problems we will have to address is the maintainability of all these language analyzers. We decided to start with Polish because one of its analyzers is supported by Elastic itself, which is a guarantee for us that the code will always be kept up to date. There are many analyzers out there, but too frequently the code is unmaintained or too custom to be properly integrated into our stack.
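For reference, the Elastic-supported Polish analyzer lives in the analysis-stempel plugin (based on Lucene's Stempel stemmer). Once the plugin is installed, using it is just a matter of referencing the "polish" analyzer in the index mapping. A minimal sketch of such a mapping body (the index/field names here are purely illustrative, and the exact mapping syntax depends on the Elasticsearch version deployed):

```json
{
  "mappings": {
    "page": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "polish"
        }
      }
    }
  }
}
```

Because the plugin is maintained by Elastic alongside each release, upgrades should not require us to carry any custom analyzer code, which is exactly the maintainability property mentioned above.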
[1] https://github.com/wikimedia/operations-debs-lucene-search-2/blob/master/src...
On Wed, Jan 4, 2017 at 9:28 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Did we ever look into whether we managed to address all that the custom Lucene code used to do, especially for Russian? https://wikitech.wikimedia.org/wiki/Search/2013#Search_details_.28Java.29
While we're at it, perhaps Hebrew's tokenization can be improved: https://phabricator.wikimedia.org/T154348#2912086
Starting with Polish makes sense, however.
Nemo
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery