After extensive testing over the last several months of a new search query scoring method called BM25 (Best Matching) [1], we recently completed a limited production release for the following top languages: English, German, Spanish, Russian, Portuguese, French, Italian, Polish, Dutch and Arabic. BM25 replaces the older scoring method, tf-idf (term frequency-inverse document frequency) [2].
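
For readers unfamiliar with the two scoring methods, here is a minimal sketch of the textbook per-term formulas. It is not the code running in production. The function names, the raw-count tf-idf variant, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math


def idf(num_docs, doc_freq):
    # Inverse document frequency: terms found in fewer documents weigh more.
    # The "+ 1" inside the log keeps the value positive (a common variant).
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def tfidf_term_score(tf, num_docs, doc_freq):
    # Simplified tf-idf: the score grows linearly with the term count.
    return tf * idf(num_docs, doc_freq)


def bm25_term_score(tf, num_docs, doc_freq, doc_len, avg_doc_len,
                    k1=1.2, b=0.75):
    # BM25: the term count saturates (controlled by k1) and is normalized
    # by document length relative to the average (controlled by b).
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(num_docs, doc_freq) * tf * (k1 + 1) / (tf + norm)
```

Under these definitions, a term repeated 10 times scores 10 times as high with raw tf-idf, while the BM25 score levels off below idf * (k1 + 1) and is lower for longer documents.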

We have additional testing to do [3,4] to determine whether BM25 will work in languages that don't use spaces between words, e.g., Japanese and Chinese.

The Discovery team announces much of our completed work in weekly status updates [5, 6], but some of that work isn't visible to anyone using our search engine, because it doesn't go 'live' until a complete re-index of the servers occurs. We've created a recurring ticket in Phabricator [7] to keep track of the work that goes live in production after a re-index, such as the one we've also just completed. A few highlights of the recent re-index are implementing ASCII folding for the French language and fixing several bugs for French ÿ and Russian 'Е' and 'Ё' when those characters are entered in a search query.
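
As a rough illustration of what folding means for a query, the sketch below strips diacritics using Python's standard unicodedata module. It is only an approximation for explanation: the function name is hypothetical, and this is not the filter used by our search engine.

```python
import unicodedata


def fold_diacritics(text):
    # Decompose each character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into its canonical form.
    return unicodedata.normalize("NFC", stripped)


print(fold_diacritics("L'Haÿ-les-Roses"))
```

With folding applied to both the indexed text and the query, a search typed without the diacritic can match the accented title. A full folding filter would cover more than this sketch does; for example, the ligature œ has no Unicode decomposition and passes through unchanged here.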

Cheers from the Discovery Search Team!


[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] https://en.wikipedia.org/wiki/Tf%E2%80%93idf
[3] https://phabricator.wikimedia.org/T147495
[4] https://phabricator.wikimedia.org/T147501
[5] https://www.mediawiki.org/wiki/Wikimedia_Discovery#Updates
[6] https://www.mediawiki.org/wiki/Discovery/Status_updates
[7] https://phabricator.wikimedia.org/T147505


--
deb tankersley
Product Manager, Discovery
irc: debt
Wikimedia Foundation