There is another similar article where they tested a different search
engine:
http://www.searchtechnologies.com/querying-indexing-cloudsearch
Some takeaways:
* Considers longer articles more important
* Considers shorter titles more important (e.g. "Germany" vs. "List of
German Corps in World War II")
* Some hand tweaking ended up with the formula: text_relevance +
40.0*log10(content_size) - 15.0*log10(title_size)
* Defined a per-document boost from 0 to 10 based on which namespace a
document belongs to
* Tweaked the formula into: text_relevance + (log10(content_size)*(doc_boost
== 1 ? 25.0 : 40.0)) - (log10(title_size)*15)
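The tweaked formula above can be sketched in Python as follows. This is just an illustrative sketch from the numbers in the post; the function and parameter names are my own, not from the article:

```python
import math

def doc_score(text_relevance, content_size, title_size, doc_boost):
    """Hand-tweaked ranking formula from the post: boost longer articles,
    penalize longer titles, and use a smaller content weight for documents
    with doc_boost == 1 (the per-namespace boost described above)."""
    content_weight = 25.0 if doc_boost == 1 else 40.0
    return (text_relevance
            + content_weight * math.log10(content_size)
            - 15.0 * math.log10(title_size))
```

For example, a document with text_relevance 100, 1000 bytes of content, a 10-character title, and doc_boost 1 scores 100 + 25*3 - 15*1 = 160, while the same document with a higher namespace boost scores 100 + 40*3 - 15*1 = 205.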
On Thu, Jul 7, 2016 at 10:29 AM, Erik Bernhardson <
ebernhardson(a)wikimedia.org> wrote:
Semi-interesting post from Search Technologies (aka
Paul Score) about
indexing Wikipedia data:
http://www.searchtechnologies.com/wikipedia-azure-search
Takeaways:
* Automated entity detection, categorizing into person/place/organization
* Offers search facets by wikipedia category and by entity detection
* Multiple scoring profiles offered, which change the relative weight
between title and description (content? not clear)
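A "scoring profile" of this kind could be sketched as a named set of field weights applied to per-field match scores. The profile names, weights, and function below are all assumptions for illustration, not details from the post:

```python
# Hypothetical scoring profiles: each maps field names to weights.
# Which fields exist (title vs. description vs. content) and the
# actual weight values are not specified in the post.
PROFILES = {
    "title_heavy": {"title": 3.0, "description": 1.0},
    "balanced":    {"title": 1.0, "description": 1.0},
}

def profile_score(profile, field_scores):
    """Combine per-field relevance scores using the chosen profile's
    weights; fields missing from the profile contribute nothing."""
    weights = PROFILES[profile]
    return sum(weights.get(field, 0.0) * score
               for field, score in field_scores.items())
```

Switching profiles then re-ranks the same matches, e.g. a strong title hit outranks a strong description hit under "title_heavy" but not under "balanced".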