There is another similar article where they tested a different search engine: http://www.searchtechnologies.com/querying-indexing-cloudsearch
Some takeaways:
* Considers longer articles more important
* Considers shorter titles more important (e.g. "Germany" vs "List of German Corps in World War II")
* Some hand tweaking ended up with the formula: text_relevance + 40.0*log10(content_size) - 15.0*log10(title_size)
* Defined a per-document boost from 0 to 10 based on which namespace a page belongs to
* Tweaked the formula into: text_relevance + (log10(content_size) * (doc_boost == 1 ? 25.0 : 40.0)) - (log10(title_size) * 15.0); a rough sketch of that in code is below
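For reference, that final tweaked formula works out to roughly the following (just a sketch; text_relevance, content_size, title_size, and doc_boost are whatever values their CloudSearch setup provides, and I'm not sure which namespace the doc_boost == 1 case corresponds to):

    import math

    def cloudsearch_style_score(text_relevance, content_size, title_size, doc_boost):
        # Per their write-up: longer articles rank higher, shorter titles rank
        # higher, and documents with doc_boost == 1 get a smaller
        # content-length multiplier (25 vs 40).
        content_weight = 25.0 if doc_boost == 1 else 40.0
        return (text_relevance
                + content_weight * math.log10(max(content_size, 1))
                - 15.0 * math.log10(max(title_size, 1)))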
On Thu, Jul 7, 2016 at 10:29 AM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
Semi-interesting post from Search Technologies (aka Paul Score) about indexing Wikipedia data: http://www.searchtechnologies.com/wikipedia-azure-search
Takeaways:
- Automated entity detection, categorizing into person/place/organization
- Offers search facets by Wikipedia category and by detected entities
- Multiple scoring profiles offered which change the relative weight between title and description (content? not clear); a guess at what such a profile might look like is below
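If they're using stock Azure Search scoring profiles, that title/description weighting would live in the index schema as something like the snippet below. This is only my guess at their setup; the field names ("title", "content") and the weights are assumptions, not taken from their post.

    # Sketch of an Azure Search index definition with two scoring profiles
    # that shift weight between the title and content fields.
    index_definition = {
        "name": "wikipedia",
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
        ],
        "scoringProfiles": [
            {"name": "titleHeavy", "text": {"weights": {"title": 5.0, "content": 1.0}}},
            {"name": "contentHeavy", "text": {"weights": {"title": 1.0, "content": 3.0}}},
        ],
    }
    # At query time you pick one via the scoringProfile query parameter,
    # e.g. scoringProfile=titleHeavy.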