On 26/01/2016 11:20, billinghurst wrote:
I would think that 1/4 of searches for "po..." are not for pornhub, though I am not aware that such data is available.
Yes, that is the main problem we have today: the score is computed from document metadata (size, templates, headings, incoming_links... and now pageviews). Search usage is not part of the score; we suggest pages, not search queries.
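To make the idea concrete, here is a minimal sketch of a score built only from document metadata. It is not the actual formula (that one is described on the wikitech page[1]); the weights, caps, and the `log_scale` and `score` helpers are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical weights (summing to 1.0) and saturation caps.
# These are NOT the real params; they only illustrate the shape of the idea.
WEIGHTS = {
    "size": 0.1,
    "templates": 0.1,
    "headings": 0.1,
    "incoming_links": 0.4,
    "pageviews": 0.3,
}
CAPS = {
    "size": 100_000,
    "templates": 50,
    "headings": 30,
    "incoming_links": 10_000,
    "pageviews": 1_000_000,
}


def log_scale(value, cap):
    """Map a non-negative count to [0, 1], saturating logarithmically at cap."""
    value = min(max(value, 0), cap)
    return math.log1p(value) / math.log1p(cap)


def score(doc):
    """Weighted sum of log-scaled metadata features; missing features count as 0."""
    return sum(
        weight * log_scale(doc.get(name, 0), CAPS[name])
        for name, weight in WEIGHTS.items()
    )


page = {"size": 42_000, "templates": 12, "headings": 9,
        "incoming_links": 350, "pageviews": 20_000}
print(round(score(page), 3))
```

Nothing in such a score depends on what users actually type, which is why a popular page can dominate a short prefix even when most searches for that prefix are looking for something else.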
Another problem I have today is that I don't have a good method to evaluate the quality of the formula. I've added a small page on wikitech that describes the formula[1]. It's the R script I use to briefly check the score distribution before testing on en-suggesty. Note that this page is not necessarily updated with the latest params; gerrit[2] may contain more up-to-date params, in line with what you can see on en-suggesty. Another data source I have failed to use so far is term statistics from the prefixsearch index[3]; it helps to see how ambiguous a prefix is depending on its length.
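As a rough illustration of what "ambiguity of a prefix according to its length" could mean, the sketch below counts how many titles share each prefix of a given length. It works on a plain list of titles rather than on the prefixsearch index dump, and the `prefix_ambiguity` helper is hypothetical, so treat it as an assumption about the kind of statistic involved, not as the method used in [3].

```python
from collections import Counter


def prefix_ambiguity(titles, max_len=5):
    """Return {prefix length: average number of titles sharing a prefix of that length}.

    Titles shorter than the prefix length are ignored for that length.
    """
    result = {}
    for n in range(1, max_len + 1):
        counts = Counter(t[:n].lower() for t in titles if len(t) >= n)
        if counts:
            # Total titles divided by number of distinct prefixes.
            result[n] = sum(counts.values()) / len(counts)
    return result


titles = ["Poland", "Portugal", "Potato", "Paris"]
for length, avg in prefix_ambiguity(titles, max_len=3).items():
    print(length, avg)
```

On a real title list the average drops quickly as the prefix gets longer, which gives a hint of how many characters a user has to type before page-level scores stop being the deciding factor.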
Any suggestions to improve the method and/or the formula are very welcome.
Thanks!
[1] https://wikitech.wikimedia.org/wiki/User:DCausse/Completion_Suggester_And_Pa...
[2] https://gerrit.wikimedia.org/r/#/c/265771/
[3] https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump