On 26/01/2016 11:20, billinghurst wrote:
> I would think that 1/4 of
> searches for "po..." are not for pornhub, though I am not aware that
> such data is available.
Yes, that is the main problem we have today: the score is computed from
document metadata (size, templates, headings, incoming_links... and now
pageviews). Search usage is not part of the score: we suggest pages, not
search queries.
Another problem I have today is that I don't have any good method to
evaluate the quality of the formula.
I've added a small page on wikitech that describes the formula[1]. It
contains the R script I use to briefly evaluate the score distribution
before testing on en-suggesty. Note that this page is not necessarily
updated with the latest params; gerrit[2] may contain params that are
closer to what you can see on en-suggesty.
Another data source I have not managed to use is term statistics from the
prefixsearch index[3]; they help to show how ambiguous a prefix is
depending on its length.
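
To illustrate what I mean by ambiguity per prefix length, here is a
minimal sketch in Python. It is not the R script or the Cirrus dump
tooling mentioned above; the function name prefix_ambiguity, the sample
titles and the "average titles per distinct prefix" measure are all
assumptions made up for this example.

```python
from collections import Counter

def prefix_ambiguity(titles, max_len=5):
    """Average number of titles sharing each distinct prefix, per prefix length.

    Hypothetical measure: a higher value means a prefix of that length
    is more ambiguous on average.
    """
    stats = {}
    for n in range(1, max_len + 1):
        # Count how many titles fall under each lowercased prefix of length n.
        counts = Counter(t[:n].lower() for t in titles if len(t) >= n)
        if counts:
            stats[n] = sum(counts.values()) / len(counts)
    return stats

# Made-up sample titles, only for illustration.
titles = ["Poland", "Portugal", "Potato", "Pope", "Paris", "Python"]
ambiguity = prefix_ambiguity(titles, max_len=3)
for length, avg in ambiguity.items():
    print(length, avg)
```

With real data the titles would come from the index rather than a
hard-coded list, and the counts could be weighted by something like
pageviews, but that is beyond this sketch.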
Any suggestions to improve the method and/or the formula are very welcome.
Thanks!
[1] https://wikitech.wikimedia.org/wiki/User:DCausse/Completion_Suggester_And_P…
[2] https://gerrit.wikimedia.org/r/#/c/265771/
[3] https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump