OK, I'll try to write a short doc describing the features available
in Cirrus, from the query parser to the fallback methods.
I'll try to map some query classes to each of these features and see
what Cirrus needs to add to these logs.
But I think we can already start classifying some queries, using the UDF
to detect special syntax (AND/OR/NOT) and the number of results.
The work on pageviews directly addresses a class of queries that are
identifiable with the data available in these logs today, e.g.:
- One-word queries with more than X results will be directly affected by
the addition of pageviews to the ranking.
- Queries of 2 words or more with more than X results will also be
affected, but another feature related to word proximity can take precedence.
I still don't know what value makes sense for X, but it is at least 20
(we display 20 results by default).
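
To make the classes above concrete, here is a rough sketch of the
bucketing in Python. It is only an illustration: the function name, the
class labels, and the whitespace-based word split are my assumptions,
not anything that exists in Cirrus or in the UDF, and X defaults to 20
only because that is the lower bound mentioned above.

```python
import re

# Assumption: "special syntax" means only the uppercase operators named
# in this thread (AND/OR/NOT), matched as whole words.
SPECIAL_SYNTAX = re.compile(r"\b(?:AND|OR|NOT)\b")


def classify_query(query: str, result_count: int, x: int = 20) -> str:
    """Bucket a query using only its text and its number of results.

    All labels returned here are hypothetical names for illustration.
    """
    if SPECIAL_SYNTAX.search(query):
        return "special_syntax"
    if result_count <= x:
        # Not "more than X results": outside the classes discussed above.
        return "few_results"
    words = query.split()  # naive whitespace tokenization (assumption)
    if len(words) == 1:
        return "one_word_pageviews"
    if len(words) >= 2:
        return "multi_word_pageviews_or_proximity"
    return "empty"
```

For example, classify_query("paris", 500) falls in the one-word class,
while classify_query("eiffel tower", 500) falls in the class where word
proximity may take precedence over pageviews.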
On 02/12/2015 15:46, Oliver Keyes wrote:
Well, the query classification Mikhail was suggesting involved adding
data to the logs. So in and of itself, this does not help. But it is a
fantastic achievement, and I am looking forward to switching our data
collection scripts over to using this.
On 2 December 2015 at 09:30, David Causse <dcausse(a)wikimedia.org> wrote:
Hi,
The work started by Erik a few months ago is finally done. Cirrus requests are
now available in the Hive table wmf_raw.CirrusSearchRequestSet.
I really hope this will help us understand the kind of queries we are
serving and start working on query classification, as Mikhail suggested.
David.