On 23/08/18 23:10, Stas Malyshev wrote:
Hi!
On 8/23/18 2:07 PM, Daniel Mietchen wrote:
On Thu, Aug 23, 2018 at 10:44 PM fn@imm.dtu.dk wrote:
I was wondering why our research section was number 8. Then I recalled our dashboard running from "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It updates roughly every 3 minutes, all day long...
Such automated queries should not be in the organic query file that I looked at.
If it's a browser page and the underlying code does not set a distinctive user agent, I think they will be. It'd be hard to identify such cases otherwise (CCing Markus in case he knows more on the topic).
Yes, the "organic" file is a subset of the queries from agents that pretended to be a browser. We filtered out agents and query patterns that were clearly not "human-like", but a tool that asks one query every 3 min would not be recognised at this level.
Such a tool would also not strongly affect most statistics, but it can affect statistics that have an extremely high number of possible values (e.g., items used in the query). In such cases, normal "organic" traffic is usually so diverse that no individual value receives much prominence, so even a rather small number of queries from one source can have an impact.
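To make the effect concrete, here is a rough Python sketch. All numbers in it are invented for illustration and are not taken from the actual logs: the 10,000 organic queries, the pool of 5,000 items, and the placeholder label "Q_dashboard" are assumptions. Only the "one query every 3 minutes" rate comes from the dashboard described above.

```python
from collections import Counter

# One query every 3 minutes, all day long (rate from the thread).
QUERIES_PER_DAY_FROM_TOOL = 24 * 60 // 3  # 480

def item_counts(organic_queries, item_pool_size, tool_item):
    """Count how often each item is used, for one hypothetical day.

    Organic traffic is modelled (as an assumption) as spread evenly
    over a large pool of items; the tool always asks about one item.
    """
    counts = Counter()
    for i in range(organic_queries):
        counts[f"Q{i % item_pool_size}"] += 1
    counts[tool_item] += QUERIES_PER_DAY_FROM_TOOL
    return counts

# Hypothetical volumes, chosen only to illustrate the skew.
counts = item_counts(
    organic_queries=10_000,
    item_pool_size=5_000,
    tool_item="Q_dashboard",  # placeholder label, not a real item
)

top_item, top_count = counts.most_common(1)[0]
total = sum(counts.values())
tool_share = QUERIES_PER_DAY_FROM_TOOL / total

print(top_item, top_count)           # the tool's item leads the ranking
print(f"{tool_share:.1%} of all queries come from the tool")
```

Under these assumptions the tool contributes less than 5% of the queries, yet its item tops the "most used items" ranking by a wide margin, because every organic item is used only a couple of times.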
In general, popularity measures based on query traffic, even on the organic part, must be treated with caution, because many effects lead to skewed query volumes from a particular source (without this necessarily indicating real "popularity"). It is an open question how one should best evaluate the traffic in the presence of these skews. Our two-class system of "robotic" [massively skewed] and "organic" [less skewed] is only a first step there.
Best,
Markus