On 23/08/18 23:10, Stas Malyshev wrote:
Hi!
On 8/23/18 2:07 PM, Daniel Mietchen wrote:
On Thu, Aug 23, 2018 at 10:44 PM <fn(a)imm.dtu.dk> wrote:
I was wondering why our research section was number 8. Then I recalled
our dashboard running from
"http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It
updates about every 3 minutes, all day long...
Such automated queries should not be in the organic query file that I looked at.
If it's a browser page and the underlying code does not set a distinctive
user agent, I think they will be. It'd be hard to identify such cases
otherwise (ccing Markus in case he knows more on the topic).
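For a script-based tool (as opposed to a page running inside a browser, where overriding this header may not be possible), setting a distinctive user agent could look roughly like the sketch below. The helper name build_sparql_request and the example agent string are made up for illustration; only the public query service endpoint URL is real. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql_request(query, user_agent):
    """Build (but do not send) a GET request that identifies the tool.

    Hypothetical helper: the name and signature are illustrative only.
    """
    url = WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        # A distinctive agent string lets log analysis separate this
        # tool's traffic from ordinary browser traffic.
        "User-Agent": user_agent,
        "Accept": "application/sparql-results+json",
    }
    return Request(url, headers=headers)


# Example agent string (an assumption, not a real tool name).
request = build_sparql_request(
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
    "ExampleDashboard/0.1 (mailto:someone@example.org)",
)
```

Passing the returned object to urllib.request.urlopen would actually run the query.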
Yes, the "organic" file is a subset of the queries from agents that
pretended to be a browser. We filtered agents and query patterns that
were clearly not "human-like" but a tool that asks one query every 3 min
would not be recognised at this level.
Such a tool would also not strongly affect most statistics, but it can
affect statistics that have an extremely high number of possible
values (e.g., items used in the query). In such cases, normal "organic"
traffic is usually so diverse that no individual value receives much
prominence, so that even a rather small number of queries from one
source could have an impact.
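To make the effect concrete, here is a small sketch with invented numbers: 10,000 organic queries spread evenly over 5,000 distinct items, plus one tool that asks about the same item every 3 minutes. The traffic volumes and item names are assumptions for illustration, not figures from the actual logs.

```python
from collections import Counter


def queries_per_day(interval_minutes):
    """Number of queries a tool sends per day at a fixed interval."""
    return (24 * 60) // interval_minutes


def item_counts(organic_items, tool_item, tool_queries):
    """Count item mentions in organic traffic plus one periodic tool."""
    counts = Counter(organic_items)
    counts[tool_item] += tool_queries
    return counts


# Hypothetical organic traffic: each of 5,000 items is queried twice.
organic = [f"item{i % 5000}" for i in range(10000)]

tool_queries = queries_per_day(3)  # one query every 3 minutes
counts = item_counts(organic, "dashboard_item", tool_queries)

top_item, top_count = counts.most_common(1)[0]
share = tool_queries / sum(counts.values())

print(f"tool queries per day: {tool_queries}")
print(f"tool share of all queries: {share:.1%}")
print(f"most frequent item: {top_item} ({top_count} mentions)")
```

Under these assumptions the tool contributes under 5% of all queries, yet its item ranks first by a wide margin, because no organic item is mentioned more than twice.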
In general, popularity measures based on query traffic, even on the
organic part, must be taken with caution, because of the many effects
that lead to skewed query volumes from a particular source (without this
necessarily indicating real "popularity"). It is an open question how
one should best evaluate the traffic in the presence of these skews. Our
two-class system of "robotic" [massively skewed] and "organic" [less
skewed] is only a first step there.
Best,
Markus