Hi Aaron, all --
I haven't seen any discussion on this, which is a sign that we can move forward
with turning over the draft. Thoughts?
thanks,
-Toby
On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
wrote:
Hey folks,
As discussions on the new page view definition have been calming down,
we're preparing to deliver a draft version to the Devs. I want to make
sure that we all know the status and that any substantial concerns are
raised before we hand things off on *Friday, Dec 12th.*
For this phase, we are delivering the general filter[1]. This is the
highest level filter, and exists primarily to distinguish requests worthy
of further evaluation. Our plan is to take the definition as it exists on
the 12th, and begin generating high-level aggregate numbers based on it. In
future iterations, we will be digging into different breakdowns of this
metric, and iterating on it to handle any inconsistencies or unexpected
results. There are a few differences from Web Stat Collector's (WSC) version
of the general filter that we want to call to your attention.
- We include searches -- WSC explicitly excludes them.
- We include Apps traffic -- WSC does not detect Apps traffic.
- We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) --
WSC hardcodes "/wiki/".
- We don't include Banner impressions -- WSC includes them.
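To make the four differences above concrete, here is a minimal sketch of how such a filter could be written. It is not the definition from [1]: the search URL shape, the banner special-page names, the Apps User-agent token, and the function name passes_general_filter are all assumptions made for illustration, and the variant list only contains the examples named above.

```python
import re
from urllib.parse import urlsplit, parse_qs

# "/wiki/" plus the language-variant examples from the list above.
# The real definition may cover more variants; this list is illustrative.
PATH_PREFIX = re.compile(r"^/(wiki|zh-tw|zh-cn|sr-ec)/")

# Assumption: banner impressions can be recognised by these special-page names.
BANNER_PAGES = ("Special:BannerRandom", "Special:RecordImpression")

# Assumption: Apps requests carry this token in the User-agent field.
APP_UA_TOKEN = "WikipediaApp"


def passes_general_filter(url, user_agent=""):
    """Return True if a request would pass this sketch of the general filter."""
    parts = urlsplit(url)
    target = parts.path + "?" + parts.query

    # Banner impressions are excluded (WSC includes them).
    if any(page in target for page in BANNER_PAGES):
        return False

    # Apps traffic is included (WSC does not detect it).
    if APP_UA_TOKEN in user_agent:
        return True

    # /wiki/ and its variants are included (WSC hardcodes "/wiki/").
    if PATH_PREFIX.match(parts.path):
        return True

    # Searches are included (WSC excludes them).
    # Assumption: a search arrives as /w/index.php?search=...
    if parts.path == "/w/index.php" and "search" in parse_qs(parts.query):
        return True

    return False
```

The banner check runs first so that a banner request is rejected even when it would otherwise match one of the inclusion rules.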
There are also some known issues with the new definition that are worth
your notice:
1. *Internal traffic is counted*
- Note that WSC filters some internal traffic by hardcoding a set of
IPs in the definition. We are working on parsing puppet templates in order
to automatically detect which IPs represent internal traffic. This will be
a /better/ solution, but it's not quite ready yet because parsing puppet is
hard.
2. *Spider traffic is counted*
- We will be using the User-agent field to detect and flag
spider-based traffic. This "tag definition" will be delivered in a
subsequent definition. Counting spider traffic for now actually matches WSC,
which does not filter spider traffic for the high-level metrics.
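As a rough illustration of where both known issues are headed, the sketch below tags a request as internal or spider rather than dropping it. Everything specific in it is an assumption: the IP ranges are placeholder private/documentation blocks (the real list is meant to come from the puppet templates), the User-agent pattern is a naive stand-in for the future "tag definition", and tag_request is a hypothetical name.

```python
import ipaddress
import re

# Hypothetical internal ranges (private and documentation blocks).
# The real ranges would be derived from the puppet templates.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]

# Naive placeholder pattern; the actual spider "tag definition" is not yet delivered.
SPIDER_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)


def tag_request(ip, user_agent):
    """Return flags for a request instead of filtering it out."""
    addr = ipaddress.ip_address(ip)
    return {
        "is_internal": any(addr in net for net in INTERNAL_NETWORKS),
        "is_spider": bool(SPIDER_PATTERN.search(user_agent or "")),
    }
```

Tagging instead of filtering keeps the high-level count comparable to WSC while still allowing the flagged traffic to be broken out later.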
These are problems we're aware of, and will be factoring in as we go
forward with our next task: refining the definition using real,
hourly-level traffic data. Thanks to everyone who has given feedback and
participated in the process thus far, particularly Nemo, Erik, and
Christian.
[1] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
-Aaron & Oliver
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics