Hey folks,
As discussions on the new page view definition have been calming down, we're preparing to deliver a draft version to the Devs. I want to make sure that we all know the status and that any substantial concerns are raised before we hand things off on *Friday, Dec 12th.*
For this phase, we are delivering the general filter[1]. This is the highest-level filter, and exists primarily to distinguish requests worthy of further evaluation. Our plan is to take the definition as it exists on the 12th and begin generating high-level aggregate numbers based on it. In future iterations, we will be digging into different breakdowns of this metric, and iterating on it to handle any inconsistencies or unexpected results. There are a few differences from Web Stat Collector's (WSC) version of the general filter that we want to call to your attention.
- We include searches -- WSC explicitly excludes them.
- We include Apps traffic -- WSC does not detect Apps traffic.
- We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC hardcodes "/wiki/".
- We don't include Banner impressions -- WSC includes them.
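As a rough illustration of the /wiki/ variant difference listed above, here is a minimal Python sketch. The regular expression and both function names are assumptions made for this example; they are not taken from WSC or from the definition on meta, and real variant codes may not all fit this pattern.

```python
import re

# Assumption: page paths look like /wiki/<title> or /<variant>/<title>,
# where <variant> is a language-variant code such as zh-tw or sr-ec.
# Illustrative pattern only, not the published definition.
PAGE_PATH = re.compile(r"^/(wiki|[a-z]{2,3}-[a-z]{2,4})/.+")


def wsc_style_match(uri_path: str) -> bool:
    """Mimic a filter that hardcodes the "/wiki/" prefix."""
    return uri_path.startswith("/wiki/")


def variant_aware_match(uri_path: str) -> bool:
    """Also accept language-variant prefixes such as /zh-tw/."""
    return PAGE_PATH.match(uri_path) is not None
```

With these two helpers, a path such as /zh-tw/Foo is rejected by the hardcoded check but accepted by the variant-aware one, which is the kind of gap the new definition is meant to close.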
There are also some known issues with the new definition that are worth your notice:
1. *Internal traffic is counted*
   - Note that WSC filters some internal traffic by hardcoding a set of IPs in the definition. We are working on parsing puppet templates in order to automatically detect which IPs represent internal traffic. This will be a /better/ solution, but it's not quite ready yet because parsing puppet is hard.
2. *Spider traffic is counted*
   - We will be using the User-Agent field to detect and flag spider-based traffic. This "tag definition" will be delivered in a subsequent definition. This actually matches WSC, which does not filter spiders for the high-level metrics.
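To make the two known issues concrete, here is a minimal Python sketch of tagging (rather than dropping) requests. Everything in it is an assumption for illustration: the User-Agent pattern is a crude heuristic, the networks are placeholder ranges rather than real internal IPs, and `tag_request` is a hypothetical helper name.

```python
import ipaddress
import re

# Assumption: a crude User-Agent heuristic; real spider detection
# would need a maintained pattern list.
SPIDER_UA = re.compile(r"bot|spider|crawler", re.IGNORECASE)

# Assumption: placeholder networks (a private range and a documentation
# range), standing in for whatever list is eventually derived.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def tag_request(ip: str, user_agent: str) -> dict:
    """Return flags for a request instead of filtering it out."""
    addr = ipaddress.ip_address(ip)
    return {
        "is_spider": bool(SPIDER_UA.search(user_agent or "")),
        "is_internal": any(addr in net for net in INTERNAL_NETWORKS),
    }
```

Flagging keeps the high-level count unchanged while still allowing later breakdowns to exclude spider or internal traffic.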
These are problems we're aware of, and will be factoring in as we go forward with our next task: refining the definition using real, hourly-level traffic data. Thanks to everyone who has given feedback and participated in the process thus far, particularly Nemo, Erik, and Christian.
[1] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
-Aaron & Oliver
Awesome! Maybe better to parse pybal than puppet?
On Dec 9, 2014, at 20:15, Aaron Halfaker ahalfaker@wikimedia.org wrote:
We are working on parsing puppet templates in order to automatically detect which IPs represent internal traffic.
We'll see. At the moment we don't actually know what we're /looking/ for ;p. I've started a distinct thread people can hopefully take a look at, covering this issue.
On 9 December 2014 at 21:46, Andrew Otto aotto@wikimedia.org wrote:
Awesome! Maybe better to parse pybal than puppet?
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Aaron, all --
I haven't seen any discussion on this, which is a sign that we can move forward with turning over the draft. Thoughts?
thanks,
-Toby
We've written the draft Hive queries and I'm reviewing them with Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it through :).
On 15 December 2014 at 12:10, Toby Negrin tnegrin@wikimedia.org wrote:
Hi Aaron, all --
I haven't seen any discussion on this, which is a sign that we can move forward with turning over the draft. Thoughts?
thanks,
-Toby
Excellent! I'm assuming the spec is considered "final" pending any clarification from the comparison with the WSC data?
thanks,
-Toby
On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes okeyes@wikimedia.org wrote:
We've written the draft Hive queries and I'm reviewing them with Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it through :).
-- Oliver Keyes Research Analyst Wikimedia Foundation
Toby, that's right. We're moving forward to generate Hive queries that will represent the formal specification.
-Aaron
> We're moving forward to generate Hive queries that will represent the formal specification.
Should a specific implementation (e.g. Hive) represent the formal specification? I tend to think it should be tech-agnostic, no?
On Dec 15, 2014, at 12:15, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Toby, that's right. We're moving forward to generate Hive queries that will represent the formal specification.
-Aaron
I think the hive code is "representative" in that it's an implementation. It's certainly not the only permitted one.
On Dec 15, 2014, at 10:34 AM, Andrew Otto aotto@wikimedia.org wrote:
We're moving forward to generate Hive queries that will represent the formal specification.
Should a specific implementation (e.g. Hive) represent the formal specification? I tend to think it should be tech-agnostic, no?
It's totally tech-agnostic; the neutral definition is on meta. The Hive query exists only because we suspect that's how we'll be generating the data, so it makes sense to turn the draft def into HQL for exploratory queries and testing.
...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
On 15 December 2014 at 13:46, Toby Negrin tnegrin@wikimedia.org wrote:
I think the hive code is "representative" in that it's an implementation. It's certainly not the only permitted one.
Ah cool, didn’t realize there was a neutral definition. We should call that the ‘formal specification’ then.
> ...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
Hm, a UDF that does this rather than a Hive query would probably be better. E.g.
    SELECT
      request_qualifier(uri_host),
      count(*)
    FROM wmf_raw.webrequest
    WHERE is_pageview(uri_host, uri_path, http_status, content_type)
    GROUP BY request_qualifier(uri_host);
Or something like that.
-Ao
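For readers who want to see the shape of that suggestion outside Hive, here is a minimal Python sketch of the same predicate-plus-grouping idea. The function names mirror the query above, but their bodies are placeholder assumptions: the actual rules belong to the definition on meta, and treating `http_status` as a string is also an assumption.

```python
from collections import Counter


def is_pageview(uri_host, uri_path, http_status, content_type):
    """Placeholder rules only; the real ones live in the definition on meta."""
    return (
        http_status == "200"  # assumption: status is stored as a string
        and (content_type or "").startswith("text/html")
        and uri_path.startswith("/wiki/")
    )


def request_qualifier(uri_host):
    """Hypothetical grouping key: the lower-cased host name."""
    return uri_host.lower()


def count_pageviews(rows):
    """Equivalent of the WHERE ... GROUP BY ... count(*) in the query."""
    counts = Counter()
    for row in rows:
        if is_pageview(row["uri_host"], row["uri_path"],
                       row["http_status"], row["content_type"]):
            counts[request_qualifier(row["uri_host"])] += 1
    return counts
```

Keeping the predicate in one function means the same logic can back a Hive UDF, a test suite, or any other implementation without restating the rules in each query.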
On Dec 15, 2014, at 14:07, Oliver Keyes okeyes@wikimedia.org wrote:
It's totally tech-agnostic; the neutral definition is on meta. The Hive query exists only because we suspect that's how we'll be generating the data, so it makes sense to turn the draft def into HQL for exploratory queries and testing.
...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
Totally!
On 15 December 2014 at 14:22, Andrew Otto aotto@wikimedia.org wrote:
Ah cool, didn’t realize there was a neutral definition. We should call that the ‘formal specification’ then.
Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with the implementation.
On Dec 15, 2014, at 11:32 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Totally!
This needs more testing! Validation! Etc. But woo! https://gerrit.wikimedia.org/r/#/c/180023
This lets you do:
ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
CREATE TEMPORARY FUNCTION is_pageview as 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
SELECT
  LOWER(uri_host) AS uri_host,
  count(*) AS pageview_count
FROM wmf_raw.webrequest
WHERE (webrequest_source = 'text' OR webrequest_source = 'mobile')
  AND year=2014 AND month=12 AND day=7 AND hour=12
  AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
GROUP BY LOWER(uri_host)
ORDER BY pageview_count DESC
LIMIT 10;
…
uri_host                 pageview_count
en.wikipedia.org         6613046
en.m.wikipedia.org       3223273
ru.wikipedia.org         2119850
ja.m.wikipedia.org       1501954
ja.wikipedia.org         1411533
de.wikipedia.org         1330252
zh.wikipedia.org         949228
fr.wikipedia.org         939602
commons.wikimedia.org    912965
de.m.wikipedia.org       664661
Time taken: 94.295 seconds, Fetched: 10 row(s)
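As a quick sanity check on output like this, the rows above can be summarised in a few lines of Python. This is only a sketch for eyeballing one hour of data: the counts are copied from the result above, and treating any host containing ".m." as mobile web is an assumption made for the example, not part of the definition.

```python
# Top-10 rows copied from the query output above (2014-12-07, hour 12).
TOP_HOSTS = {
    "en.wikipedia.org": 6613046,
    "en.m.wikipedia.org": 3223273,
    "ru.wikipedia.org": 2119850,
    "ja.m.wikipedia.org": 1501954,
    "ja.wikipedia.org": 1411533,
    "de.wikipedia.org": 1330252,
    "zh.wikipedia.org": 949228,
    "fr.wikipedia.org": 939602,
    "commons.wikimedia.org": 912965,
    "de.m.wikipedia.org": 664661,
}


def is_mobile_host(host):
    # Assumption: mobile web hosts carry an ".m." label, as in the rows above.
    return ".m." in host


total = sum(TOP_HOSTS.values())
mobile = sum(n for host, n in TOP_HOSTS.items() if is_mobile_host(host))

print(f"top-10 total: {total}")
print(f"mobile share of top 10: {mobile / total:.1%}")
```

Because this covers only the ten largest hosts for a single hour, the share is a rough plausibility check rather than a site-wide figure.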
On Dec 15, 2014, at 16:02, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with the implementation.
Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll want the output fired over to limn.
On 15 December 2014 at 19:01, Andrew Otto aotto@wikimedia.org wrote:
This needs more testing! Validation! Etc. But woo! https://gerrit.wikimedia.org/r/#/c/180023
Note that Andrew's example code doesn't run (at least, for me) because it needs to be:
CREATE TEMPORARY FUNCTION is_pageview as "org.wikimedia.analytics.refinery.hive.IsPageviewUDF";
Hive gets stupider every time I try to use it ;p
On 15 December 2014 at 20:47, Oliver Keyes okeyes@wikimedia.org wrote:
Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll want the output fired over to limn.
On 15 December 2014 at 19:01, Andrew Otto aotto@wikimedia.org wrote:
This needs more testing! Validation! Etc. But woo! https://gerrit.wikimedia.org/r/#/c/180023
This lets you do:
ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
CREATE TEMPORARY FUNCTION is_pageview as 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
SELECT
  LOWER(uri_host) as uri_host,
  count(*) as pageview_count
FROM
  wmf_raw.webrequest
WHERE
  (webrequest_source = 'text' or webrequest_source = 'mobile')
  AND year=2014
  AND month=12
  AND day=7
  AND hour=12
  AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
GROUP BY
  LOWER(uri_host)
ORDER BY
  pageview_count desc
LIMIT 10
;
…
uri_host                 pageview_count
en.wikipedia.org         6613046
en.m.wikipedia.org       3223273
ru.wikipedia.org         2119850
ja.m.wikipedia.org       1501954
ja.wikipedia.org         1411533
de.wikipedia.org         1330252
zh.wikipedia.org         949228
fr.wikipedia.org         939602
commons.wikimedia.org    912965
de.m.wikipedia.org       664661
Time taken: 94.295 seconds, Fetched: 10 row(s)
On Dec 15, 2014, at 16:02, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with the implementation.
On Dec 15, 2014, at 11:32 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Totally!
On 15 December 2014 at 14:22, Andrew Otto aotto@wikimedia.org wrote:
Ah cool, didn’t realize there was a neutral definition. We should call that the ‘formal specification’ then.
> ...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
Hm, a UDF that does this rather than a Hive query would probably be better. E.g.
SELECT
  request_qualifier(uri_host),
  count(*)
FROM
  wmf_raw.webrequest
WHERE
  is_pageview(uri_host, uri_path, http_status, content_type)
GROUP BY
  request_qualifier(uri_host)
;
Or something like that.
-Ao
On Dec 15, 2014, at 14:07, Oliver Keyes okeyes@wikimedia.org wrote:
It's totally tech-agnostic; the neutral definition is on meta. The hive query is just because, since we suspect that's how we'll be generating the data, it makes sense to turn the draft def into HQL for exploratory queries and testing.
...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
On 15 December 2014 at 13:46, Toby Negrin tnegrin@wikimedia.org wrote:
I think the hive code is "representative" in that it's an implementation. It's certainly not the only permitted one.
On Dec 15, 2014, at 10:34 AM, Andrew Otto aotto@wikimedia.org wrote:
> We're moving forward to generate Hive queries that will represent the formal specification.
Should a specific implementation (e.g. Hive) represent the formal specification? I tend to think it should be tech-agnostic, no?
On Dec 15, 2014, at 12:15, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Toby, that's right. We're moving forward to generate Hive queries that will represent the formal specification.
-Aaron
On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes okeyes@wikimedia.org wrote:
We've written the draft Hive queries and I'm reviewing them with Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it through :).
On 15 December 2014 at 12:10, Toby Negrin tnegrin@wikimedia.org wrote:
Hi Aaron, all --
I haven't seen any discussion on this, which is a sign that we can move forward with turning over the draft. Thoughts?
thanks,
-Toby
On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
> Hey folks,
>
> As discussions on the new page view definition have been calming down, we're preparing to deliver a draft version to the Devs. I want to make sure that we all know the status and that any substantial concerns are raised before we hand things off on *Friday, Dec 12th.*
>
> For this phase, we are delivering the general filter[1]. This is the highest level filter, and exists primarily to distinguish requests worthy of further evaluation. Our plan is to take the definition as it exists on the 12th, and begin generating high-level aggregate numbers based on it. In future iterations, we will be digging into different breakdowns of this metric, and iterating on it to handle any inconsistencies or unexpected results. There's a few differences from Web Stat Collector's (WSC) version of the general filter that we want to call to your attention to.
>
> - We include searches -- WSC explicitly excludes them.
> - We include Apps traffic -- WSC does not detect Apps traffic
> - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC hardcodes "/wiki/"
> - We don't include Banner impressions -- WSC includes them.
>
> There are also some known issues with the new definition that are worth your notice:
>
> 1. *Internal traffic is counted*
>    - Note that WSC filters some internal traffic by hardcoding a set of IPs in the definition. We are working on parsing puppet templates in order to automatically detect which IPs represent internal traffic. This will be a /better/ solution, but it's not quite ready yet because parsing puppet is hard.
> 2. *Spider traffic is counted*
>    - We will be using the User-agent field to detect and flag spider-based traffic. This "tag definition" will be delivered in a subsequent definition. This actually matches WSC, which does not filter spider for the high-level metrics.
>
> These are problems we're aware of, and will be factoring in as we go forward with our next task: refining the definition using real, hourly-level traffic data. Thanks to everyone who has given feedback and participated in the process thus far, particularly Nemo, Erik, and Christian.
>
> [1] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>
> -Aaron & Oliver
Note that in Oliver's example, the quotes are double quotes, not single quotes. I didn't see the difference immediately.
-Toby
On Tue, Dec 16, 2014 at 6:22 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Note that Andrew's example code doesn't run (at least, for me) because it needs to be:
CREATE TEMPORARY FUNCTION is_pageview as "org.wikimedia.analytics.refinery.hive.IsPageviewUDF";
Hive gets stupider every time I try to use it ;p
That might be an email copy/paste problem; I see the non-symmetrical quotes in my email.
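For anyone else pasting HQL out of a mail client, here is a minimal sketch of a pre-flight check for this kind of quote mangling. It is plain Python and not part of refinery; the helper names (`find_typographic_quotes`, `normalize_quotes`) are made up for illustration, and the only assumption is that the breakage comes from typographic quotes such as U+2019 replacing the ASCII ones.

```python
from typing import List, Tuple

# Typographic quotes that mail clients tend to substitute, mapped to ASCII.
TYPOGRAPHIC_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def find_typographic_quotes(statement: str) -> List[Tuple[int, str]]:
    """Return (offset, character) for every typographic quote in the statement."""
    return [(i, ch) for i, ch in enumerate(statement) if ch in TYPOGRAPHIC_QUOTES]


def normalize_quotes(statement: str) -> str:
    """Replace typographic quotes with their ASCII equivalents."""
    return statement.translate(str.maketrans(TYPOGRAPHIC_QUOTES))


# The statement as it arrived by email: it opens with an ASCII quote
# but closes with U+2019.
pasted = (
    "CREATE TEMPORARY FUNCTION is_pageview as "
    "'org.wikimedia.analytics.refinery.hive.IsPageviewUDF\u2019;"
)

print(find_typographic_quotes(pasted))
print(normalize_quotes(pasted))
```

Running the pasted statement through `normalize_quotes` before handing it to Hive should give a matched pair of ASCII single quotes, which, as far as I know, Hive treats the same as the double quotes in Oliver's version.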
On Dec 16, 2014, at 09:38, Toby Negrin tnegrin@wikimedia.org wrote:
Note that in Oliver's example, the quotes are double quotes, not single quotes. I didn't see the difference immediately.
-Toby
FAILED: SemanticException [Error 10014]: Line 12:8 Wrong arguments 'content_type': No matching method for class org.wikimedia.analytics.refinery.hive.IsPageviewUDF with (string, string, string, string, string). Possible choices: _FUNC_(string, string, string, string, string, string)
whut
On 16 December 2014 at 10:08, Andrew Otto aotto@wikimedia.org wrote:
That might be an email copy/paste problem; I see the non-symmetrical quotes in my email.
On Dec 16, 2014, at 09:38, Toby Negrin tnegrin@wikimedia.org wrote:
Note that in Oliver's example, the quotes are double quotes, not single quotes. I didn't see the difference immediately.
-Toby
On Tue, Dec 16, 2014 at 6:22 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Note that Andrew's example code doesn't run (at least, for me) because it needs to be:
CREATE TEMPORARY FUNCTION is_pageview as "org.wikimedia.analytics.refinery.hive.IsPageviewUDF";
Hive gets stupider every time I try to use it ;p
On 15 December 2014 at 20:47, Oliver Keyes okeyes@wikimedia.org wrote:
Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll want the output fired over to limn.
On 15 December 2014 at 19:01, Andrew Otto aotto@wikimedia.org wrote:
This needs more testing! Validation! Etc. But woo! https://gerrit.wikimedia.org/r/#/c/180023
This let’s you do:
ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
CREATE TEMPORARY FUNCTION is_pageview as 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
SELECT LOWER(uri_host) as uri_host, count(*) as pageview_count FROM wmf_raw.webrequest WHERE (webrequest_source = 'text' or webrequest_source = 'mobile') AND year=2014 AND month=12 AND day=7 AND hour=12 AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type) GROUP BY LOWER(uri_host) ORDER BY pageview_count desc LIMIT 10 ;
…
uri_host pageview_count
en.wikipedia.org 6613046 en.m.wikipedia.org 3223273 ru.wikipedia.org 2119850 ja.m.wikipedia.org 1501954 ja.wikipedia.org 1411533 de.wikipedia.org 1330252 zh.wikipedia.org 949228 fr.wikipedia.org 939602 commons.wikimedia.org 912965 de.m.wikipedia.org 664661
Time taken: 94.295 seconds, Fetched: 10 row(s)
On Dec 15, 2014, at 16:02, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:
Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with the implementation.
On Dec 15, 2014, at 11:32 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Totally!
On 15 December 2014 at 14:22, Andrew Otto aotto@wikimedia.org wrote:
Ah cool, didn’t realize there was a neutral definition. We should call that the ‘formal specification’ then.
...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
Hm, a UDF that does this rather than a Hive query would probably be better. E.g.
SELECT request_qualifier(uri_host), count(*) FROM wmf_raw.webrequest WHERE is_pageview(uri_host, uri_path, http_status, content_type) GROUP BY request_qualifier(uri_host) ;
Or something like that.
-Ao
On Dec 15, 2014, at 14:07, Oliver Keyes okeyes@wikimedia.org wrote:
It's totally tech-agnostic; the neutral definition is on meta. The hive query is just because, since we suspect that's how we'll be generating the data, it makes sense to turn the draft def into HQL for exploratory queries and testing.
...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
On 15 December 2014 at 13:46, Toby Negrin tnegrin@wikimedia.org wrote:
I think the hive code is "representative" in that it's an implementation. It's certainly not the only permitted one.
On Dec 15, 2014, at 10:34 AM, Andrew Otto aotto@wikimedia.org wrote:
We're moving forward to generate Hive queries that will represent the formal specification.
Should a specific implementation (e.g. Hive) represent the formal specification? I tend to think it should be tech-agnostic, no?
On Dec 15, 2014, at 12:15, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Toby, that's right. We're moving forward to generate Hive queries that will represent the formal specification.
-Aaron
On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes okeyes@wikimedia.org wrote:
> We've written the draft Hive queries and I'm reviewing them with Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it through :).
>
> On 15 December 2014 at 12:10, Toby Negrin tnegrin@wikimedia.org wrote:
>>
>> Hi Aaron, all --
>>
>> I haven't seen any discussion on this which is a sign that we can forward with turning over the draft. Thoughts?
>>
>> thanks,
>>
>> -Toby
Hi Oliver,
On Wed, Dec 24, 2014 at 03:23:39PM -0500, Oliver Keyes wrote:
> FAILED: SemanticException [Error 10014]: Line 12:8 Wrong arguments 'content_type': No matching method for class org.wikimedia.analytics.refinery.hive.IsPageviewUDF with (string, string, string, string, string). Possible choices: _FUNC_(string, string, string, string, string, string)
/me puts on his augur hat, since your email is lacking a way to reproduce the issue.
Maybe you passed the wrong number of arguments (5 strings vs. the 6 needed strings)?
(Count the "string"s in the error message)
Have fun, Christian
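The "count the strings" check suggested above can be done mechanically. The following is only a minimal sketch, assuming the error text always has the shape quoted in this thread ("with (...)" for the call that was made, "_FUNC_(...)" for the accepted signature); the helper names are made up for illustration and are not part of Hive or the refinery code.

```python
import re
from typing import Tuple


def count_string_args(signature: str) -> int:
    """Count the 'string' entries inside one argument list."""
    return len(re.findall(r"\bstring\b", signature))


def compare_arity(error_message: str) -> Tuple[int, int]:
    """Return (passed, expected) argument counts from the error text."""
    # The parenthesised group after "with" is what the query passed.
    passed = re.search(r"with \(([^)]*)\)", error_message)
    # The group after "_FUNC_" is the signature the UDF accepts.
    expected = re.search(r"_FUNC_\(([^)]*)\)", error_message)
    if passed is None or expected is None:
        raise ValueError("unrecognised error message format")
    return (count_string_args(passed.group(1)),
            count_string_args(expected.group(1)))


# The error message exactly as quoted in the thread.
ERROR = (
    "FAILED: SemanticException [Error 10014]: Line 12:8 Wrong arguments "
    "'content_type': No matching method for class "
    "org.wikimedia.analytics.refinery.hive.IsPageviewUDF with "
    "(string, string, string, string, string). Possible choices: "
    "_FUNC_(string, string, string, string, string, string)"
)

passed, expected = compare_arity(ERROR)
print(f"passed {passed} arguments, UDF expects {expected}")
# prints "passed 5 arguments, UDF expects 6"
```

This only confirms the arity mismatch; it cannot say which of the six arguments was left out of the query.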
Thanks Christian!
Just to update everyone: as the thread above suggests, we're in the process of getting UDFs - easy, callable functions - written for the old and new pageview definitions. Once they're in, we'll be able to (amongst other things) query the database for data very conveniently, check whether what it spits out is what it *should* be spitting out, and tweak accordingly. It'll also drastically simplify the code needed to query Hive for "all the pageviews that have [attribute]".
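To give a rough idea of what "all the pageviews that have [attribute]" could look like once a UDF exists, here is a hedged sketch that only builds a query string. Everything named in it is an assumption for illustration: the function name `is_pageview`, the six column names, the table `webrequest_sample` and the example filter do not come from this thread. The only grounded detail is that the UDF takes six string arguments, as the error message above shows.

```python
from typing import Sequence

# Hypothetical names: the registered function name and the column
# names below are assumptions, not taken from the thread.
UDF_NAME = "is_pageview"
UDF_ARGS = ("uri_host", "uri_path", "uri_query",
            "http_status", "content_type", "user_agent")


def pageview_count_query(table: str, attribute_filter: str,
                         udf_args: Sequence[str] = UDF_ARGS) -> str:
    """Build a HiveQL string counting pageviews that match one extra condition.

    No escaping is done, so this is for trusted, hand-written input only.
    """
    if len(udf_args) != 6:
        # The error quoted in the thread shows the UDF takes six strings.
        raise ValueError(f"expected 6 UDF arguments, got {len(udf_args)}")
    call = f"{UDF_NAME}({', '.join(udf_args)})"
    return (f"SELECT COUNT(*) AS pageviews\n"
            f"FROM {table}\n"
            f"WHERE {call}\n"
            f"  AND {attribute_filter}")


# Hypothetical table name and attribute filter.
print(pageview_count_query("webrequest_sample", "uri_host LIKE '%.m.%'"))
```

The arity check in the builder is there to catch the five-versus-six mistake before the query ever reaches Hive.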
On 25 December 2014 at 08:05, Christian Aistleitner < christian@quelltextlich.at> wrote:
> Hi Oliver,
>
> On Wed, Dec 24, 2014 at 03:23:39PM -0500, Oliver Keyes wrote:
>> FAILED: SemanticException [Error 10014]: Line 12:8 Wrong arguments 'content_type': No matching method for class org.wikimedia.analytics.refinery.hive.IsPageviewUDF with (string, string, string, string, string). Possible choices: _FUNC_(string, string, string, string, string, string)
>
> /me puts on his augur hat, since your email is lacking a way to reproduce the issue.
>
> Maybe you passed the wrong number of arguments (5 strings vs. the 6 needed strings)?
>
> (Count the "string"s in the error message)
>
> Have fun,
> Christian
> --
> ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
> Companies' registry: 360296y in Linz
> Christian Aistleitner
> Kefermarkterstrasze 6a/3
> 4293 Gutau, Austria
> Email: christian@quelltextlich.at
> Phone: +43 7946 / 20 5 81
> Fax: +43 7946 / 20 5 81
> Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics