This needs more testing! Validation! Etc. But woo! https://gerrit.wikimedia.org/r/#/c/180023
This lets you do:
ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
CREATE TEMPORARY FUNCTION is_pageview AS 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF';
SELECT
  LOWER(uri_host) AS uri_host,
  COUNT(*) AS pageview_count
FROM wmf_raw.webrequest
WHERE (webrequest_source = 'text' OR webrequest_source = 'mobile')
  AND year=2014 AND month=12 AND day=7 AND hour=12
  AND is_pageview(LOWER(uri_host), uri_path, http_status, content_type)
GROUP BY LOWER(uri_host)
ORDER BY pageview_count DESC
LIMIT 10;
…
uri_host                 pageview_count
en.wikipedia.org         6613046
en.m.wikipedia.org       3223273
ru.wikipedia.org         2119850
ja.m.wikipedia.org       1501954
ja.wikipedia.org         1411533
de.wikipedia.org         1330252
zh.wikipedia.org         949228
fr.wikipedia.org         939602
commons.wikimedia.org    912965
de.m.wikipedia.org       664661
Time taken: 94.295 seconds, Fetched: 10 row(s)
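For anyone who wants to poke at the logic without the jar, here is a minimal Python sketch of the kind of check a function with this four-argument signature might perform, plus the per-host count the query above produces. Everything in it is an assumption for illustration: the names is_pageview_sketch and count_pageviews_by_host are hypothetical, and the host pattern, the accepted statuses and the content type are guesses, not the rules in IsPageviewUDF. The only part taken from the thread is that /wiki/ and its language variants (e.g. /zh-tw/) are accepted. The definition on meta remains the reference.

```python
import re
from collections import Counter

# ASSUMPTION: illustrative rules only, not the rules used by IsPageviewUDF.
# /wiki/ plus language-variant prefixes such as /zh-tw/, /zh-cn/, /sr-ec/.
PATH_RE = re.compile(r"^/(?:wiki|[a-z]{2,3}-[a-z]{2,4})/.+")
# ASSUMPTION: only two project domains are recognised in this sketch.
HOST_RE = re.compile(r"(?:^|\.)(?:wikipedia|wikimedia)\.org$")
# ASSUMPTION: which statuses and content types count as a page view.
OK_STATUSES = {"200", "304"}
OK_CONTENT_PREFIX = "text/html"


def is_pageview_sketch(uri_host, uri_path, http_status, content_type):
    """Return True if the request passes every (assumed) general-filter check."""
    return bool(
        HOST_RE.search(uri_host.lower())
        and PATH_RE.match(uri_path)
        and http_status in OK_STATUSES
        and content_type.lower().startswith(OK_CONTENT_PREFIX)
    )


def count_pageviews_by_host(requests):
    """Mimic GROUP BY LOWER(uri_host) with COUNT(*) over the filtered requests."""
    counts = Counter()
    for uri_host, uri_path, http_status, content_type in requests:
        if is_pageview_sketch(uri_host, uri_path, http_status, content_type):
            counts[uri_host.lower()] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("en.wikipedia.org", "/wiki/Main_Page", "200", "text/html; charset=UTF-8"),
        ("en.wikipedia.org", "/w/load.php", "200", "text/css"),
        ("zh.wikipedia.org", "/zh-tw/Example", "200", "text/html"),
    ]
    # most_common() plays the role of ORDER BY pageview_count DESC.
    for host, n in count_pageviews_by_host(sample).most_common():
        print(host, n)
```

Keeping the predicate separate from the aggregation mirrors the split between the UDF and the Hive query: the filter can change without touching the counting code.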
On Dec 15, 2014, at 16:02, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving on with the implementation.
On Dec 15, 2014, at 11:32 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Totally!
On 15 December 2014 at 14:22, Andrew Otto <aotto@wikimedia.org> wrote:
Ah cool, didn’t realize there was a neutral definition. We should call that the ‘formal specification’ then.
> ...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
Hm, a UDF that does this rather than a Hive query would probably be better. E.g.
SELECT
  request_qualifier(uri_host),
  COUNT(*)
FROM wmf_raw.webrequest
WHERE is_pageview(uri_host, uri_path, http_status, content_type)
GROUP BY request_qualifier(uri_host);
Or something like that.
-Ao
On Dec 15, 2014, at 14:07, Oliver Keyes <okeyes@wikimedia.org> wrote:
It's totally tech-agnostic; the neutral definition is on meta. The hive query is just because, since we suspect that's how we'll be generating the data, it makes sense to turn the draft def into HQL for exploratory queries and testing.
...of course, now that I've said that, cosmic irony demands we end up implementing in C, or something.
On 15 December 2014 at 13:46, Toby Negrin <tnegrin@wikimedia.org> wrote:
I think the hive code is "representative" in that it's an implementation. It's certainly not the only permitted one.
On Dec 15, 2014, at 10:34 AM, Andrew Otto <aotto@wikimedia.org> wrote:
> We're moving forward to generate Hive queries that will represent the formal specification.
Should a specific implementation (e.g. Hive) represent the formal specification? I tend to think it should be tech-agnostic, no?
On Dec 15, 2014, at 12:15, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Toby, that's right. We're moving forward to generate Hive queries that will represent the formal specification.
-Aaron
On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:
We've written the draft Hive queries and I'm reviewing them with Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure we'll work it through :).
On 15 December 2014 at 12:10, Toby Negrin <tnegrin@wikimedia.org> wrote:
Hi Aaron, all --
I haven't seen any discussion on this, which is a sign that we can move forward with turning over the draft. Thoughts?
thanks,
-Toby
On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Hey folks,
As discussions on the new page view definition have been calming down, we're preparing to deliver a draft version to the Devs. I want to make sure that we all know the status and that any substantial concerns are raised before we hand things off on Friday, Dec 12th.
For this phase, we are delivering the general filter[1]. This is the highest level filter, and exists primarily to distinguish requests worthy of further evaluation. Our plan is to take the definition as it exists on the 12th, and begin generating high-level aggregate numbers based on it. In future iterations, we will be digging into different breakdowns of this metric, and iterating on it to handle any inconsistencies or unexpected results.
There are a few differences from Web Stat Collector's (WSC) version of the general filter that we want to call to your attention:
- We include searches -- WSC explicitly excludes them.
- We include Apps traffic -- WSC does not detect Apps traffic.
- We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/, /sr-ec/) -- WSC hardcodes "/wiki/".
- We don't include Banner impressions -- WSC includes them.
There are also some known issues with the new definition that are worth your notice:
- Internal traffic is counted. Note that WSC filters some internal traffic by hardcoding a set of IPs in the definition. We are working on parsing puppet templates in order to automatically detect which IPs represent internal traffic. This will be a /better/ solution, but it's not quite ready yet because parsing puppet is hard.
- Spider traffic is counted. We will be using the User-agent field to detect and flag spider-based traffic. This "tag definition" will be delivered in a subsequent definition. This actually matches WSC, which does not filter spider traffic for the high-level metrics.
These are problems we're aware of, and will be factoring in as we go forward with our next task: refining the definition using real, hourly-level traffic data.
Thanks to everyone who has given feedback and participated in the process thus far, particularly Nemo, Erik, and Christian.
[1] https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
-Aaron & Oliver
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
-- Oliver Keyes Research Analyst Wikimedia Foundation