Yay! Will validate/patch/poke tomorrow :). If it works, presumably we'll
want the output fired over to limn.
On 15 December 2014 at 19:01, Andrew Otto <aotto(a)wikimedia.org> wrote:
This needs more testing! Validation! Etc. But woo!
https://gerrit.wikimedia.org/r/#/c/180023
This let’s you do:
ADD JAR /home/otto/refinery-hive-0.0.3-pageview.jar;
CREATE TEMPORARY FUNCTION is_pageview as
'org.wikimedia.analytics.refinery.hive.IsPageviewUDF’;
SELECT
LOWER(uri_host) as uri_host,
count(*) as pageview_count
FROM
wmf_raw.webrequest
WHERE
(webrequest_source = 'text' or webrequest_source = 'mobile')
AND year=2014
AND month=12
AND day=7
AND hour=12
AND is_pageview(LOWER(uri_host), uri_path, http_status,
content_type)
GROUP BY
LOWER(uri_host)
ORDER BY pageview_count desc
LIMIT 10
;
…
uri_host pageview_count
en.wikipedia.org 6613046
en.m.wikipedia.org 3223273
ru.wikipedia.org 2119850
ja.m.wikipedia.org 1501954
ja.wikipedia.org 1411533
de.wikipedia.org 1330252
zh.wikipedia.org 949228
fr.wikipedia.org 939602
commons.wikimedia.org 912965
de.m.wikipedia.org 664661
Time taken: 94.295 seconds, Fetched: 10 row(s)
On Dec 15, 2014, at 16:02, Dario Taraborelli <
dtaraborelli(a)wikimedia.org> wrote:
Oliver, Aaron – thanks for pushing this forward! Glad that we’re moving
on with the implementation.
On Dec 15, 2014, at 11:32 AM, Oliver Keyes <okeyes(a)wikimedia.org>
wrote:
Totally!
On 15 December 2014 at 14:22, Andrew Otto <aotto(a)wikimedia.org> wrote:
>
> Ah cool, didn’t realize there was a neutral definition. We should
> call that the ‘formal specification’ then.
>
> ...of course, now that I've said that, cosmic irony demands we end up
> implementing in C, or something.
>
> Hm, a UDF that does this rather than a Hive query would probably be
> better. E.g.
>
> SELECT
> request_qualifier(uri_host),
> count(*)
> FROM
> wmf_raw.webrequest
> WHERE
> is_pageview(uri_host, uri_path, http_status, content_type)
> GROUP BY
> request_qualifier(uri_host)
> ;
>
>
> Or something like that.
>
> -Ao
>
>
>
>
>
>
> On Dec 15, 2014, at 14:07, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
>
> It's totally tech-agnostic; the neutral definition is on meta. The
> hive query is just because, since we suspect that's how we'll be generating
> the data, it makes sense to turn the draft def into HQL for exploratory
> queries and testing.
>
> ...of course, now that I've said that, cosmic irony demands we end up
> implementing in C, or something.
>
> On 15 December 2014 at 13:46, Toby Negrin <tnegrin(a)wikimedia.org>
> wrote:
>>
>> I think the hive code is "representative" in that it's an
>> implementation. It's certainly not the only permitted one.
>>
>> On Dec 15, 2014, at 10:34 AM, Andrew Otto <aotto(a)wikimedia.org>
>> wrote:
>>
>> We're moving forward to generate Hive queries that will represent
>> the formal specification.
>>
>> Should a specific implementation (e.g. Hive) represent the formal
>> specification? I tend to think it should be tech-agnostic, no?
>>
>>
>>
>> On Dec 15, 2014, at 12:15, Aaron Halfaker <ahalfaker(a)wikimedia.org>
>> wrote:
>>
>> Toby, that's right. We're moving forward to generate Hive queries
>> that will represent the formal specification.
>>
>> -Aaron
>>
>> On Mon, Dec 15, 2014 at 9:12 AM, Oliver Keyes <okeyes(a)wikimedia.org>
>> wrote:
>>
>>> We've written the draft Hive queries and I'm reviewing them with
>>> Otto now. Currently blocked on Hadoop heapsize issues, but I'm sure
we'll
>>> work it through :).
>>>
>>> On 15 December 2014 at 12:10, Toby Negrin <tnegrin(a)wikimedia.org>
>>> wrote:
>>>>
>>>> Hi Aaron, all --
>>>>
>>>> I haven't seen any discussion on this which is a sign that we can
>>>> forward with turning over the draft. Thoughts?
>>>>
>>>> thanks,
>>>>
>>>> -Toby
>>>>
>>>> On Tue, Dec 9, 2014 at 5:15 PM, Aaron Halfaker <
>>>> ahalfaker(a)wikimedia.org> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> As discussions on the new page view definition have been calming
>>>>> down, we're preparing to deliver a draft version to the Devs. I
want to
>>>>> make sure that we all know the status and that any substantial
concerns are
>>>>> raised before we hand things off on *Friday, Dec 12th.*
>>>>>
>>>>> For this phase, we are delivering the general filter[1]. This is
>>>>> the highest level filter, and exists primarily to distinguish
requests
>>>>> worthy of further evaluation. Our plan is to take the definition as
it
>>>>> exists on the 12th, and begin generating high-level aggregate numbers
based
>>>>> on it. In future iterations, we will be digging into different
breakdowns
>>>>> of this metric, and iterating on it to handle any inconsistencies or
>>>>> unexpected results. There's a few differences from Web Stat
Collector's
>>>>> (WSC) version of the general filter that we want to call to your
attention
>>>>> to.
>>>>>
>>>>> - We include searches -- WSC explicitly excludes them.
>>>>> - We include Apps traffic -- WSC does not detect Apps traffic
>>>>> - We include variants of /wiki/ (e.g. /zh-tw/, /zh-cn/,
>>>>> /sr-ec/) -- WSC hardcodes "/wiki/"
>>>>> - We don't include Banner impressions -- WSC includes them.
>>>>>
>>>>> There are also some known issues with the new definition that are
>>>>> worth your notice:
>>>>>
>>>>>
>>>>> 1. *Internal traffic is counted*
>>>>>
>>>>>
>>>>> - Note that WSC filters some internal traffic by hardcoding a
>>>>> set of IPs in the definition. We are working on parsing puppet
templates
>>>>> in order to automatically detect which IPs represent internal
traffic.
>>>>> This will be a /better/ solution, but it's not quite ready yet
because
>>>>> parsing puppet is hard.
>>>>>
>>>>>
>>>>> 1. *Spider traffic is counted*
>>>>>
>>>>>
>>>>> - We will be using the User-agent field to detect and flag
>>>>> spider-based traffic. This "tag definition" will be
delivered in a
>>>>> subsequent definition. This actually matches WSC, which does not
filter
>>>>> spider for the high-level metrics.
>>>>>
>>>>> These are problems we're aware of, and will be factoring in as
we
>>>>> go forward with our next task: refining the definition using real,
>>>>> hourly-level traffic data. Thanks to everyone who has given feedback
and
>>>>> participated in the process thus far, particularly Nemo, Erik, and
>>>>> Christian.
>>>>>
>>>>> 1.
>>>>>
https://meta.wikimedia.org/wiki/Research:Page_view/Generalised_filters
>>>>>
>>>>> -Aaron & Oliver
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> Analytics(a)lists.wikimedia.org
>>>>>
https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics(a)lists.wikimedia.org
>>>>
https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> Analytics(a)lists.wikimedia.org
>>>
https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>>
https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>>
https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>>
https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
>
https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
>
https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics