I'm pleased to say we now have the prototype pageviews definition as a UDF!
For those with cluster access:
CREATE TEMPORARY FUNCTION pageview as 'org.wikimedia.analytics.refinery.hive.isPageviewUDF';
...and then just apply it. It outputs a boolean, so you can easily go WHERE pageview(fields) and treat it as a conditional. Great success!
What this means for the definition is twofold: it's a lot easier to test its accuracy, and it's a lot easier to make sure we're all using the same definition going forward. Once we have the legacy definition as a UDF, refining and testing will proceed at great speed, although I encourage anyone with time on their hands who wants to help out to do some testing of their own :)
Cool! Let's say I want to review the filters and apply them in a python script. What should I reference?
On Wed, Jan 7, 2015 at 5:13 PM, Oliver Keyes okeyes@wikimedia.org wrote:
I'm pleased to say we now have the prototype pageviews definition as a UDF!
For those with cluster access:
CREATE TEMPORARY FUNCTION pageview as 'org.wikimedia.analytics.refinery.hive.isPageviewUDF';
...and then just apply it. It outputs a boolean, so you can easily go WHERE pageview(fields) and treat it as a conditional. Great success!
What this means for the definition is twofold: it's a lot easier to test its accuracy, and it's a lot easier to make sure we're all using the same definition going forward. Once we have the legacy definition as a UDF, refining and testing will proceed at great speed, although I encourage anyone with time on their hands who wants to help out to do some testing of their own :)
-- Oliver Keyes Research Analyst Wikimedia Foundation
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I am not sure if this is quite what you are asking but just in case:
For streaming, it is probably easier for you to use the newly created webrequest tables:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive#Webrequest_Table.28s.29
Those include an isPageview field so requests are pre-classified. You will need to wait a bit as data for those tables is being populated starting today.
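For a Python script working from pre-classified rows, the filtering step could look roughly like the sketch below. It assumes the rows were exported as tab-separated text with a header line and that the flag is rendered as the text true/false. The column names isPageview and uri_path in the sample are illustrative assumptions, not a confirmed schema.

```python
def iter_pageviews(lines, flag_column="isPageview", delimiter="\t"):
    """Yield rows (as dicts) whose pre-classified flag is true.

    Assumptions: the first line is a header naming the columns, and the
    boolean flag appears as the text "true" or "false".
    """
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input, nothing to yield
    names = header.rstrip("\n").split(delimiter)
    for line in lines:
        values = line.rstrip("\n").split(delimiter)
        row = dict(zip(names, values))
        # Keep only rows already classified as pageviews.
        if row.get(flag_column, "").strip().lower() == "true":
            yield row


# Small in-memory sample; the column names are hypothetical.
sample = [
    "uri_path\tisPageview\n",
    "/wiki/Main_Page\ttrue\n",
    "/w/api.php\tfalse\n",
]
pageview_paths = [row["uri_path"] for row in iter_pageviews(sample)]
```

In a real script the same function could be fed sys.stdin instead of the sample list.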
On Wed, Jan 7, 2015 at 3:35 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Cool! Let's say I want to review the filters and apply them in a python script. What should I reference?
That's great and it will serve most of my use cases. Any chance we can get that field added to the sampled logs & hourly counts?
On Wed, Jan 7, 2015 at 5:40 PM, Nuria Ruiz nuria@wikimedia.org wrote:
I am not sure if this is quite what you are asking but just in case:
For streaming, it is probably easier for you to use the newly created webrequest tables:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive#Webrequest_Table.28s.29
Those include an isPageview field so requests are pre-classified. You will need to wait a bit as data for those tables is being populated starting today.
Just realized that hourly counts won't need it -- because they'll be generated from page views anyway!
On Wed, Jan 7, 2015 at 5:41 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
That's great and it will serve most of my use cases. Any chance we can get that field added to the sampled logs & hourly counts?
I am not sure if this is quite what you are asking but just in case:
For streaming, it is probably easier for you to use the newly created webrequest tables:
For Hadoop Streaming, it’ll be a little annoying. This new data is in Parquet. Hadoop Streaming is still using the old MapReduce 1 API, and most of the officially supported Parquet input formats are for MapReduce 2 API, so by default Parquet and Hadoop Streaming are incompatible.
However! Some guy already ran into this problem and wrote this:
https://github.com/whale2/iow-hadoop-streaming/blob/master/src/main/java/net/iponweb/hadoop/streaming/parquet/ParquetAsJsonInputFormat.java
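If that input format does hand each record to the streaming job as a JSON object on its own line (an assumption based on the class name, not something verified here), a Python mapper could be sketched as follows. The field names isPageview and uri_host are hypothetical, and map_pageviews is a made-up name for illustration.

```python
import json


def map_pageviews(lines, flag_field="isPageview", key_field="uri_host"):
    """Emit "key<TAB>1" for every JSON record flagged as a pageview.

    Assumption: each input line holds one JSON object, possibly preceded
    by a streaming key and a tab. Field names are hypothetical.
    """
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON object on this line
        try:
            record = json.loads(line[start:])
        except ValueError:
            continue  # skip malformed records rather than fail the task
        if record.get(flag_field) is True:
            yield "%s\t1" % record.get(key_field, "")


# Small in-memory sample standing in for the mapper's standard input.
sample = [
    '{"uri_host": "en.wikipedia.org", "isPageview": true}\n',
    '0\t{"uri_host": "de.wikipedia.org", "isPageview": true}\n',
    '{"uri_host": "commons.wikimedia.org", "isPageview": false}\n',
    "not json\n",
]
mapped = list(map_pageviews(sample))
```

A reducer summing the emitted 1s per key would then give pageview counts per host.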
Great!
On Wed, Jan 7, 2015 at 5:49 PM, Andrew Otto aotto@wikimedia.org wrote:
I am not sure if this is quite what you are asking but just in case:
For streaming, it is probably easier for you to use the newly created webrequest tables:
For Hadoop Streaming, it’ll be a little annoying. This new data is in Parquet. Hadoop Streaming is still using the old MapReduce 1 API, and most of the officially supported Parquet input formats are for MapReduce 2 API, so by default Parquet and Hadoop Streaming are incompatible.
However! Some guy already ran into this problem and wrote this:
https://github.com/whale2/iow-hadoop-streaming/blob/master/src/main/java/net/iponweb/hadoop/streaming/parquet/ParquetAsJsonInputFormat.java