For Hadoop Streaming, it’ll be a little annoying. This new data is in Parquet.
Hadoop Streaming is still using the old MapReduce 1 API, and most of the officially
supported Parquet input formats are for MapReduce 2 API, so by default Parquet and Hadoop
Streaming are incompatible.
However! Some guy already ran into this problem and wrote this:
https://github.com/whale2/iow-hadoop-streaming/blob/master/src/main/java/net/iponweb/hadoop/streaming/parquet/ParquetAsJsonInputFormat.java
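
If that input format works out, the mapper side in Python could be pretty small. Here is a rough sketch, not something I have run on the cluster. It assumes each line on stdin carries one record as a JSON object (possibly after a tab-separated key), and the field names is_pageview and uri_host are guesses on my part, so check them against the real table schema. The function names are just for illustration.

```python
#!/usr/bin/env python
"""Hadoop Streaming mapper sketch: keep only pre-classified pageviews."""
import json
import sys

# Assumed field names; check them against the actual webrequest schema.
PAGEVIEW_FIELD = "is_pageview"
GROUP_FIELD = "uri_host"


def parse_record(line):
    """Return the JSON object found on a line, or None if there is none."""
    # Streaming may prepend a key and a tab; the JSON object is assumed
    # to start at the first "{" on the line.
    start = line.find("{")
    if start < 0:
        return None
    try:
        record = json.loads(line[start:])
    except ValueError:
        return None
    return record if isinstance(record, dict) else None


def map_lines(lines):
    """Yield "<group>\\t1" for every record flagged as a pageview."""
    for line in lines:
        record = parse_record(line.rstrip("\n"))
        # Only a real JSON boolean true counts as a pageview here.
        if record and record.get(PAGEVIEW_FIELD) is True:
            yield "%s\t1" % record.get(GROUP_FIELD, "unknown")


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write tab-separated counts to stdout."""
    for out in map_lines(stdin):
        stdout.write(out + "\n")
```

To use it as the streaming mapper you would add a call to main() at the bottom of the script, pass the script with -mapper, and name the class from the link above with -inputformat. A summing reducer would then give pageviews per host.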
> On Jan 7, 2015, at 18:40, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
> I am not sure if this is quite what you are asking, but just in case:
> for streaming it is probably easier for you to use the newly created
> webrequest tables:
>
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive#Webrequest_Table.28s.29
>
> Those include an isPageview field so requests are pre-classified. You will need to
> wait a bit as data for those tables is being populated starting today.
>
>
>
> On Wed, Jan 7, 2015 at 3:35 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
> Cool! Let's say I want to review the filters and apply them in a Python script.
> What should I reference?
>
> On Wed, Jan 7, 2015 at 5:13 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> I'm pleased to say we now have the prototype pageviews definition as a UDF!
>
> For those with cluster access:
>
> CREATE TEMPORARY FUNCTION pageview as
> 'org.wikimedia.analytics.refinery.hive.isPageviewUDF';
>
> ...and then just apply it. It outputs a boolean, so you can easily go
> WHERE pageview(fields) and treat it as a conditional. Great
> success!
>
> What this means for the definition is twofold: it means it's a lot
> easier to test its accuracy, and it means that it's a lot easier to
> make sure we're all using the same definition going forward. Once we
> have the legacy definition as a UDF, refining and testing will proceed
> at great speed, although I encourage anyone with time on their hands
> who wants to help out to do some testing of their own :)
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>