Hi Analytics people,
Another batch of fields was added to the refined webrequest table in Hive today. The table now also contains:
- ts - The Unix timestamp (in milliseconds) corresponding to the dt date
- access_method - The method used to access the site, one of [mobile app | mobile web | desktop]
- agent_type - To differentiate easily between spiders and users (more values may be added later).
These additions are based on the "tags", as defined here: https://meta.wikimedia.org/wiki/Research:Page_view
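For anyone unsure how ts relates to dt, here is a minimal Python sketch of the conversion. It assumes dt is an ISO 8601 string in UTC such as 2015-04-10T00:00:00 (the announcement does not state the format), and the function name dt_to_ts_millis is made up for illustration; it is not the code that populates the table.

```python
import calendar
from datetime import datetime


def dt_to_ts_millis(dt: str) -> int:
    """Convert an ISO 8601 UTC string to a Unix timestamp in milliseconds."""
    # Assumed format: the announcement does not say how dt is encoded.
    parsed = datetime.strptime(dt, "%Y-%m-%dT%H:%M:%S")
    # timegm reads the time tuple as UTC and returns whole seconds.
    return calendar.timegm(parsed.timetuple()) * 1000


print(dt_to_ts_millis("2015-04-10T00:00:00"))  # 1428624000000
```

Integer arithmetic is used on purpose so that no floating-point rounding can creep into the millisecond value.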
Have a good weekend!
Hi Joseph,
Thanks for the update, and for doing this. These three items make the analysis of the data much easier on our end. We've had many requests in the past that required agent_type and access_method information and having them readily available is awesome! :-)
Have a great weekend!
Leila
On Fri, Apr 10, 2015 at 1:21 PM, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Hi Analytics people,
Another batch of fields was added to the refined webrequest table in Hive today. The table now also contains:
- ts - The Unix timestamp (in milliseconds) corresponding to the dt date
- access_method - The method used to access the site, one of [mobile app | mobile web | desktop]
- agent_type - To differentiate easily between spiders and users (more values may be added later).
These additions are based on the "tags", as defined here: https://meta.wikimedia.org/wiki/Research:Page_view
Have a good weekend!
-- *Joseph Allemandou* Data Engineer @ Wikimedia Foundation IRC: joal
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
And I forgot one field:
- is_zero - True if the request was made through a zero provider.
On Fri, Apr 10, 2015 at 10:36 PM, Leila Zia leila@wikimedia.org wrote:
Hi Joseph,
Thanks for the update, and for doing this. These three items make the analysis of the data much easier on our end. We've had many requests in the past that required agent_type and access_method information and having them readily available is awesome! :-)
Have a great weekend!
Leila
What does agent-type add? In the sense that if we're pre-parsing the user agent, surely the difference is between "WHERE agent_type != 'spider'" and "WHERE user_agent_map['device_family'] != 'Spider'"? Does agent_type include the isCrawler UDF results?
On 10 April 2015 at 16:47, Joseph Allemandou jallemandou@wikimedia.org wrote:
And I forgot one field :
is_zero - True if a request is made on a zero provider.
Yes Oliver, the agent_type = spider includes IsCrawler UDF.
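To make the difference from a plain device_family filter concrete, here is a small Python sketch of how the two signals might combine. The OR rule, the function name and the 'user' label are assumptions made for illustration; the only thing confirmed in this thread is that agent_type = spider includes the IsCrawler UDF result.

```python
def agent_type(device_family: str, is_crawler: bool) -> str:
    """Label a request as 'spider' or 'user' (hypothetical combination rule)."""
    # Assumption: either signal alone is enough to call the request a spider.
    if device_family == "Spider" or is_crawler:
        return "spider"
    return "user"


# A crawler that the user agent parser did not flag is still a spider here,
# which a filter on device_family alone would miss.
print(agent_type("Other", True))
```

Under that assumption, filtering on agent_type catches strictly more traffic than filtering on user_agent_map['device_family'] alone.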
On Fri, Apr 10, 2015 at 11:08 PM, Oliver Keyes okeyes@wikimedia.org wrote:
What does agent-type add? In the sense that if we're pre-parsing the user agent, surely the difference is between "WHERE agent_type != 'spider'" and "WHERE user_agent_map['device_family'] != 'Spider'"? Does agent_type include the isCrawler UDF results?
-- Oliver Keyes Research Analyst Wikimedia Foundation
Cool!
On 10 April 2015 at 17:12, Joseph Allemandou jallemandou@wikimedia.org wrote:
Yes Oliver, the agent_type = spider includes IsCrawler UDF.
Please clarify why the field "is_zero" is needed, as it is nothing more than a test for ("zero=" in x_analytics). Does having this field significantly improve performance for zero queries, e.g. "select count(*) from requests where is_zero = true"? Otherwise it simply identifies "zero partner" traffic, not "was that request actually zero rated or not".
Thanks!
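For reference, the test described above can be sketched in Python roughly as follows. It assumes the X-Analytics value is a semicolon-separated list of key=value pairs; the helper names and the sample values are hypothetical, and this is not the code behind the is_zero field.

```python
def parse_x_analytics(header: str) -> dict:
    """Split a 'key=value;key=value' header into a dict (assumed format)."""
    pairs = {}
    for item in header.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:  # skip empty or malformed items
            pairs[key] = value
    return pairs


def is_zero(header: str) -> bool:
    """True when the header carries a zero key, i.e. zero partner traffic."""
    return "zero" in parse_x_analytics(header)


print(is_zero("zero=250-99;https=1"))  # True
print(is_zero("https=1"))              # False
```

Parsing the keys rather than searching for the substring "zero=" avoids matching a key that merely ends in "zero". Either way, the result only says the request came through a zero partner, not that it was zero-rated.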
On Fri, Apr 10, 2015 at 5:16 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Cool!
I tried to move Zero analytics to the new table, and decided to test the new wonderful fields like agent_type ... and it only works on the most recent hours of data ((
https://phabricator.wikimedia.org/T95806
On Fri, Apr 10, 2015 at 8:51 PM, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
Please clarify why the field "is_zero" is needed, as it is nothing more than a test for ("zero=" in x_analytics). Does having this field significantly improve performance for zero queries, e.g. "select count(*) from requests where is_zero = true"? Otherwise it simply identifies "zero partner" traffic, not "was that request actually zero rated or not".
Thanks!
(Duplicated from bug):
That's not a bug. The complexity of regenerating ~60 days of data, where a day is 24*60*125000 rows, is extreme, and adding new fields means doing just that - regenerating the entire thing. As such, the decision was made to add to the field definition and only add actual values going forward from the point at which the patch was merged. This was true of the is_pageview calculation, the user agent data and the geolocation elements previously added, and is still true now.
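To put a number on that, the arithmetic works out as below. This is only a back-of-the-envelope sketch: reading the three factors as hours, minutes and rows per minute is an assumption, and the 60-day figure is the approximate one quoted above.

```python
HOURS_PER_DAY = 24
MINUTES_PER_HOUR = 60
ROWS_PER_MINUTE = 125_000  # assumed meaning of the last factor
DAYS_RETAINED = 60         # "~60 days" in the message

rows_per_day = HOURS_PER_DAY * MINUTES_PER_HOUR * ROWS_PER_MINUTE
rows_total = rows_per_day * DAYS_RETAINED

print(f"{rows_per_day:,} rows per day")                  # 180,000,000 rows per day
print(f"{rows_total:,} rows for {DAYS_RETAINED} days")   # 10,800,000,000 rows for 60 days
```

So a full regeneration would mean rewriting on the order of ten billion rows each time a field is added.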
On 11 April 2015 at 03:33, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
I tried to move Zero analytics to the new table, and decided to test the new wonderful fields like agent_type ... and it only works on the most recent hours of data ((
https://phabricator.wikimedia.org/T95806
Thanks Oliver! Is there a way to handle it in HQL? E.g. if(exists(is_pageview), is_pageview, null)? Finding out whether a field exists by observing a query crash seems wrong ))
On Apr 12, 2015 06:53, "Oliver Keyes" okeyes@wikimedia.org wrote:
(Duplicated from bug):
That's not a bug. The complexity of regenerating ~60 days of data, where a day is 24*60*125000 rows, is extreme, and adding new fields means doing just that - regenerating the entire thing. As such, the decision was made to add to the field definition and only add actual values going forward from the point at which the patch was merged. This was true of the is_pageview calculation, the user agent data and the geolocation elements previously added, and is still true now.
You probably have to do it conditionally by date
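One way to read "conditionally by date" is to restrict the query to partitions written after the new fields were deployed. The Python sketch below builds such an HQL string. The partition columns year, month and day, the table name webrequest and the cutoff date are all assumptions for illustration; use the actual merge date and partition layout.

```python
from datetime import date


def partition_filter_since(cutoff: date) -> str:
    """Return an HQL predicate selecting day partitions on or after cutoff."""
    y, m, d = cutoff.year, cutoff.month, cutoff.day
    return (f"(year > {y} OR (year = {y} AND month > {m}) "
            f"OR (year = {y} AND month = {m} AND day >= {d}))")


def build_query(cutoff: date) -> str:
    """Count requests per agent_type, skipping partitions older than cutoff."""
    return ("SELECT agent_type, COUNT(*) AS requests\n"
            "FROM webrequest\n"
            f"WHERE {partition_filter_since(cutoff)}\n"
            "GROUP BY agent_type")


# Hypothetical cutoff: the day the new fields were announced.
print(build_query(date(2015, 4, 10)))
```

Older partitions are simply excluded, so the query never touches data written before the fields existed.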
On Apr 12, 2015, at 12:38, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
Thanks Oliver! Is there a way to handle it in HQL? E.g. if(exists(is_pageview), is_pageview, null)? Finding out whether a field exists by observing a query crash seems wrong ))
Hi Yuri --
In general, I do not think this table will change a lot moving forward. We're migrating to a more complete definition right now so some changes are to be expected but things should settle down.
Thanks for the new fields!
-Toby
On Sun, Apr 12, 2015 at 9:55 AM, Andrew Otto aotto@wikimedia.org wrote:
You probably have to do it conditionally by date
On Apr 12, 2015, at 12:38, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
Thanks Oliver! Is there a way to handle it in hql? E.g if( exists(is_pageview),is_pageview,null)? Finding out if field exists by observing query crash seems wrong )) On Apr 12, 2015 06:53, "Oliver Keyes" okeyes@wikimedia.org wrote:
(Duplicated from bug):
That's not a bug. The complexity of regenerating ~60 days of data, where a day is 24*60*125000 rows, is extreme, and adding new fields means doing just that - regenerating the entire thing. As such, the decision was made to add to the field definition and only add actual values going forward from the point at which the patch was merged. This was true of the is_pageview calculation, the user agent data and the geolocation elements previously added, and is still true now.
On 11 April 2015 at 03:33, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
I tried to move Zero analytics to the new table, and decided to test
the new
wonderful fields like agent_type ... and it only works on the most
recent
hours of data ((
https://phabricator.wikimedia.org/T95806
On Fri, Apr 10, 2015 at 8:51 PM, Yuri Astrakhan <
yastrakhan@wikimedia.org>
wrote:
Please clarify why the field "is_zero" is needed, as it is nothing more than a test for ("zero=" in x_analytics). Does having this field significantly improve performance for zero queries, e.g. "select
count(*)
from requests where iszero = true" ? Because otherwise it simply
identifies
"zero partner" traffic, not "was that request actually zero rated or
not".
Thanks!
On Fri, Apr 10, 2015 at 5:16 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Cool!
On 10 April 2015 at 17:12, Joseph Allemandou <jallemandou@wikimedia.org> wrote:
Yes Oliver, the agent_type = spider includes IsCrawler UDF.
On Fri, Apr 10, 2015 at 11:08 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
What does agent_type add? In the sense that if we're pre-parsing the user agent, surely the difference is between "WHERE agent_type != 'spider'" and "WHERE user_agent_map['device_family'] != 'Spider'"? Does agent_type include the isCrawler UDF results?
-- Oliver Keyes Research Analyst Wikimedia Foundation
Look at the record_version field to know if the new column is populated. https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Changes_and_kn...
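One way to see where record_version changes is to group a single day by it, sketched here in HiveQL. The table name wmf.webrequest, the webrequest_source/year/month/day/hour partition columns, the 'text' source value and the choice of 10 April 2015 as the day to inspect are all assumptions; the version values themselves are documented on the wiki page linked above.

```sql
-- Sketch only: table name, partition columns, source value and day are assumptions.
SELECT
  record_version,
  MIN(hour) AS first_hour,   -- earliest hour seen with this version
  MAX(hour) AS last_hour,    -- latest hour seen with this version
  COUNT(*)  AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2015
  AND month = 4
  AND day = 10
GROUP BY record_version;
```

The query reads only record_version and the partition columns, so it does not depend on the newly added fields being present in older data.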