I just posted a comment on the famous task: https://phabricator.wikimedia.org/T44259#1341010 :)
Here it is for those who would rather discuss on this list:
We have finished analyzing the intermediate hourly aggregate with all the columns that we think are interesting. The data is too large to query and anonymize in real time. We'd rather get an API out faster than deal with that problem, so we decided to produce smaller "cubes" [1] of data for specific purposes. We have two cubes in mind and I'll explain those here. For each cube, we're aiming to have:
* Direct access to a postgresql database in labs with the data
* API access through RESTBase
* Mondrian / Saiku access in labs for dimensional analysis
* Data will be pre-aggregated so that any single data point has k-anonymity (we have not determined a good k yet)
* Higher level aggregations will be pre-computed so they use all data
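[Editor's sketch of the k-anonymity step, reading it as the simplest count-threshold rule; the dimension names, row values, and the value of k below are illustrative, since the thread explicitly leaves k undecided:]

```python
# Sketch: suppress pre-aggregated rows whose count falls below a
# k-anonymity threshold before publication. All values are made up.

K = 100  # hypothetical threshold; the thread notes a good k is not yet chosen

rows = [
    # (project, page_title, hour, views)
    ("en.wikipedia", "Influenza", "2015-06-05T00", 1523),
    ("en.wikipedia", "Obscure_Topic", "2015-06-05T00", 3),
]

def k_anonymize(rows, k):
    """Keep only aggregates large enough that no single data point is identifying."""
    return [r for r in rows if r[-1] >= k]

publishable = k_anonymize(rows, K)  # only the 1523-view row survives
```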
And, the cubes are:
**stats.grok.se Cube: basic pageview data**
Hourly resolution. Will serve the same purpose as stats.grok.se has served for so many years. The dimensions available will be:
* project - 'Project name from requests host name'
* dialect - 'Dialect from requests path (not set if present in project name)'
* page_title - 'Page Title from requests path and query'
* access_method - 'Method used to access the pages, can be desktop, mobile web, or mobile app'
* is_zero - 'accessed through a zero provider'
* agent_type - 'Agent accessing the pages, can be spider or user'
* referer_class - 'Can be internal, external or unknown'
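[Editor's sketch of how a cube with these dimensions can be pictured: each cell is a count keyed by the full dimension tuple, and a stats.grok.se-style query sums over the dimensions it does not filter on. The cell values and query shape are illustrative, not the real storage layout:]

```python
# Illustrative cells of the hourly cube: dimension tuple -> pageview count.
cube = {
    # (project, dialect, page_title, access_method, is_zero, agent_type, referer_class)
    ("en.wikipedia", None, "Main_Page", "desktop", False, "user", "external"): 500,
    ("en.wikipedia", None, "Main_Page", "mobile web", False, "user", "internal"): 300,
    ("en.wikipedia", None, "Main_Page", "desktop", False, "spider", "unknown"): 70,
}

def total_views(cube, project, page_title):
    """stats.grok.se-style query: total views for one page, rolling up all other dimensions."""
    return sum(
        count for dims, count in cube.items()
        if dims[0] == project and dims[2] == page_title
    )

total_views(cube, "en.wikipedia", "Main_Page")  # 870
```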
**Geo Cube: geo-coded pageview data**
Daily resolution. Will allow researchers to track the flu, breaking news, etc. Dimensions will be:
* project - 'Project name from requests hostname'
* page_title - 'Page Title from requests path and query'
* country_code - 'Country ISO code of the accessing agents (computed using MaxMind GeoIP database)'
* province - 'State / Province of the accessing agents (computed using MaxMind GeoIP database)'
* city - 'Metro area of the accessing agents (computed using MaxMind GeoIP database)'
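[Editor's sketch of the "higher level aggregations use all data" point applied to the geo hierarchy: country totals are computed before city-level suppression, so dropping a small city cell does not shrink the country total. Cell values and the k threshold are made up:]

```python
from collections import Counter

K = 100  # hypothetical threshold

# Illustrative daily geo cells: (country, province, city, page_title) -> views
cells = {
    ("US", "MA", "Boston", "Influenza"): 400,
    ("US", "MA", "Arlington", "Influenza"): 12,   # below k: suppressed at city level
    ("US", "VA", "Arlington", "Influenza"): 250,
}

# Country-level totals use ALL data, including cells too small to publish alone.
country_totals = Counter()
for (country, _prov, _city, page), views in cells.items():
    country_totals[(country, page)] += views

# City-level release keeps only cells that meet the threshold.
city_level = {dims: v for dims, v in cells.items() if v >= K}

# country_totals[("US", "Influenza")] == 662, even though the 12-view
# Arlington, MA cell is absent from the published city-level data.
```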
So, if anyone wants another cube, **now** is the time to speak up. We'll probably add cubes later, but it may be a while.
[1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
My only thought is that "city" makes me uncomfortable. Did we track down a precise use case for that in the end?
On 5 June 2015 at 09:25, Dan Andreescu dandreescu@wikimedia.org wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
My only thought is that "city" makes me uncomfortable. Did we track down a precise use case for that in the end?
Yes, the Los Alamos National Lab folks' proposal: https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagevi...
We talked to them yesterday and it seems the time granularity is not as important. That's why that dataset is *daily* and the other one is *hourly*. Either way, these will be k-anonymized at any level. Once we have some data up, though, I'd love for people who are good at this to try and attack the datasets in combination and from different points of view like t-closeness, etc. I don't want to leak any info and any help on that is appreciated 'cause it's a hard problem.
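[Editor's toy illustration of one combined-release attack worth testing here, with made-up numbers, not a claim about the real datasets: when a higher-level total is published from all data but small cells beneath it are suppressed, subtraction can recover the suppressed remainder.]

```python
K = 100  # hypothetical threshold

country_total = 105            # published country-level count (uses all data)
published_city_counts = [100]  # the only city cell that met the threshold

# Difference attack: the views in suppressed cities are recoverable by
# subtraction, and here the residual is a small, potentially identifying count.
residual = country_total - sum(published_city_counts)
leaks = 0 < residual < K  # True: 5 views are attributable to suppressed cities
```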
Gotcha. Reading that proposal it appears to be a proposal for a methodology that will enable future proposals; where are the future proposals? It also says "in many countries, disease monitoring must be carried out at the state or metro-area level" - which countries have to be metro-level? Who are we risking the entire reader population for, here? Is it one country, or ten, or?
For what it's worth I love the idea of this kind of live stream. But I want to make sure that how the various chunks are being prioritised, and how critical they are to the outside world, is correlated - and is correlated with the underlying data's sensitivity, at that. If we're introducing risks by going down to city level and the actual use cases for city level data are limited, let's not do that - but this proposal doesn't provide thoughts on how limited those use cases are. It just says that it's required in some countries.
On 5 June 2015 at 09:35, Dan Andreescu dandreescu@wikimedia.org wrote:
Gotcha. Reading that proposal it appears to be a proposal for a methodology that will enable future proposals; where are the future proposals?
Well, so the geo cube has to guess a bit at who would find it useful in the future.
It also says "in many countries, disease monitoring must be carried out at the state or metro-area level" - which countries have to be metro-level? Who are we risking the entire reader population for, here? Is it one country, or ten, or?
For what it's worth I love the idea of this kind of live stream. But I want to make sure that how the various chunks are being prioritised, and how critical they are to the outside world, is correlated - and is correlated with the underlying data's sensitivity, at that. If we're introducing risks by going down to city level and the actual use cases for city level data are limited, let's not do that - but this proposal doesn't provide thoughts on how limited those use cases are. It just says that it's required in some countries.
I agree with you, but I'm not sure the data is risky if it's k-anonymous. Most likely, just doing that will limit the countries for which metro level data is available.
On 5 June 2015 at 10:38, Dan Andreescu dandreescu@wikimedia.org wrote:
Gotcha. Reading that proposal it appears to be a proposal for a methodology that will enable future proposals; where are the future proposals?
Well, so the geo cube has to guess a bit at who would find it useful in the future.
It also says "in many countries, disease monitoring must be carried out at the state or metro-area level" - which countries have to be metro-level? Who are we risking the entire reader population for, here? Is it one country, or ten, or?
For what it's worth I love the idea of this kind of live stream. But I want to make sure that how the various chunks are being prioritised, and how critical they are to the outside world, is correlated - and is correlated with the underlying data's sensitivity, at that. If we're introducing risks by going down to city level and the actual use cases for city level data are limited, let's not do that - but this proposal doesn't provide thoughts on how limited those use cases are. It just says that it's required in some countries.
I agree with you, but I'm not sure the data is risky if it's k-anonymous. Most likely, just doing that will limit the countries for which metro level data is available.
I don't think it is if it is! As you said, though, we need to hammer on it for a while to make absolutely sure it's okay, and using lower-resolution data would not only make this easier but also reduce the cost of getting people wrong (geolocating people to MA is less dangerous than geolocating them to Arlington)
Thanks Dan, and apologies if these are naive questions:
For mobile web can we also see beta v. stable? This is important for tracking prototypes, which is one of the core product uses for this data.
For apps can we see ios v android?
On Fri, Jun 5, 2015 at 8:39 AM, Oliver Keyes okeyes@wikimedia.org wrote:
I came across another potential requirement from the WP Zero team: add x-analytics['zero'] to the dimensions. This would allow the zero team to get pageviews per partner carrier. Our partners are interested in this data; however, they don't want to share it with anyone, as it is competitive data, and we can't make it public.
On Fri, Jun 5, 2015 at 10:51 AM, Jon Katz jkatz@wikimedia.org wrote:
Yep, the Zero tag would be very useful. Also, some things the zero team found highly useful are: via (the proxy value in x-analytics), https vs. http, and the zero subdomain vs. the m subdomain.
On Jun 5, 2015 14:51, "Kevin Leduc" kevin@wikimedia.org wrote:
If we can't share it with the public then it seems like it shouldn't be part of a proposal for an API.
On 5 June 2015 at 14:56, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
On Fri, Jun 5, 2015 at 3:09 PM, Oliver Keyes okeyes@wikimedia.org wrote:
If we can't share it with the public then it seems like it shouldn't be part of a proposal for an API.
Right, to clarify, this proposal is for a public data set and API.
Thanks Dan, and apologies if these are naive questions:
For mobile web can we also see beta v. stable? This is important for tracking prototypes, which is one of the core product uses for this data.
For apps can we see ios v android?
Jon, we chose to not include that information in order to limit the amount of data that we'd have to deal with. If it gets too large, it won't fit into PostgreSQL. For the iOS / Android and beta / alpha versions of the site we can either:
* Make a new cube that examines site versions and client information
* Just use the private data as we're already doing, but aggregate it hourly or daily as needed, to make analysis much faster.
Hi Dan, Sorry for the late response to this--
* Make a new cube that examines site versions and client information
* Just use the private data as we're already doing, but aggregate it hourly or daily as needed, to make analysis much faster.
How can I help add/keep this to/on your roadmap? -J
On Fri, Jun 5, 2015 at 12:28 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
In light of the recent switch to HTTPS, what about adding http/https information? Maybe it can be added to the 'access_method' dimension rather than adding a new dimension?
On Thu, Jun 11, 2015 at 1:46 PM, Jon Katz jkatz@wikimedia.org wrote:
X-analytics contains HTTPS=1 for all text hits, but not for other types of traffic.
On Jun 15, 2015 00:54, "Kevin Leduc" kevin@wikimedia.org wrote:
Well, that's not the case; https=1 was added for the apps, and so hit both mobile and text varnishes. Since all pageviews go through the text or mobile sources, all pageviews note (implicitly or explicitly) their https status.
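[Editor's note: since several of the proposed dimensions in this thread (https, zero, via/proxy) live in the X-Analytics header, here is a minimal parser sketch to picture the data involved. The semicolon-separated key=value format and the sample values are assumptions for illustration, not confirmed by this thread.]

```python
def parse_x_analytics(header):
    """Parse an X-Analytics-style header of semicolon-separated key=value pairs.
    Keys without '=' are skipped; whitespace around keys/values is trimmed."""
    fields = {}
    for part in header.split(";"):
        if "=" in part:
            key, _, value = part.partition("=")
            fields[key.strip()] = value.strip()
    return fields

# Hypothetical header with the fields discussed above.
example = parse_x_analytics("https=1;zero=123-45;proxy=Opera")
# example["https"] == "1", example.get("zero") == "123-45"
```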
Regarding the initial suggestion: no. Don't add new fields. Don't propose new fields. Don't do anything with new fields - freeze the definition already. We've had a pageviews definition for 6 months, we've had unreliability in Henrik's third-party service for 12, and every time there's a "should we add a new field?" proposal it slows down implementing an alternative. We need that alternative.
On 15 June 2015 at 00:57, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
Two quick updates:
What Oliver said resonates with us; we are doing everything possible to focus and keep the project moving instead of satisfying all possible requirements at launch.
We have been working on our goals (not yet finalized) to include "Pageview API by September". There is quite a bit of puppetizing and productionizing to do, but we are removing more and more distractions, so I personally feel optimistic.
Also, Jon K and other internal folks, the intermediate aggregate is available for you to query and it's updated hourly.
On Mon, Jun 15, 2015 at 9:40 AM, Oliver Keyes okeyes@wikimedia.org wrote: