I just posted a comment on the famous task: https://phabricator.wikimedia.org/T44259#1341010 :)
Here it is for those who would rather discuss on this list:
We have finished analyzing the intermediate hourly aggregate with all the columns that we think are interesting. The data is too large to query and anonymize in real time. We'd rather get an API out faster than deal with that problem, so we decided to produce smaller "cubes" [1] of data for specific purposes. We have two cubes in mind and I'll explain those here. For each cube, we're aiming to have:
* Direct access to a postgresql database in labs with the data
* API access through RESTBase
* Mondrian / Saiku access in labs for dimensional analysis
* Data will be pre-aggregated so that any single data point has k-anonymity (we have not determined a good k yet)
* Higher level aggregations will be pre-computed so they use all data
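[Editor's sketch of the k-anonymity step, reading it as the simplest count-threshold rule; the dimension names, row values, and the value of k below are illustrative, since the thread explicitly leaves k undecided:]

```python
# Sketch: suppress pre-aggregated rows whose count falls below a
# k-anonymity threshold before publication. All values are made up.

K = 100  # hypothetical threshold; the thread notes a good k is not yet chosen

rows = [
    # (project, page_title, hour, views)
    ("en.wikipedia", "Influenza", "2015-06-05T00", 1523),
    ("en.wikipedia", "Obscure_Topic", "2015-06-05T00", 3),
]

def k_anonymize(rows, k):
    """Keep only aggregates large enough that no single data point is identifying."""
    return [r for r in rows if r[-1] >= k]

publishable = k_anonymize(rows, K)  # only the 1523-view row survives
```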
And, the cubes are:
**stats.grok.se Cube: basic pageview data**
Hourly resolution. Will serve the same purpose as stats.grok.se has served for so many years. The dimensions available will be:
* project - 'Project name from requests host name'
* dialect - 'Dialect from requests path (not set if present in project name)'
* page_title - 'Page Title from requests path and query'
* access_method - 'Method used to access the pages, can be desktop, mobile web, or mobile app'
* is_zero - 'accessed through a zero provider'
* agent_type - 'Agent accessing the pages, can be spider or user'
* referer_class - 'Can be internal, external or unknown'
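[Editor's sketch of how a cube with these dimensions can be pictured: each cell is a count keyed by the full dimension tuple, and a stats.grok.se-style query sums over the dimensions it does not filter on. The cell values and query shape are illustrative, not the real storage layout:]

```python
# Illustrative cells of the hourly cube: dimension tuple -> pageview count.
cube = {
    # (project, dialect, page_title, access_method, is_zero, agent_type, referer_class)
    ("en.wikipedia", None, "Main_Page", "desktop", False, "user", "external"): 500,
    ("en.wikipedia", None, "Main_Page", "mobile web", False, "user", "internal"): 300,
    ("en.wikipedia", None, "Main_Page", "desktop", False, "spider", "unknown"): 70,
}

def total_views(cube, project, page_title):
    """stats.grok.se-style query: total views for one page, rolling up all other dimensions."""
    return sum(
        count for dims, count in cube.items()
        if dims[0] == project and dims[2] == page_title
    )

total_views(cube, "en.wikipedia", "Main_Page")  # 870
```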
**Geo Cube: geo-coded pageview data**
Daily resolution. Will allow researchers to track the flu, breaking news, etc. Dimensions will be:
* project - 'Project name from requests hostname'
* page_title - 'Page Title from requests path and query'
* country_code - 'Country ISO code of the accessing agents (computed using MaxMind GeoIP database)'
* province - 'State / Province of the accessing agents (computed using MaxMind GeoIP database)'
* city - 'Metro area of the accessing agents (computed using MaxMind GeoIP database)'
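[Editor's sketch of the "higher level aggregations use all data" point applied to the geo hierarchy: country totals are computed before city-level suppression, so dropping a small city cell does not shrink the country total. Cell values and the k threshold are made up:]

```python
from collections import Counter

K = 100  # hypothetical threshold

# Illustrative daily geo cells: (country, province, city, page_title) -> views
cells = {
    ("US", "MA", "Boston", "Influenza"): 400,
    ("US", "MA", "Arlington", "Influenza"): 12,   # below k: suppressed at city level
    ("US", "VA", "Arlington", "Influenza"): 250,
}

# Country-level totals use ALL data, including cells too small to publish alone.
country_totals = Counter()
for (country, _prov, _city, page), views in cells.items():
    country_totals[(country, page)] += views

# City-level release keeps only cells that meet the threshold.
city_level = {dims: v for dims, v in cells.items() if v >= K}

# country_totals[("US", "Influenza")] == 662, even though the 12-view
# Arlington, MA cell is absent from the published city-level data.
```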
So, if anyone wants another cube, **now** is the time to speak up. We'll probably add cubes later, but it may be a while.
[1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
My only thought is that "city" makes me uncomfortable. Did we track down a precise use case for that in the end?
On 5 June 2015 at 09:25, Dan Andreescu dandreescu@wikimedia.org wrote:
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
My only thought is that "city" makes me uncomfortable. Did we track down a precise use case for that in the end?
Yes, the Los Alamos National Lab folks' proposal: https://meta.wikimedia.org/wiki/Research:Geo-aggregation_of_Wikipedia_pagevi...
We talked to them yesterday and it seems the time granularity is not as important. That's why that dataset is *daily* and the other one is *hourly*. Either way, these will be k-anonymized at any level. Once we have some data up, though, I'd love for people who are good at this to try and attack the datasets in combination and from different points of view like t-closeness, etc. I don't want to leak any info and any help on that is appreciated 'cause it's a hard problem.
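[Editor's toy illustration of one combined-release attack worth testing here, with made-up numbers, not a claim about the real datasets: when a higher-level total is published from all data but small cells beneath it are suppressed, subtraction can recover the suppressed remainder.]

```python
K = 100  # hypothetical threshold

country_total = 105            # published country-level count (uses all data)
published_city_counts = [100]  # the only city cell that met the threshold

# Difference attack: the views in suppressed cities are recoverable by
# subtraction, and here the residual is a small, potentially identifying count.
residual = country_total - sum(published_city_counts)
leaks = 0 < residual < K  # True: 5 views are attributable to suppressed cities
```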
Gotcha. Reading that proposal it appears to be a proposal for a methodology that will enable future proposals; where are the future proposals? It also says "in many countries, disease monitoring must be carried out at the state or metro-area level" - which countries have to be metro-level? Who are we risking the entire reader population for, here? Is it one country, or ten, or?
For what it's worth I love the idea of this kind of live stream. But I want to make sure that how the various chunks are being prioritised, and how critical they are to the outside world, is correlated - and is correlated with the underlying data's sensitivity, at that. If we're introducing risks by going down to city level and the actual use cases for city level data are limited, let's not do that - but this proposal doesn't provide thoughts on how limited those use cases are. It just says that it's required in some countries.
On 5 June 2015 at 09:35, Dan Andreescu dandreescu@wikimedia.org wrote:
Gotcha. Reading that proposal it appears to be a proposal for a methodology that will enable future proposals; where are the future proposals?
Well, so the geo cube has to guess a bit at who would find it useful in the future.
It also says "in many countries, disease monitoring must be carried out at the state or metro-area level" - which countries have to be metro-level? Who are we risking the entire reader population for, here? Is it one country, or ten, or?
For what it's worth I love the idea of this kind of live stream. But I want to make sure that how the various chunks are being prioritised, and how critical they are to the outside world, is correlated - and is correlated with the underlying data's sensitivity, at that. If we're introducing risks by going down to city level and the actual use cases for city level data are limited, let's not do that - but this proposal doesn't provide thoughts on how limited those use cases are. It just says that it's required in some countries.
I agree with you, but I'm not sure the data is risky if it's k-anonymous. Most likely, just doing that will limit the countries for which metro level data is available.
On 5 June 2015 at 10:38, Dan Andreescu dandreescu@wikimedia.org wrote:
Gotcha. Reading that proposal it appears to be a proposal for a methodology that will enable future proposals; where are the future proposals?
Well, so the geo cube has to guess a bit at who would find it useful in the future.
It also says "in many countries, disease monitoring must be carried out at the state or metro-area level" - which countries have to be metro-level? Who are we risking the entire reader population for, here? Is it one country, or ten, or?
For what it's worth I love the idea of this kind of live stream. But I want to make sure that how the various chunks are being prioritised, and how critical they are to the outside world, is correlated - and is correlated with the underlying data's sensitivity, at that. If we're introducing risks by going down to city level and the actual use cases for city level data are limited, let's not do that - but this proposal doesn't provide thoughts on how limited those use cases are. It just says that it's required in some countries.
I agree with you, but I'm not sure the data is risky if it's k-anonymous. Most likely, just doing that will limit the countries for which metro level data is available.
I don't think it is if it is! As you said, though, we need to hammer on it for a while to make absolutely sure it's okay, and using lower-resolution data would not only make this easier but also reduce the cost of getting people wrong (geolocating people to MA is less dangerous than geolocating them to Arlington)
Thanks Dan, and apologies if these are naive questions:
For mobile web can we also see beta v. stable? This is important for tracking prototypes, which is one of the core product uses for this data.
For apps can we see ios v android?
On Fri, Jun 5, 2015 at 8:39 AM, Oliver Keyes okeyes@wikimedia.org wrote:
I came across another potential requirement from the WP Zero team: add x-analytics['zero'] to the dimensions. This would allow the zero team to get pageviews per partner carrier. Our partners are interested in this data; however, they don't want to share it with anyone, as it is competitive data, and we can't make it public.
On Fri, Jun 5, 2015 at 10:51 AM, Jon Katz jkatz@wikimedia.org wrote:
Yep, the Zero tag would be very useful. Also, some things the zero team found highly useful are: via (the proxy value in x-analytics), https vs. http, and the zero subdomain vs. the m subdomain.
On Jun 5, 2015 14:51, "Kevin Leduc" kevin@wikimedia.org wrote:
If we can't share it with the public then it seems like it shouldn't be part of a proposal for an API.
On 5 June 2015 at 14:56, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
On Fri, Jun 5, 2015 at 3:09 PM, Oliver Keyes okeyes@wikimedia.org wrote:
If we can't share it with the public then it seems like it shouldn't be part of a proposal for an API.
Right, to clarify, this proposal is for a public data set and API.
Thanks Dan, and apologies if these are naive questions:
For mobile web can we also see beta v. stable? This is important for tracking prototypes, which is one of the core product uses for this data.
For apps can we see ios v android?
Jon, we chose to not include that information in order to limit the amount of data that we'd have to deal with. If it gets too large, it won't fit into PostgreSQL. For the iOS / Android and beta / alpha versions of the site we can either:
* Make a new cube that examines site versions and client information
* Just use the private data as we're already doing, but aggregate it hourly or daily as needed, to make analysis much faster.
Hi Dan, Sorry for the late response to this--
* Make a new cube that examines site versions and client information
* Just use the private data as we're already doing, but aggregate it hourly or daily as needed, to make analysis much faster.
How can I help add/keep this to/on your roadmap? -J
On Fri, Jun 5, 2015 at 12:28 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
In light of the recent switch to HTTPS, what about adding http/https information? Maybe it can be added to the 'access_method' dimension rather than adding a new dimension?
On Thu, Jun 11, 2015 at 1:46 PM, Jon Katz jkatz@wikimedia.org wrote:
X-analytics contains HTTPS=1 for all text hits, but not for other types of traffic.
On Jun 15, 2015 00:54, "Kevin Leduc" kevin@wikimedia.org wrote:
Well, that's not the case; https=1 was added for the apps, and so hit both mobile and text varnishes. Since all pageviews go through the text or mobile sources, all pageviews note (implicitly or explicitly) their https status.
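[Editor's note: since several of the proposed dimensions in this thread (https, zero, via/proxy) live in the X-Analytics header, here is a minimal parser sketch to picture the data involved. The semicolon-separated key=value format and the sample values are assumptions for illustration, not confirmed by this thread.]

```python
def parse_x_analytics(header):
    """Parse an X-Analytics-style header of semicolon-separated key=value pairs.
    Keys without '=' are skipped; whitespace around keys/values is trimmed."""
    fields = {}
    for part in header.split(";"):
        if "=" in part:
            key, _, value = part.partition("=")
            fields[key.strip()] = value.strip()
    return fields

# Hypothetical header with the fields discussed above.
example = parse_x_analytics("https=1;zero=123-45;proxy=Opera")
# example["https"] == "1", example.get("zero") == "123-45"
```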
Regarding the initial suggestion: no. Don't add new fields. Don't propose new fields. Don't do anything with new fields - freeze the definition already. We've had a pageviews definition for 6 months, we've had unreliability in Henrik's third-party service for 12, and every time there's a "should we add a new field?" proposal it slows down implementing an alternative. We need that alternative.
On 15 June 2015 at 00:57, Yuri Astrakhan yastrakhan@wikimedia.org wrote:
Two quick updates:
What Oliver said resonates with us; we are doing everything possible to focus and keep the project moving instead of satisfying all possible requirements at launch.
We have been working on our goals (not yet finalized) to include "Pageview API by September". There is quite a bit of puppetizing and productionizing to do, but we are removing more and more distractions, so I personally feel optimistic.
Also, Jon K and other internal folks, the intermediate aggregate is available for you to query and it's updated hourly.
On Mon, Jun 15, 2015 at 9:40 AM, Oliver Keyes okeyes@wikimedia.org wrote: