Hey folks,
We've been discussing ways to make more Wikimedia data public. One of our sources for data is EventLogging (EL)[1], a system that lets us track events on both the client and server-side. Recently, YuviPanda and springle have been working with us to figure out what issues need to be resolved in order to begin loading EL events that contain public data[2] into LabsDB for public consumption and for use in WikiMetrics.
It looks like there are three major concerns about directing EL to LabsDB: (1) there needs to be a good review process in place to make sure that the data we surface isn't sensitive, (2) https://bugzilla.wikimedia.org/show_bug.cgi?id=67450 will need to be addressed to make sure that we don't over-utilize labs infrastructure, and (3) we'll need signoff from legal.
It looks like (2) can be taken care of independently from (1) and (3). Is this bug already prioritized, and if not, could it be?
1. https://www.mediawiki.org/wiki/Extension:EventLogging
2. Eventually, we'll want a means to sanitize and surface events that contain sensitive information, but I'd argue that is a second step that we should address later since it will likely require more substantial technical work.
-Aaron
Aaron,
The bug does not have to do with making data public. It has to do with how data is inserted into EL from the consumers, so it deals with the 'system', not the 'data'. The raw data as inserted cannot be replicated directly to be made public, so whether inserts are more efficient does not affect the public/private discussion.
> (1) there needs to be a good review process in place to make sure that the data we surface isn't sensitive
There is a bunch of work involved in this item. For example: per our privacy policy, some of this data should be discarded after 90 days and currently it is not. Also, you are aware of the discussions under sanitization: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
Basically, to make EL data public it needs to be aggregated with a level of anonymization we think is acceptable. There is quite a bit of work in this regard; here are some bugs that were filed a while back:
https://bugzilla.wikimedia.org/show_bug.cgi?id=62978
https://bugzilla.wikimedia.org/show_bug.cgi?id=59832
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
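To make "efficiently batching inserts" concrete, here is a minimal sketch of the general idea. It is not the EventLogging consumer code: the table name, the columns, and the batch size are made up for illustration, and sqlite3 stands in for whatever database LabsDB actually runs.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (uuid, wiki, action). Not the real schema.
Event = Tuple[str, str, str]


def batches(events: Iterable[Event], size: int) -> Iterator[List[Event]]:
    """Yield lists of at most `size` events, preserving order."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def insert_batched(conn: sqlite3.Connection, events: Iterable[Event], size: int = 100) -> int:
    """Insert events one batch (and one transaction) at a time.

    Returns the number of batches written.
    """
    written = 0
    for batch in batches(events, size):
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO page_move (uuid, wiki, action) VALUES (?, ?, ?)",
                batch,
            )
        written += 1
    return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_move (uuid TEXT PRIMARY KEY, wiki TEXT, action TEXT)")
events = [(f"id-{i}", "enwiki", "move") for i in range(250)]
n_batches = insert_batched(conn, events, size=100)
n_rows = conn.execute("SELECT COUNT(*) FROM page_move").fetchone()[0]
```

With 250 events and a batch size of 100, this issues three transactions instead of 250 single-row inserts.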
I don't know what you are talking about re. 90-day purges. I'm talking about 100% public EventLogging events, e.g. https://meta.wikimedia.org/wiki/Schema:PageMove. Also, we do *not* need to purge EventLogging event data at 90 days. We need to purge PII at 90 days. We generally do not store PII in EventLogging events, but when we do, we organize 90-day purges, as we have recently for the anonymous editor experiments. If you are concerned about UserAgents, as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
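For illustration only, the distinction between purging whole events and purging just the PII at 90 days could look like the sketch below. Everything in it is an assumption: the `events` table, the `userAgent` and `clientIp` column names, and the `YYYYMMDDHHMMSS` string timestamps are hypothetical, and sqlite3 again stands in for the real store.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Sequence


def purge_pii(conn: sqlite3.Connection, now: datetime,
              pii_columns: Sequence[str], max_age_days: int = 90) -> int:
    """Set the given PII columns to NULL on rows older than `max_age_days`.

    The rows themselves are kept. `pii_columns` must come from trusted code,
    since column names are interpolated into the SQL. Returns the number of
    rows matched by the age filter.
    """
    # Fixed-width timestamp strings compare correctly as plain text.
    cutoff = (now - timedelta(days=max_age_days)).strftime("%Y%m%d%H%M%S")
    assignments = ", ".join(f"{col} = NULL" for col in pii_columns)
    with conn:  # one transaction for the whole purge
        cur = conn.execute(
            f"UPDATE events SET {assignments} WHERE timestamp < ?", (cutoff,)
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (uuid TEXT PRIMARY KEY, timestamp TEXT, userAgent TEXT, clientIp TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("old", "20140101000000", "UA-old", "hash-old"),
        ("new", "20140801000000", "UA-new", "hash-new"),
    ],
)
purged = purge_pii(conn, datetime(2014, 8, 13), ["userAgent", "clientIp"])
rows = conn.execute("SELECT uuid, userAgent, clientIp FROM events ORDER BY uuid").fetchall()
```

The old row keeps its uuid and timestamp but loses the two hypothetical PII columns; the recent row is untouched.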
Re. (1), we are already performing this review internally in order to determine what does and does not conform to the Data Retention Guidelines. It seems clear that a robust process could also identify non-sensitive Schemas that could be published in labs.
-Aaron
> Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
Ah, my mistake. Kevin can do prioritization as needed.
> If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
If you want to be so precise, sure, that is correct. Note that currently there is no distinction in storage between the event and the capsule; they are stored together in the same record. Capsule data is only identified by a prefix on the column name. It stands to reason that you would be interested in the capsule too, as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
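As a rough sketch of what "make fields public selectively" might mean when capsule and event data share one record: the `capsule_` prefix, the field names, and the sample record below are all hypothetical, chosen only to illustrate filtering by column-name prefix.

```python
from typing import Any, Dict, Set

# Hypothetical prefix; the real column naming may differ.
CAPSULE_PREFIX = "capsule_"


def public_view(record: Dict[str, Any], public_capsule_fields: Set[str],
                prefix: str = CAPSULE_PREFIX) -> Dict[str, Any]:
    """Return a copy of `record` without non-public capsule columns.

    Columns starting with `prefix` are kept only if the rest of their name is
    in `public_capsule_fields`. All other columns are treated as event fields
    and kept, on the assumption that the schema itself was reviewed as public.
    """
    out: Dict[str, Any] = {}
    for column, value in record.items():
        if column.startswith(prefix):
            if column[len(prefix):] in public_capsule_fields:
                out[column] = value
        else:
            out[column] = value
    return out


record = {
    "capsule_timestamp": "20140813153900",
    "capsule_wiki": "enwiki",
    "capsule_userAgent": "Mozilla/5.0",
    "pageId": 42,  # hypothetical event field
}
published = public_view(record, {"timestamp", "wiki"})
```

Here the timestamp and wiki survive, the user agent is dropped, and the event field passes through unchanged.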
Excellent. Kevin, can you work to get that bug[1] prioritized and let us know? I can start working with R&D on a proposal to bring to legal.
1. https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
> It stands to reason that you would be interested in the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
Fair enough. I think we can drop that one column from the capsule and be quite happy with the rest. No need to purge EventLogging.
-Aaron
Hi Aaron,
I was not planning on prioritizing any EventLogging work for the rest of this quarter. The analytics dev team has a goal to get an EEVS dashboard running, and I want to keep them focused; otherwise we will not reach this goal.
I'm tempted to ask what springle and YuviPanda can accomplish without the help of the analytics devs, but even that will imply discussions and distractions from our goals.
In September I am planning on looking at what goals we can set for the next quarter and at what we want to accomplish with EventLogging. I was going to prioritize it at that point.
OK. Sounds reasonable. Sorry to seem as though I am pushing on you & the devs. In fact, specifying that you won't have the bandwidth to even consider the bug until next quarter gives me the power to push on others.
:)
Thanks! -Aaron
Kevin, for what it's worth, I don't think the bug that Sean is asking about is that challenging. The relevant part we'd have to change is really just a few lines [1]. I respect your decision of course, but I just wanted to point out that this issue does drive towards some of our goals: we talked a bit about getting EventLogging data to be usable by Wikimetrics, and this is the first step.
[1] - https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging/4d917e1...
On Wed, Aug 13, 2014 at 4:19 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
OK. Sounds reasonable. Sorry to seem as though I am pushing on you & the devs. In fact, specifying that you won't have the bandwidth to even consider the bug until next quarter gives me the power to push on others.
:)
Thanks! -Aaron
On Wed, Aug 13, 2014 at 8:56 PM, Kevin Leduc kevin@wikimedia.org wrote:
Hi Aaron,
I was not planning on prioritizing any EventLogging work for the rest of this quarter. The analytics dev team has a goal to get an EEVS dashboard running and I want to keep them focused otherwise we will not reach this goal.
I'm tempted to ask what springle and YuviPanda can accomplish without the help of the analytics devs, but even that will imply discussions and distractions from our goals.
In September I am planning on looking at what goals we can set for the next quarter and look at what we want to accomplish with EventLogging. I was going to prioritize it at that point.
On Wed, Aug 13, 2014 at 10:28 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Excellent. Kevin, can you work to get that bug[1] prioritized and let us know? I can start working with R&D on a proposal to bring to legal.
> It stands to reason that you would be interested on the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
Fair enough. I think we can drop that one column from the capsule and be quite happy with the rest. No need to purge EventLogging.
-Aaron
On Wed, Aug 13, 2014 at 6:08 PM, Nuria Ruiz nuria@wikimedia.org wrote:
> Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
ah, my mistake. Kevin can do prioritization as needed.
> If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
If you want to be so precise, sure, that is correct. Note that currently there is no distinction in storage as to the event and the capsule, they are stored together in the same record. Capsule data is only identified by a prefix on the column name. It stands to reason that you would be interested on the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
On Wed, Aug 13, 2014 at 6:47 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
I don't know what you are talking about re. 90 day purges. I'm talking about 100% public Event logging events -- E.g. https://meta.wikimedia.org/wiki/Schema:PageMove Also, we do *not* need to purge EventLogging event data at 90 days. We need to purge PII at 90 days. We generally do not store PII in EventLogging events, but when we do, we organize 90 days purges as we have recently for the anonymous editor experiments. If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
Re. (1), we are already performing this review internally in order to determine what does and does not conform to the Data Retention Guidelines. It seems clear that a robust process could also identify non-sensitive Schemas that could be published in labs.
-Aaron
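For readers unfamiliar with what "efficiently batching inserts" means in practice, here is a minimal sketch. It is not the EventLogging consumer code: SQLite stands in for the real database, and the `events` table, its columns, and the `insert_events_batched` helper are all hypothetical names chosen for illustration.

```python
import sqlite3


def insert_events_batched(conn, events, batch_size=500):
    """Insert rows in batches rather than one statement and commit per event.

    `events` is a list of (schema_name, wiki, ts) tuples. The table layout
    is hypothetical and only serves to illustrate the batching idea.
    """
    sql = "INSERT INTO events (schema_name, wiki, ts) VALUES (?, ?, ?)"
    batches = 0
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        with conn:  # one transaction (and one commit) per batch
            conn.executemany(sql, batch)
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (schema_name TEXT, wiki TEXT, ts TEXT)")

# 1200 made-up events, written in batches of 500.
events = [("PageMove", "enwiki", "20140813%06d" % i) for i in range(1200)]
batches = insert_events_batched(conn, events, batch_size=500)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The point of the design is that the database sees a handful of transactions instead of one per event, which is the kind of load reduction the request from springle is about.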
On Wed, Aug 13, 2014 at 5:00 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Aaron,
> (2) https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
The bug does not have to do with making data public. It has to do with how data is inserted in to EL from the consumers, so it deals with the 'system', not the 'data'. The raw data as inserted cannot be replicated directly to be made public so whether inserts are more efficient does not affect the public/private discussion.
> (1) there needs to be a good review process in place to make sure that the data we surface isn't sensitive
There is a bunch of work involved on this item. For example: per our privacy policy some of this data should be discarded after 90 days and currently it is not. Also, you are aware of the discussions under sanitization: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
Basically to make EL data public it needs to be aggregated with a level of anonymization we think is acceptable. There is quite a bit of work on this regard, here are some bugs that were filed a while back:
https://bugzilla.wikimedia.org/show_bug.cgi?id=62978
https://bugzilla.wikimedia.org/show_bug.cgi?id=59832
(expanding on what I think Dan is referring to re: goals), addressing this issue would allow EEVS to access data needed to generate breakdowns for metrics by method/target site (mobile, desktop, apps).
On Aug 13, 2014, at 1:40 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Kevin, for what it's worth I don't think that bug that Sean is asking for is that challenging. The relevant part we'd have to change is really just a few lines [1]. I respect your decision of course, but I just wanted to point out that this issue does drive towards some of our goals, as we talked a bit about getting EventLogging data to be usable by Wikimetrics, and this is the first step.
[1] - https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging/4d917e1...
On Wed, Aug 13, 2014 at 4:19 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
OK. Sounds reasonable. Sorry to seem as though I am pushing on you & the devs. In fact, specifying that you won't have the bandwidth to even consider the bug until next quarter gives me the power to push on others. >:)
Thanks! -Aaron
On Wed, Aug 13, 2014 at 8:56 PM, Kevin Leduc kevin@wikimedia.org wrote:
Hi Aaron,
I was not planning on prioritizing any EventLogging work for the rest of this quarter. The analytics dev team has a goal to get an EEVS dashboard running and I want to keep them focused otherwise we will not reach this goal.
I'm tempted to ask what springle and YuviPanda can accomplish without the help of the analytics devs, but even that will imply discussions and distractions from our goals.
In September I am planning on looking at what goals we can set for the next quarter and look at what we want to accomplish with EventLogging. I was going to prioritize it at that point.
On Wed, Aug 13, 2014 at 10:28 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Excellent. Kevin, can you work to get that bug[1] prioritized and let us know? I can start working with R&D on a proposal to bring to legal.
> It stands to reason that you would be interested on the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
Fair enough. I think we can drop that one column from the capsule and be quite happy with the rest. No need to purge EventLogging.
-Aaron
On Wed, Aug 13, 2014 at 6:08 PM, Nuria Ruiz nuria@wikimedia.org wrote:
> Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
ah, my mistake. Kevin can do prioritization as needed.
> If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
If you want to be so precise, sure, that is correct. Note that currently there is no distinction in storage as to the event and the capsule, they are stored together in the same record. Capsule data is only identified by a prefix on the column name. It stands to reason that you would be interested on the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
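To illustrate the point about making fields public selectively, here is a minimal sketch. It assumes capsule columns carry a `capsule_` prefix and that the withheld column is a user-agent field; both names are hypothetical, since the thread does not spell out the actual prefix or column name.

```python
# Hypothetical prefix: the thread only says capsule data is identified
# by "a prefix on the column name", without naming it.
CAPSULE_PREFIX = "capsule_"

# Hypothetical name for the one capsule column that would be withheld.
WITHHELD_CAPSULE_FIELDS = {"capsule_userAgent"}


def split_columns(columns):
    """Separate capsule columns from event columns by their name prefix."""
    capsule = [c for c in columns if c.startswith(CAPSULE_PREFIX)]
    event = [c for c in columns if not c.startswith(CAPSULE_PREFIX)]
    return capsule, event


def public_columns(columns):
    """Keep every event column plus the capsule columns that are not withheld."""
    capsule, event = split_columns(columns)
    kept_capsule = [c for c in capsule if c not in WITHHELD_CAPSULE_FIELDS]
    return kept_capsule + event


def publish_row(row):
    """Project a stored record (a dict) onto its publishable columns."""
    return {name: row[name] for name in public_columns(list(row))}


# A made-up record in which capsule and event fields share one row.
row = {
    "capsule_timestamp": "20140813180800",
    "capsule_wiki": "enwiki",
    "capsule_userAgent": "Mozilla/5.0",
    "pageId": 42,
    "newTitle": "Example",
}
published = publish_row(row)
```

Because capsule and event data live in the same record, the selection has to happen per column rather than per table, which is what the prefix makes possible.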
On Wed, Aug 13, 2014 at 6:47 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
I don't know what you are talking about re. 90 day purges. I'm talking about 100% public Event logging events -- E.g. https://meta.wikimedia.org/wiki/Schema:PageMove Also, we do *not* need to purge EventLogging event data at 90 days. We need to purge PII at 90 days. We generally do not store PII in EventLogging events, but when we do, we organize 90 days purges as we have recently for the anonymous editor experiments. If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
Re. (1), we are already performing this review internally in order to determine what does and does not conform to the Data Retention Guidelines. It seems clear that a robust process could also identify non-sensitive Schemas that could be published in labs.
-Aaron
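The distinction drawn above, purging PII at 90 days while keeping the event data itself, can be sketched as follows. This is an illustration only, not the purge procedure actually used: SQLite stands in for the real database, and the `events` table, the `user_agent` column, and the `purge_pii` helper are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 90


def purge_pii(conn, now):
    """Blank the (hypothetical) PII column on rows older than the retention window.

    The rows themselves are kept; only the sensitive field is cleared.
    Returns the number of rows that were changed.
    """
    cutoff = (now - timedelta(days=RETENTION_DAYS)).strftime("%Y-%m-%d %H:%M:%S")
    with conn:
        cur = conn.execute(
            "UPDATE events SET user_agent = NULL "
            "WHERE ts < ? AND user_agent IS NOT NULL",
            (cutoff,),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (schema_name TEXT, ts TEXT, user_agent TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("PageMove", "2014-05-05 00:00:00", "Mozilla/5.0"),  # 100 days old
        ("PageMove", "2014-08-03 00:00:00", "Mozilla/5.0"),  # 10 days old
    ],
)

purged = purge_pii(conn, now=datetime(2014, 8, 13))
remaining = conn.execute(
    "SELECT user_agent FROM events ORDER BY ts"
).fetchall()
event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Note that no event row is deleted: the older record survives with its sensitive field cleared, which matches the claim that the event data does not need to be purged.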
On Wed, Aug 13, 2014 at 5:00 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Aaron,
The bug does not have to do with making data public. It has to do with how data is inserted in to EL from the consumers, so it deals with the 'system', not the 'data'. The raw data as inserted cannot be replicated directly to be made public so whether inserts are more efficient does not affect the public/private discussion.
> (1) there needs to be a good review process in place to make sure that the data we surface isn't sensitive
There is a bunch of work involved on this item. For example: per our privacy policy some of this data should be discarded after 90 days and currently it is not. Also, you are aware of the discussions under sanitization: https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
Basically to make EL data public it needs to be aggregated with a level of anonymization we think is acceptable. There is quite a bit of work on this regard, here are some bugs that were filed a while back:
https://bugzilla.wikimedia.org/show_bug.cgi?id=62978
https://bugzilla.wikimedia.org/show_bug.cgi?id=59832
On Wed, Aug 13, 2014 at 3:39 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Hey folks,
We've been discussing ways to make more Wikimedia data public. One of our sources for data is EventLogging (EL)[1], a system that lets us track events on both the client and server-side. Recently, YuviPanda and springle have been working with us to figure out what issues need to be resolved in order to begin loading EL events that contain public data[2] into LabsDB for public consumption and for use in WikiMetrics.
It looks like there are three major concerns about directing EL to LabsDB. (1) there needs to be a good review process in place to make sure that the data we surface isn't sensitive, (2) https://bugzilla.wikimedia.org/show_bug.cgi?id=67450 will need to be addressed to make sure that we don't over-utilize labs infrastructure and (3) we'll need signoff from legal.
It looks like (2) can be taken care of independently from (1) and (3). Is this bug already prioritized, and if not, could it be?
1. https://www.mediawiki.org/wiki/Extension:EventLogging
2. Eventually, we'll want a means to sanitize and surface events that contain sensitive information, but I'd argue that is a second step that we should address later since it will likely require more substantial technical work.
-Aaron
Yes, getting EL data into labs would support longer term EEVS goals, and I'm trying to focus on EEVS features we can release this quarter.
On Wed, Aug 13, 2014 at 3:56 PM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
(expanding on what I think Dan is referring to re: goals), addressing this issue would allow EEVS to access data needed to generate breakdowns for metrics by method/target site (mobile, desktop, apps).
On Aug 13, 2014, at 1:40 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Kevin, for what it's worth I don't think that bug that Sean is asking for is that challenging. The relevant part we'd have to change is really just a few lines [1]. I respect your decision of course, but I just wanted to point out that this issue does drive towards some of our goals, as we talked a bit about getting EventLogging data to be usable by Wikimetrics, and this is the first step.
[1] - https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging/4d917e1...
On Wed, Aug 13, 2014 at 4:19 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
OK. Sounds reasonable. Sorry to seem as though I am pushing on you & the devs. In fact, specifying that you won't have the bandwidth to even consider the bug until next quarter gives me the power to push on others.
:)
Thanks! -Aaron
On Wed, Aug 13, 2014 at 8:56 PM, Kevin Leduc kevin@wikimedia.org wrote:
Hi Aaron,
I was not planning on prioritizing any EventLogging work for the rest of this quarter. The analytics dev team has a goal to get an EEVS dashboard running and I want to keep them focused otherwise we will not reach this goal.
I'm tempted to ask what springle and YuviPanda can accomplish without the help of the analytics devs, but even that will imply discussions and distractions from our goals.
In September I am planning on looking at what goals we can set for the next quarter and look at what we want to accomplish with EventLogging. I was going to prioritize it at that point.
On Wed, Aug 13, 2014 at 10:28 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Excellent. Kevin, can you work to get that bug[1] prioritized and let us know? I can start working with R&D on a proposal to bring to legal.
> It stands to reason that you would be interested on the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
Fair enough. I think we can drop that one column from the capsule and be quite happy with the rest. No need to purge EventLogging.
-Aaron
On Wed, Aug 13, 2014 at 6:08 PM, Nuria Ruiz nuria@wikimedia.org wrote:
> Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
ah, my mistake. Kevin can do prioritization as needed.
> If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
If you want to be so precise, sure, that is correct. Note that currently there is no distinction in storage as to the event and the capsule, they are stored together in the same record. Capsule data is only identified by a prefix on the column name. It stands to reason that you would be interested on the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
On Wed, Aug 13, 2014 at 6:47 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Re. (2), I didn't say anything about that being related to public/private. This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently. That bug is about efficiently batching inserts.
I don't know what you are talking about re. 90 day purges. I'm talking about 100% public Event logging events -- E.g. https://meta.wikimedia.org/wiki/Schema:PageMove Also, we do *not* need to purge EventLogging event data at 90 days. We need to purge PII at 90 days. We generally do not store PII in EventLogging events, but when we do, we organize 90 days purges as we have recently for the anonymous editor experiments. If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
Re. (1), we are already performing this review internally in order to determine what does and does not conform to the Data Retention Guidelines. It seems clear that a robust process could also identify non-sensitive Schemas that could be published in labs.
-Aaron
I second Kevin: making EL data public in a solid (not ad hoc) fashion is a project that requires some work, and we should take it up after we have completed our current projects. We do not need EL data to implement most of the Editor Vital Signs metrics.
On Aug 14, 2014, at 2:05 AM, Kevin Leduc kevin@wikimedia.org wrote:
Yes, getting EL data into labs would support longer term EEVS goals, and I'm trying to focus on EEVS features we can release this quarter.
On Wed, Aug 13, 2014 at 3:56 PM, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
(expanding on what I think Dan is referring to re: goals), addressing this issue would allow EEVS to access data needed to generate breakdowns for metrics by method/target site (mobile, desktop, apps).
On Aug 13, 2014, at 1:40 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Kevin, for what it's worth I don't think that bug that Sean is asking for is that challenging. The relevant part we'd have to change is really just a few lines [1]. I respect your decision of course, but I just wanted to point out that this issue does drive towards some of our goals, as we talked a bit about getting EventLogging data to be usable by Wikimetrics, and this is the first step.
[1] - https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging/4d917e1...
On Wed, Aug 13, 2014 at 4:19 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
OK. Sounds reasonable. Sorry to seem as though I am pushing on you & the devs. In fact, specifying that you won't have the bandwidth to even consider the bug until next quarter gives me the power to push on others. >:)
Thanks! -Aaron
On Wed, Aug 13, 2014 at 8:56 PM, Kevin Leduc kevin@wikimedia.org wrote:
Hi Aaron,
I was not planning on prioritizing any EventLogging work for the rest of this quarter. The analytics dev team has a goal to get an EEVS dashboard running and I want to keep them focused otherwise we will not reach this goal.
I'm tempted to ask what springle and YuviPanda can accomplish without the help of the analytics devs, but even that will imply discussions and distractions from our goals.
In September I am planning on looking at what goals we can set for the next quarter and look at what we want to accomplish with EventLogging. I was going to prioritize it at that point.
On Wed, Aug 13, 2014 at 10:28 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Excellent. Kevin, can you work to get that bug[1] prioritized and let us know? I can start working with R&D on a proposal to bring to legal.
> It stands to reason that you would be interested on the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
Fair enough. I think we can drop that one column from the capsule and be quite happy with the rest. No need to purge EventLogging.
-Aaron