Kevin, for what it's worth, I don't think the bug fix that Sean is asking for is that challenging.  The relevant part we'd have to change is really just a few lines [1].  I respect your decision, of course, but I wanted to point out that this issue does drive towards some of our goals: we talked a bit about getting EventLogging data to be usable by Wikimetrics, and this is the first step.


[1] - https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FEventLogging/4d917e1594e6f09784ab0e0bffccc144f87a11b3/server%2Feventlogging%2Fjrm.py#L167


On Wed, Aug 13, 2014 at 4:19 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
OK.  Sounds reasonable.  Sorry to seem as though I am pushing on you & the devs.  In fact, specifying that you won't have the bandwidth to even consider the bug until next quarter gives me the power to push on others.  >:)

Thanks!
-Aaron


On Wed, Aug 13, 2014 at 8:56 PM, Kevin Leduc <kevin@wikimedia.org> wrote:
Hi Aaron,

I was not planning on prioritizing any EventLogging work for the rest of this quarter.  The analytics dev team has a goal to get an EEVS dashboard running, and I want to keep them focused; otherwise we will not reach this goal.

I'm tempted to ask what springle and YuviPanda can accomplish without the help of the analytics devs, but even that will imply discussions and distractions from our goals.

In September I am planning on looking at what goals we can set for the next quarter and at what we want to accomplish with EventLogging.  I was going to prioritize it at that point.




On Wed, Aug 13, 2014 at 10:28 AM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Excellent.  Kevin, can you work to get that bug[1] prioritized and let us know?   I can start working with R&D on a proposal to bring to legal.  


> It stands to reason that you would be interested in the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.

Fair enough.  I think we can drop that one column from the capsule and be quite happy with the rest.  No need to purge EventLogging.   

-Aaron


On Wed, Aug 13, 2014 at 6:08 PM, Nuria Ruiz <nuria@wikimedia.org> wrote:
> Re. (2), I didn't say anything about that being related to public/private.
> This is a request from springle -- that if we are going to start pushing
> Events to LabsDB, he'd like us to do so more efficiently.  That bug is about efficiently batching inserts.
Ah, my mistake. Kevin can do prioritization as needed.

> If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event.
If you want to be so precise, sure, that is correct. Note that currently there is no distinction in storage between the event and the capsule; they are stored together in the same record. Capsule data is only identified by a prefix on the column name. It stands to reason that you would be interested in the capsule too as it holds the timestamp and wiki project the event applies to, but I imagine we can make fields public selectively.
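
To illustrate the column-prefix point, here is a minimal Python sketch of publishing only some fields of a stored record. The record layout, the `capsule_` prefix, the column names, the sample values, and the `public_view` helper are all hypothetical and chosen for illustration; none of them is taken from the EventLogging code.

```python
# Hypothetical layout: capsule columns carry a prefix, event columns do not.
CAPSULE_PREFIX = "capsule_"  # assumed prefix, for illustration only


def public_view(record, hidden_capsule_fields):
    """Return a copy of `record` without the capsule columns listed as hidden."""
    hidden = {CAPSULE_PREFIX + name for name in hidden_capsule_fields}
    return {col: val for col, val in record.items() if col not in hidden}


# Made-up sample record: three capsule columns and one event column.
record = {
    "capsule_timestamp": "20140813160800",
    "capsule_wiki": "enwiki",
    "capsule_userAgent": "ExampleBrowser/1.0",
    "pageId": 42,
}

# Drop only the user agent; timestamp, wiki and event fields stay.
print(public_view(record, {"userAgent"}))
```

Because the capsule and the event share one record, hiding a field is a per-column decision in this sketch; the rest of the row can be kept as it is.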





On Wed, Aug 13, 2014 at 6:47 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Re. (2), I didn't say anything about that being related to public/private.  This is a request from springle -- that if we are going to start pushing Events to LabsDB, he'd like us to do so more efficiently.  That bug is about efficiently batching inserts. 
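
For readers unfamiliar with the term, batching inserts means buffering several events and writing them in one statement and one transaction, rather than issuing one INSERT and one commit per event. A minimal sketch using Python's standard `sqlite3` module follows. The table, the columns, and the `BatchInserter` class are hypothetical; they do not reflect the EventLogging consumer code or the change the bug proposes.

```python
import sqlite3


class BatchInserter:
    """Buffer rows and flush them with a single executemany() call."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0  # number of batches written so far

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with self.conn:  # one transaction (one commit) per batch
            self.conn.executemany(
                "INSERT INTO events (wiki, page_id) VALUES (?, ?)", self.buffer
            )
        self.buffer = []
        self.flushes += 1


# Hypothetical table, in memory, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (wiki TEXT, page_id INTEGER)")

inserter = BatchInserter(conn, batch_size=3)
for page_id in range(7):
    inserter.add(("enwiki", page_id))
inserter.flush()  # write the final partial batch
```

The trade-off in a design like this is latency against load: a larger batch size means fewer round trips to the database, but events wait longer in memory before they are written.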

I don't know what you are talking about re. 90 day purges.  I'm talking about 100% public EventLogging events, e.g. https://meta.wikimedia.org/wiki/Schema:PageMove   Also, we do *not* need to purge EventLogging event data at 90 days.  We need to purge PII at 90 days.  We generally do not store PII in EventLogging events, but when we do, we organize 90-day purges, as we have recently for the anonymous editor experiments.  If you are concerned about UserAgents as the sanitization page you linked to suggests, then we should talk about the EventLogging capsule, not the event. 

Re. (1), we are already performing this review internally in order to determine what does and does not conform to the Data Retention Guidelines.  It seems clear that a robust process could also identify non-sensitive Schemas that could be published in labs.

-Aaron


On Wed, Aug 13, 2014 at 5:00 PM, Nuria Ruiz <nuria@wikimedia.org> wrote:
Aaron, 

The bug does not have to do with making data public. It has to do with how data is inserted into EL from the consumers, so it deals with the 'system', not the 'data'. The raw data as inserted cannot be replicated directly to be made public, so whether inserts are more efficient does not affect the public/private discussion.


>(1) there needs to be a good review process in place to make sure that the data we surface isn't sensitive
There is a bunch of work involved in this item. For example: per our privacy policy some of this data should be discarded after 90 days and currently it is not. Also, you are aware of the discussions under sanitization: 

Basically, to make EL data public it needs to be aggregated with a level of anonymization we think is acceptable. There is quite a bit of work in this regard; here are some bugs that were filed a while back:









On Wed, Aug 13, 2014 at 3:39 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
Hey folks,

We've been discussing ways to make more Wikimedia data public.  One of our sources for data is EventLogging (EL)[1], a system that lets us track events on both the client and server-side.  Recently, YuviPanda and springle have been working with us to figure out what issues need to be resolved in order to begin loading EL events that contain public data[2] into LabsDB for public consumption and for use in WikiMetrics.

It looks like there are three major concerns about directing EL to LabsDB.  (1) there needs to be a good review process in place to make sure that the data we surface isn't sensitive, (2) https://bugzilla.wikimedia.org/show_bug.cgi?id=67450 will need to be addressed to make sure that we don't over-utilize labs infrastructure and (3) we'll need signoff from legal. 

It looks like (2) can be taken care of independently from (1) and (3).  Is this bug already prioritized, and if not, could it be?

2. Eventually, we'll want a means to sanitize and surface events that contain sensitive information, but I'd argue that is a second step that we should address later since it will likely require more substantial technical work.

-Aaron


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


