Excellent. Kevin, can you work to get that bug[1] prioritized and let us
know? I can start working with R&D on a proposal to bring to legal.
1.
holds the timestamp and wiki project the event applies to, but I imagine we
can make fields public selectively.
Fair enough. I think we can drop that one column from the capsule and be
quite happy with the rest. No need to purge EventLogging.
-Aaron
On Wed, Aug 13, 2014 at 6:08 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
Re. (2), I didn't say anything about that being related to public/private.
This is a request from springle -- that if we are going to start pushing
Events to LabsDB, he'd like us to do so more efficiently. That bug is about
efficiently batching inserts.
ah, my mistake. Kevin can do prioritization as needed.
If you are concerned about UserAgents as the sanitization page you linked to
suggests, then we should talk about the EventLogging capsule, not the event.
If you want to be so precise, sure, that is correct. Note that currently
there is no distinction in storage between the event and the capsule; they
are stored together in the same record. Capsule data is only identified by
a prefix on the column name. It stands to reason that you would be
interested in the capsule too, as it holds the timestamp and wiki project
the event applies to, but I imagine we can make fields public selectively.
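
To illustrate what publishing fields selectively could look like, here is a minimal Python sketch. The `capsule_` prefix, the allowlist, and the sample row are all assumptions made up for this example; the thread only says that capsule columns share some prefix. It also assumes the schema's own event fields have already been reviewed as public.

```python
# Hypothetical prefix: the thread only says capsule columns share a prefix.
CAPSULE_PREFIX = "capsule_"

# Hypothetical allowlist of capsule fields considered safe to publish.
PUBLIC_CAPSULE_FIELDS = {"timestamp", "wiki"}


def split_record(record, prefix=CAPSULE_PREFIX):
    """Split one stored row into (capsule, event) dicts by column prefix."""
    capsule, event = {}, {}
    for column, value in record.items():
        if column.startswith(prefix):
            # Strip the prefix so capsule fields are keyed by their bare name.
            capsule[column[len(prefix):]] = value
        else:
            event[column] = value
    return capsule, event


def public_view(record, public_capsule_fields=PUBLIC_CAPSULE_FIELDS):
    """Keep all event fields plus only the allowlisted capsule fields."""
    capsule, event = split_record(record)
    kept = {k: v for k, v in capsule.items() if k in public_capsule_fields}
    return {"capsule": kept, "event": event}


# Made-up sample row; the user agent column is dropped from the public view.
row = {
    "capsule_timestamp": "20140813180800",
    "capsule_wiki": "enwiki",
    "capsule_userAgent": "ExampleBrowser/1.0",
    "pageId": 42,
}
print(public_view(row))
```

An allowlist (rather than a blocklist) means a capsule field added later stays private until someone explicitly reviews it.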
On Wed, Aug 13, 2014 at 6:47 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
wrote:
Re. (2), I didn't say anything about that being related to public/private.
This is a request from springle -- that if we are going to start pushing
Events to LabsDB, he'd like us to do so more efficiently. That bug is about
efficiently batching inserts.
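
For readers unfamiliar with the idea, batching inserts generally means grouping many rows into one statement and one transaction instead of committing row by row. The sketch below is only an illustration of that general technique, not the fix proposed in the bug: it uses an in-memory SQLite database as a stand-in, and the `events` table, its columns, and the batch size are hypothetical.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def insert_events(conn, events, batch_size=500):
    """Insert events in batches, one transaction per batch."""
    batches = 0
    for chunk in batched(events, batch_size):
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO events (wiki, page_id) VALUES (?, ?)", chunk
            )
        batches += 1
    return batches


# Stand-in database and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (wiki TEXT, page_id INTEGER)")
events = [("enwiki", i) for i in range(1200)]
n_batches = insert_events(conn, events, batch_size=500)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The batch size trades memory and latency against the number of round trips; the right value would depend on the real database and load.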
I don't know what you are talking about re. 90-day purges. I'm talking
about 100% public EventLogging events -- e.g.
https://meta.wikimedia.org/wiki/Schema:PageMove
Also, we do *not* need to purge EventLogging event data at 90 days. We need
to purge PII at 90 days. We generally do not store PII in EventLogging
events, but when we do, we organize 90-day purges, as we have recently for
the anonymous editor experiments. If you are concerned about UserAgents as
the sanitization page you linked to suggests, then we should talk about the
EventLogging capsule, not the event.
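
The distinction between purging event data and purging only PII can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual purge process: the list of PII fields, the record layout, and the dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)

# Hypothetical list of PII columns; the real list would come from review.
PII_FIELDS = ("userAgent", "clientIp")


def purge_pii(records, now, pii_fields=PII_FIELDS, retention=RETENTION):
    """Return copies of records with PII fields blanked once past retention.

    The rest of each record is kept, so the event data itself survives.
    """
    cutoff = now - retention
    purged = []
    for record in records:
        cleaned = dict(record)  # copy, so the input is not modified
        if cleaned["timestamp"] < cutoff:
            for field in pii_fields:
                if field in cleaned:
                    cleaned[field] = None
        purged.append(cleaned)
    return purged


# Made-up records: one older than 90 days, one recent.
now = datetime(2014, 8, 13, tzinfo=timezone.utc)
records = [
    {"timestamp": now - timedelta(days=120),
     "userAgent": "ExampleBrowser/1.0", "pageId": 1},
    {"timestamp": now - timedelta(days=10),
     "userAgent": "ExampleBrowser/1.0", "pageId": 2},
]
result = purge_pii(records, now)
```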
Re. (1), we are already performing this review internally in order to
determine what does and does not conform to the Data Retention Guidelines.
It seems clear that a robust process could also identify non-sensitive
Schemas that could be published in labs.
-Aaron
On Wed, Aug 13, 2014 at 5:00 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
Aaron,
The bug does not have to do with making data public. It has to do with how
data is inserted into EL from the consumers, so it deals with the 'system',
not the 'data'. The raw data as inserted cannot be replicated directly to be
made public, so whether inserts are more efficient does not affect the
public/private discussion.
(1) there needs to be a good review process in place to make sure that the
data we surface isn't sensitive
There is a bunch of work involved in this item. For example: per our
privacy policy, some of this data should be discarded after 90 days and
currently it is not. Also, you are aware of the discussions under
sanitization:
https://www.mediawiki.org/wiki/EventLogging/UserAgentSanitization
Basically, to make EL data public it needs to be aggregated with a level
of anonymization we think is acceptable. There is quite a bit of work in
this regard; here are some bugs that were filed a while back:
https://bugzilla.wikimedia.org/show_bug.cgi?id=62978
https://bugzilla.wikimedia.org/show_bug.cgi?id=59832
On Wed, Aug 13, 2014 at 3:39 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
wrote:
Hey folks,
We've been discussing ways to make more Wikimedia data public. One of
our sources for data is EventLogging (EL)[1], a system that lets us track
events on both the client and server-side. Recently, YuviPanda and
springle have been working with us to figure out what issues need to be
resolved in order to begin loading EL events that contain public data[2]
into LabsDB for public consumption and for use in WikiMetrics.
It looks like there are three major concerns about directing EL to
LabsDB. (1) there needs to be a good review process in place to make sure
that the data we surface isn't sensitive, (2)
https://bugzilla.wikimedia.org/show_bug.cgi?id=67450 will need to be
addressed to make sure that we don't over-utilize labs infrastructure and
(3) we'll need signoff from legal.
It looks like (2) can be taken care of independently from (1) and (3).
Is this bug already prioritized, and if not, could it be?
1. https://www.mediawiki.org/wiki/Extension:EventLogging
2. Eventually, we'll want a means to sanitize and surface events that
contain sensitive information, but I'd argue that is a second step that we
should address later since it will likely require more substantial
technical work.
-Aaron
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics