Sean:
Could you explain a little bit why the following bug affects EL data going public (for the schemas that have public data and can be made public more easily than others)?
https://bugzilla.wikimedia.org/show_bug.cgi?id=67450
Thanks,
Nuria
Hi Nuria,
On Fri, Aug 22, 2014 at 2:06 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Sean:
Could you explain a little bit why the following bug affects EL data going public (for the schemas that have public data and can be made public more easily than others)?
Performance.
The labs servers must replicate multiple clusters like analytics-store does, plus they must deal with large volumes of unpredictable load from labs users. Given that researchers with decent SQL can cause replag on analytics-store, imagine what a legion of labs users with poor SQL can achieve :-)
While I realize that only certain EL tables will be replicated, the following are the important points from Ops POV:
- Requesting this bug fix up front will improve both labsdb and analytics-store. It is likely that the relevant EL tables will have to replicate to labs /via/ analytics-store, so the effect will be two-fold (just as the penalty would be).
- Even though only specific tables will be replicated, that filtering necessarily occurs on the target slave servers and not on the master. This means the entire load from EL replication will hit labs or sanitarium servers, even though only the public EL tables will be processed and exposed.
- Optimization requests like bug 67450 are easy to put off because everyone (understandably) wants to get on with exciting new projects. In a well-meaning and respectful way, this is an opportunity to stonewall and request that performance be addressed now.
BR Sean
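The second point above (filtering happens on the replica, not on the master) can be illustrated with a toy model. This is only a sketch in Python, not the actual replication code: `RowEvent`, `apply_on_replica`, and the table name `PrivateSchema_2000` are hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowEvent:
    """One replicated row change (hypothetical, heavily simplified)."""
    table: str
    row: tuple


def apply_on_replica(stream, public_tables):
    """Count every event the replica receives, but apply only public ones."""
    received = 0
    applied = []
    for event in stream:
        # The replica has to read each event before it can decide to skip it.
        received += 1
        if event.table in public_tables:
            applied.append(event)
    return received, applied


stream = [
    RowEvent("NavigationTiming_1000", (1,)),
    RowEvent("PrivateSchema_2000", (2,)),
    RowEvent("PrivateSchema_2000", (3,)),
]
received, applied = apply_on_replica(stream, {"NavigationTiming_1000"})
print(received, len(applied))
```

In this model the replica receives all three events and applies only one, which is the sense in which the entire EL replication load reaches the filtering server regardless of how few tables end up public.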
While I realize that only certain EL tables will be replicated,
Not only that, only "some" fields from "some" tables will be replicated. Also, as you know, EL creates new tables on demand, so the number of tables to replicate changes dynamically. Do these points affect the request in any way?
In a well-meaning and respectful way, this is an opportunity to stonewall and request performance is addressed now.
Do not worry. I am in the "performance is crucial" camp.
On Fri, Aug 22, 2014 at 4:05 PM, Nuria Ruiz nuria@wikimedia.org wrote:
While I realize that only certain EL tables will be replicated,
Not only that, only "some" fields from "some" tables will be replicated. Also as you know EL creates new tables on demand so the number of tables to replicate is dynamically changing. Do these points affect the request in any way?
My points all still apply regardless.
The filtering of specific fields is the task of the sanitarium servers which sit between production and labsdb. We use a combination of replication rules to filter tables and triggers to null out sensitive fields, then SQL views and specific permissions to hide some fields entirely.
Dynamically added EL tables make it all a little more complex, but that is not a problem technically, since I understand each new EL table would require Legal OK first, just like MediaWiki tables do. There is (I hope) no expectation of instant or automated changes to the replication setup.
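The trigger-plus-view part of the sanitarium approach described above can be sketched roughly as follows. This is an illustration only: SQLite (via Python's `sqlite3` module) stands in for the real database servers, the column names `page`, `load_ms` and `client_ip` are invented, and the trigger and view names are hypothetical. Replication rules that filter whole tables are not modelled here.

```python
import sqlite3


def build_sanitized_db():
    """Create an in-memory database with a nulling trigger and a public view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE NavigationTiming_1000 (
            id INTEGER PRIMARY KEY,
            page TEXT,
            load_ms INTEGER,
            client_ip TEXT          -- assumed sensitive field
        );

        -- Null out the sensitive field as soon as a row arrives.
        CREATE TRIGGER null_client_ip
        AFTER INSERT ON NavigationTiming_1000
        BEGIN
            UPDATE NavigationTiming_1000
            SET client_ip = NULL
            WHERE id = NEW.id;
        END;

        -- Expose only the columns meant to be public.
        CREATE VIEW NavigationTiming_1000_public AS
            SELECT id, page, load_ms FROM NavigationTiming_1000;
        """
    )
    return conn


conn = build_sanitized_db()
conn.execute(
    "INSERT INTO NavigationTiming_1000 (page, load_ms, client_ip) VALUES (?, ?, ?)",
    ("Main_Page", 120, "203.0.113.7"),
)
stored_ip = conn.execute("SELECT client_ip FROM NavigationTiming_1000").fetchone()[0]
public_row = conn.execute("SELECT * FROM NavigationTiming_1000_public").fetchone()
print(stored_ip, public_row)
```

The two layers are deliberately redundant in this sketch: the trigger removes the value from storage, and the view hides the column from users who are only granted access to the view.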
Dynamically added EL tables make it all a little more complex, but not a problem technically, since I understand each new EL table would require Legal OK first, just like mediawiki tables do.
Ok, for the sake of clarity, let's make sure we are talking about the same thing: EL tables are created per schema version. Say table NavigationTiming_1000 is being replicated. A schema version change happens, and there is now a new table NavigationTiming_1001 that has a couple of additional fields but is otherwise identical to the prior table.
Will the replication of this second table be automatic or will it need manual intervention?
On Friday, August 22, 2014, Nuria Ruiz nuria@wikimedia.org wrote:
Dynamically added EL tables make it all a little more complex, but not a problem technically, since I understand each new EL table would require Legal OK first, just like mediawiki tables do.
Ok, for the sake of clarity, let's make sure we are talking about the same thing: EL tables are created per schema version. Say table NavigationTiming_1000 is being replicated. A schema version change happens, and there is now a new table NavigationTiming_1001 that has a couple of additional fields but is otherwise identical to the prior table.
Will the replication of this second table be automatic or will it need manual intervention?
Should be manual or else we'd be doing something wrong. The new fields could be anything including private data.
On Fri, Aug 22, 2014 at 7:35 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Ok, for the sake of clarity, let's make sure we are talking about the same thing: EL tables are created per schema version. Say table NavigationTiming_1000 is being replicated. A schema version change happens, and there is now a new table NavigationTiming_1001 that has a couple of additional fields but is otherwise identical to the prior table.
Will the replication of this second table be automatic or will it need manual intervention?
The second table will not replicate without manual intervention. It would require at least input from Legal plus an Ops ticket, and there would be a delay of at least several days once approved.
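The difference between manual approval and automatic pickup of a new schema revision can be sketched as exact-name matching versus wildcard matching. This is a hypothetical Python illustration, not the actual replication configuration: both function names and the `NavigationTiming_*` pattern are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def tables_by_exact_approval(existing_tables, approved_tables):
    """Replicate only tables whose exact name has been signed off."""
    return sorted(set(existing_tables) & set(approved_tables))


def tables_by_wildcard(existing_tables, pattern):
    """Replicate anything matching a pattern, including future revisions."""
    return sorted(t for t in existing_tables if fnmatchcase(t, pattern))


existing = ["NavigationTiming_1000", "NavigationTiming_1001"]
approved = {"NavigationTiming_1000"}

print(tables_by_exact_approval(existing, approved))
print(tables_by_wildcard(existing, "NavigationTiming_*"))
```

With exact names, NavigationTiming_1001 stays out until someone adds it after review. A wildcard rule would pick it up immediately, new fields included, which is the situation the manual step is meant to prevent.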
At first I was worried, but the public data schemas are highly stable. Most of the schemas I have in mind have never switched revids after they started writing data.
-Aaron
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics