We are changing EventLogging to write events to m2 instead of db1047. The migration will take up to 12 hours (but probably less). Also, we may end up with gaps in the data written to the database throughout this period.
We will reply to this thread once the migration is complete.
Update:
The migration is in progress. Small correction from before:
"We are changing EventLogging to write events to m2 instead of db1047."
Correction: m1 instead of m2, and the new data will be written to db1048.
Currently we're cautiously optimistic that we haven't dropped any data. However, for now, db1047 is not receiving any new data from EventLogging. All the new data is going into db1048. The old data is being loaded into db1048 as well, and that load should complete over the next day or so. Once it is done, we'll replicate from db1048 to db1047 and that should be it. We'll keep updating this thread.
Is this expected to slow stuff down on db1047 - or are the queries I'm running just horribly inefficient? ;)
On Thu, Apr 24, 2014 at 2:40 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Update: The migration is in progress. [...]
Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering
I had a quick clarification discussion with Ori; here's a summary:
• Nothing in the migration should affect data consumers, other than a short lag in the availability of EL data and possibly a short downtime for db1047 (which will make the enwiki slave and the prod/staging DBs temporarily unavailable).
• db1048 will be the production DB for EventLogging data. EL data will be replicated from there to db1047, so that slow queries on db1047 no longer affect the storage of EL data (the worst that can happen is replication lag from db1048). Only ops will have access to db1048.
Dan/Ori/Sean: please add anything that I might have missed
Dario
On Apr 24, 2014, at 3:30 PM, Maryana Pinchuk mpinchuk@wikimedia.org wrote:
Is this expected to slow stuff down on db1047 – or are the queries I'm running just horribly inefficient? ;)
-- Maryana Pinchuk, Product Manager, Wikimedia Foundation, wikimediafoundation.org
Is this expected to slow stuff down on db1047 - or are the queries I'm running just horribly inefficient? ;)
What Dario said is all accurate. But to answer your question directly: yes, db1047 should be a little slow right now, maybe a lot slow, because Sean is copying all the data from it to db1048. The copy is expected to finish within single-digit hours, counting from a couple of hours ago, so it should be fine soon.
An update on this. Basically, db1047, which is what everyone's querying to get EventLogging data, is still catching up. The data there, at least for some schemas, is lagging and seems to have no new entries since "2014-04-27 12:10:00". The data that EventLogging captured since then is safe and sound on db1048, which is where db1047 is trying to replicate from. We will update this bug when the replication is all caught up:
https://bugzilla.wikimedia.org/show_bug.cgi?id=64445
So add yourselves to that bug and watch for it being Fixed. Until then, dashboards and other things relying on this data will be out of date, but at least no longer erroring.
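For anyone who wants to gauge how far behind a replica is, here is a minimal sketch. It assumes you have already fetched the newest event timestamp for a schema yourself (for example with a MAX() query over that schema's timestamp column) and that it is formatted like the value quoted above. The function names, the one-hour threshold, and the reference time in the usage line are illustrative assumptions, not part of EventLogging.

```python
from datetime import datetime

# Assumed format, matching the timestamp quoted in this thread.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def replication_lag_hours(latest_event: str, reference_time: str) -> float:
    """Return the hours between the newest event seen and a reference time."""
    newest = datetime.strptime(latest_event, TIMESTAMP_FORMAT)
    reference = datetime.strptime(reference_time, TIMESTAMP_FORMAT)
    return (reference - newest).total_seconds() / 3600.0


def is_stale(latest_event: str, reference_time: str,
             max_lag_hours: float = 1.0) -> bool:
    """True when the newest event is older than the allowed lag (hypothetical threshold)."""
    return replication_lag_hours(latest_event, reference_time) > max_lag_hours


if __name__ == "__main__":
    # The reference time here is an arbitrary example, not a measured value.
    lag = replication_lag_hours("2014-04-27 12:10:00", "2014-04-30 17:54:00")
    print(f"newest event is {lag:.1f} hours old")
```

Both arguments must be in the same time zone for the result to mean anything; the sketch does no zone conversion.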
And, of course, thanks very much to Sean who is working crazy hours to get this working. Sean - beverage of choice sir, anytime.
On Wed, Apr 30, 2014 at 5:54 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
An update on this. [...]
On Thu, May 1, 2014 at 7:54 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
And, of course, thanks very much to Sean who is working crazy hours to get this working. Sean - beverage of choice sir, anytime.
Not so much "working", just short bursts of work followed by a lot of waiting :-) Databases. Bah!
Note that eventlogging replication has already caught up on the One Box, analytics-store.eqiad.wmnet; any tools that use the 'research' user and read-only queries could consider switching to analytics-store to give db1047 time to recover.
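A tool that can read from either host could make that switch automatically. The sketch below assumes the tool already has a lag figure (in hours) for each host from some check of its own. The function name, the threshold, and the lag numbers in the usage line are hypothetical and only illustrate the idea.

```python
from typing import Mapping, Sequence


def pick_host(lag_hours_by_host: Mapping[str, float],
              preferred_order: Sequence[str],
              max_lag_hours: float = 1.0) -> str:
    """Pick the first preferred host whose lag is acceptable.

    If every host is lagging, fall back to the least-lagged one.
    Raises ValueError if no lag figures are supplied.
    """
    if not lag_hours_by_host:
        raise ValueError("no hosts to choose from")
    for host in preferred_order:
        lag = lag_hours_by_host.get(host)
        if lag is not None and lag <= max_lag_hours:
            return host
    # Nothing is fresh enough: use whichever host is closest to caught up.
    return min(lag_hours_by_host, key=lambda h: lag_hours_by_host[h])


if __name__ == "__main__":
    # Made-up lag values for illustration only.
    lags = {"db1047": 77.7, "analytics-store.eqiad.wmnet": 0.0}
    print(pick_host(lags, ["db1047", "analytics-store.eqiad.wmnet"]))
```

Keeping db1047 first in the preferred order means the tool would go back to it on its own once replication has caught up.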
On Wed, Apr 30, 2014 at 7:24 PM, Sean Pringle springle@wikimedia.org wrote:
Note that eventlogging replication has already caught up on the One Box, analytics-store.eqiad.wmnet [...]
I finished loading the data logged during the migration process into the database. There should be no gaps at all. db1047 has not fully caught up yet, but it is getting closer: I see it picking up events from the 29th.