why did this happen?
Martin
See https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_server...
I do appreciate that the ops team is working to improve reliability and performance of the database access. Unfortunately it seems to me that there is a disconnect between ops and tool devs. I wonder if the ops actually looked at how many user databases have been created and how frequently they got accessed (all that info should be readily available to them). The logs would also have told the ops which users relied on user DBs on the project DB servers. A direct email ahead of time would have gone a long way. The phabricator post contains the same language I've heard many times before: The tools devs shouldn't have used the feature anyways. To that I say, well, we still did and it worked great. Volunteer developers have a limited time budget with which they create tools that large numbers of users (editors and readers alike) rely on. That is just the reality of things, and it is not the ideal op fantasy, I know. The ops seem to be in an asymmetric position of power here. It sure sounds a lot like a take it or leave it situation to me.
On Sat, Dec 23, 2017 at 12:13 PM John phoenixoverride@gmail.com wrote:
why did this happen?
Martin
See https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_server...
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
To play devil's advocate here, they are dedicating and maintaining a service that we get free of charge. Other people have to pay a premium for the service we get.
Cyberpower678 English Wikipedia Account Creation Team English Wikipedia Administrator Global User Renamer
Hoi, That is the most awful argument possible. It follows that there is no reason to provide quality functionality in MediaWiki because it is free as well. SHAME on you for going this low. Thanks, GerardM
On 24 December 2017 at 03:13, Maximilian Doerr maximilian.doerr@gmail.com wrote:
To play devil's advocate here, they are dedicating and maintaining a service that we get free of charge. Other people have to pay a premium for the service we get.
On Sat, Dec 23, 2017 at 5:28 PM, Daniel Schwen lists@schwen.de wrote:
I do appreciate that the ops team is working to improve reliability and performance of the database access. Unfortunately it seems to me that there is a disconnect between ops and tool devs. I wonder if the ops actually looked at how many user databases have been created and how frequently they got accessed (all that info should be readily available to them). The logs would also have told the ops which users relied on user DBs on the project DB servers. A direct email ahead of time would have gone a long way.
As noted previously in this thread, the breaking change was first announced in the blog post about the new Wiki Replica servers (https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/) on 2017-09-25. The TL;DR and a link to the blog post were also sent to labs-announce (now cloud-announce) at that time: https://lists.wikimedia.org/pipermail/labs-announce/2017-September/000256.html
Following that "soft" announcement:
* I built a tool at https://tools.wmflabs.org/tool-db-usage/ to show all of the tool-owned databases that would be affected by the change.
* I created a page on wikitech describing the timeline and impact and providing a link to the tool: https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown
* The timeline was announced on the cloud-announce mailing list on 2017-10-19: https://lists.wikimedia.org/pipermail/cloud-announce/2017-October/000005.html
* MassMessage was used to notify the maintainers of tools that Nick Wilson and I could identify via their wikitech talk pages. Example at https://wikitech.wikimedia.org/w/index.php?title=User_talk:Andrew_Bogott&diff=1775669&oldid=1773948
I tried pretty hard here to make sure that tool maintainers who were going to be affected had months of notice. Obviously this notice did not reach everyone, and for that I am sorry. Making announcements to 1500 users is difficult. The cloud-announce mailing list is really the best way that we as administrators have to reach out to people about sweeping changes like this. We can't force anyone to subscribe or to read the messages, however.
The phabricator post contains the same language I've heard many times before: The tools devs shouldn't have used the feature anyways. To that I say, well, we still did and it worked great.
I may be missing it, but I do not see anywhere on https://phabricator.wikimedia.org/T156869 that any of the participants chastised the tool developers for using the feature. If I did say something that was taken that way, I apologize.
Volunteer developers have a limited time budget with which they create tools that large numbers of users (editors and readers alike) rely on. That is just the reality of things, and it is not the ideal op fantasy, I know.
Tool developers use the features they are given to build incredible things. They do this work as volunteers in time that is borrowed from the rest of their lives (school, work, family, editing the wikis, etc). The Cloud Services and DBA teams are *very* aware of this and very grateful for the good works that come from these precious investments. I have spent the last two years of my employment at the Foundation seeking to raise awareness of these good works and to find more resources to help the people who are doing them.
The ops seem to be in an asymmetric position of power here. It sure sounds a lot like a take it or leave it situation to me.
Yes, there is an asymmetry. A very small number of us have to make decisions that affect larger numbers. This is true with the Wiki Replicas; it is true with Cloud Services more generally; it is true with on-wiki content creators vs readers. In all of these cases the few attempt to act in the broader best interest of the many. We try to have consultations with representatives of the groups that we are acting on behalf of. We try to use good judgment and past experience to make better decisions tomorrow than we made yesterday. We hope that the positive impacts of our work outweigh the negative impacts. Whether we succeed or fail in these attempts can be a matter of personal opinion. Not everyone will be pleased by every change; this is unfortunate but true.
In this very specific case, I made the final call to cease looking for a technological advance that would allow us to keep the feature of user-managed databases co-located with replicated data from the production environment. I did this after much more extensive consultation with my team and the Foundation's DBAs than is reflected in T156869. This had been a topic of internal discussion since the beginning of the project to build a new Wiki Replica cluster. In the end, I felt that the barriers to freely re-routing database query traffic were too large, and the benefits of that freedom too great, to recreate the prior un-replicated table situation on the new cluster. The blog post mentions many of these benefits.
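For tools that joined their own tables against replica tables on the same server, one possible workaround is to perform the join in the tool itself. The following is only a minimal sketch under stated assumptions, not a description of how any existing tool works: two in-memory SQLite databases stand in for the separate servers, and the `tool_notes` table, the `join_in_application` helper, and all sample rows are hypothetical.

```python
import sqlite3

# Two separate in-memory databases stand in for two separate servers.
replica = sqlite3.connect(":memory:")  # stand-in for a Wiki Replica host
tooldb = sqlite3.connect(":memory:")   # stand-in for a user database host

replica.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
replica.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(1, "Boise"), (2, "Idaho_Falls"), (3, "Replication")],
)

# Hypothetical tool-owned table; the name and columns are invented for illustration.
tooldb.execute("CREATE TABLE tool_notes (page_id INTEGER, note TEXT)")
tooldb.executemany(
    "INSERT INTO tool_notes VALUES (?, ?)",
    [(1, "needs image"), (3, "checked"), (99, "page no longer exists")],
)


def join_in_application(tooldb, replica, chunk_size=500):
    """Emulate `tool_notes JOIN page USING (page_id)` across two connections."""
    notes = tooldb.execute(
        "SELECT page_id, note FROM tool_notes ORDER BY page_id"
    ).fetchall()
    ids = sorted({page_id for page_id, _ in notes})
    titles = {}
    # Query the replica in chunks so the IN (...) list stays bounded.
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        rows = replica.execute(
            f"SELECT page_id, page_title FROM page WHERE page_id IN ({placeholders})",
            chunk,
        )
        titles.update(rows)  # each row is a (page_id, page_title) pair
    # Inner-join semantics: drop notes whose page is missing on the replica.
    return [(titles[pid], note) for pid, note in notes if pid in titles]


result = join_in_application(tooldb, replica)
```

This trades one server-side join for two round trips plus some memory in the tool, so it is likely to suit small user tables better than large ones.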
We are still hoping to find a partial solution (https://phabricator.wikimedia.org/T173511) for replicating some non-canonical data to the new cluster. Work on that task has stagnated, but I hope to restart it soon. I think that Jaime has most of a solution in mind at this point which just needs the final details to be worked out before we can begin to implement it. This will not be a 100% solution for all tools, but it will provide some relief.
I know that my responses here will not fix broken tools. I know that tool maintainers experience some amount of fatigue and frustration caused by each new change added to the environment that they are using to build and deliver their solutions. I do hope, however, that these responses restore some measure of WP:AGF for the work of the Cloud Services team, the DBA team, and others who are trying every day to make Toolforge and Cloud Services a better place for developing and operating volunteer-created technology.
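Bryan
Bryan Davis Wikimedia Foundation bd808@wikimedia.org [[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA irc: bd808 v:415.839.6885 x6855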
On a somewhat related note, CentralAuth.labsdb is offline. Any info that can be provided about that? An important bot task is unable to start because of not being able to connect to it.
Cyberpower678 English Wikipedia Account Creation Team English Wikipedia Administrator Global User Renamer
On Sat, Dec 23, 2017 at 9:36 PM, Maximilian Doerr maximilian.doerr@gmail.com wrote:
On a somewhat related note, CentralAuth.labsdb is offline. Any info that can be provided about that? An important bot task is unable to start because of not being able to connect to it.
Filed as https://phabricator.wikimedia.org/T183651.
I can fix this quickly for centralauth.{analytics,web}.db.svc.eqiad.wmflabs. It will take a bit longer to get centralauth.labsdb created as it requires a Puppet change that I will need help from a prod root to deploy.
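The brace shorthand above stands for two service names. As a small illustration only, the helper below expands the pattern; the function name `service_names` and its parameters are hypothetical, and the suffix is copied from the text above.

```python
def service_names(database, clusters=("analytics", "web"),
                  suffix="db.svc.eqiad.wmflabs"):
    """Expand `<database>.{analytics,web}.<suffix>` into full service names."""
    return [f"{database}.{cluster}.{suffix}" for cluster in clusters]


names = service_names("centralauth")
for name in names:
    print(name)
```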
Bryan
Bryan, thank you for your thoughtful response. It seems I have done you some injustice. I appreciate your due diligence, such as the tool you wrote to list database usage. However, especially in light of this tool, I have trouble understanding how the decision could have been made to break all the tools that depend on server-side joins with user tables. I'm sure there is a lot of personal opinion going into this, and obviously my perspective and priorities as a tool developer are very different from yours as an op. For me it genuinely looked like everything had been working fine before the change, and now my biggest tool is destroyed. Hard to see the upside here. I'm probably guilty of having my head stuck in the sand, as somehow I have been oblivious to your multiple communication channels. But in my defense, surely you have a way to associate user accounts with emails? So was there no way to leave a server up where the joins are still possible? It wouldn't have to have the same uptime guarantees...
Cheers,
Daniel
Idaho Falls, ID USA
On Wed, Dec 27, 2017 at 11:06 PM, Martin Domdey animalia@gmx.net wrote:
Hi!
There are replags reported again: Replag reported by heartbeat_p
wikireplica-analytics.eqiad.wmnet (s1,s3,s5) - lag time: 08:02:14
Madhu is checking with the DBA team to see if this is expected due to some maintenance script running or a sign of trouble somewhere in the replication pipeline.
c1.labsdb,c3.labsdb - lag time: 334:32:18
c1 and c3 are in read-only mode with replication disabled. These service names point to the labsdb1001 and labsdb1003 servers which are scheduled to be turned off and removed from service on 2018-01-03 [0].
[0]: https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown
Bryan
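To put the two reported lag times side by side, the figures can be converted into durations. This is a rough sketch that assumes the lag strings are in hours:minutes:seconds form with an hour field that may exceed 24; the `parse_lag` helper is hypothetical and not part of any replag tool.

```python
from datetime import timedelta


def parse_lag(text):
    """Parse an 'HHH:MM:SS' lag string; the hour field may exceed 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


analytics_lag = parse_lag("08:02:14")
c1_c3_lag = parse_lag("334:32:18")

print(f"analytics: {analytics_lag.total_seconds() / 3600:.1f} hours")
print(f"c1/c3: {c1_c3_lag.total_seconds() / 86400:.1f} days")
```

Under that assumption the c1/c3 figure works out to roughly two weeks, which fits servers whose replication has been switched off rather than a transient delay.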