On Sat, Dec 23, 2017 at 5:28 PM, Daniel Schwen lists@schwen.de wrote:
I do appreciate that the ops team is working to improve reliability and performance of the database access. Unfortunately it seems to me that there is a disconnect between ops and tool devs. I wonder if the ops actually looked at how many user databases have been created and how frequently they got accessed (all that info should be readily available to them). The logs would also have told the ops which users relied on user DBs on the project DB servers. A direct email ahead of time would have gone a long way.
As noted previously in this thread, the breaking change was first announced in the blog post about the new Wiki Replica servers (https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/) on 2017-09-25. The TL;DR and a link to the blog post were also sent to labs-announce (now cloud-announce) at that time: https://lists.wikimedia.org/pipermail/labs-announce/2017-September/000256.html
Following that "soft" announcement:
* I built a tool at https://tools.wmflabs.org/tool-db-usage/ to show all of the tool-owned databases that would be affected by the change.
* I created a page on wikitech describing the timeline and impact and providing a link to the tool: https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown
* The timeline was announced on the cloud-announce mailing list on 2017-10-19: https://lists.wikimedia.org/pipermail/cloud-announce/2017-October/000005.html
* MassMessage was used to notify the maintainers of tools that Nick Wilson and I could identify via their wikitech talk pages. Example at https://wikitech.wikimedia.org/w/index.php?title=User_talk:Andrew_Bogott&diff=1775669&oldid=1773948
I tried pretty hard here to make sure that tool maintainers who were going to be affected had months of notice. Obviously this notice did not reach everyone, and for that I am sorry. Making announcements to 1500 users is difficult. The cloud-announce mailing list is really the best way that we as administrators have to reach out to people about sweeping changes like this. We can't force anyone to subscribe or to read the messages, however.
The phabricator post contains the same language I've heard many times before: The tools devs shouldn't have used the feature anyways. To that I say, well, we still did and it worked great.
I may be missing it, but I do not see anywhere on https://phabricator.wikimedia.org/T156869 that any of the participants chastised the tool developers for using the feature. If I did say something that was taken that way, I apologize.
Volunteer developers have a limited time budget with which they create tools that large numbers of users (editors and readers alike) rely on. That is just the reality of things, and it is not the ideal op fantasy, I know.
Tool developers use the features they are given to build incredible things. They do this work as volunteers in time that is borrowed from the rest of their lives (school, work, family, editing the wikis, etc). The Cloud Services and DBA teams are *very* aware of this and very grateful for the good works that come from these precious investments. I have spent the last two years of my employment at the Foundation seeking to raise awareness of these good works and to find more resources to help the people who are doing them.
The ops seem to be in an asymmetric position of power here. It sure sounds a lot like a take it or leave it situation to me.
Yes, there is an asymmetry. A very small number of us have to make decisions that affect larger numbers. This is true with the Wiki Replicas; it is true with Cloud Services more generally; it is true with on-wiki content creators vs readers. In all of these cases the few attempt to act in the broader best interest of the many. We try to have consultations with representatives of the groups that we are acting on behalf of. We try to use good judgment and past experience to make better decisions tomorrow than we made yesterday. We hope that the positive impacts of our work outweigh the negative impacts. Whether we succeed or fail in these attempts can be a matter of personal opinion. Not everyone will be pleased by every change; this is unfortunate but true.
In this very specific case, I made the final call to cease looking for a technological advance that would allow us to keep the feature of user-managed databases co-located with replicated data from the production environment. I did this after much more extensive consultation with my team and the Foundation's DBAs than is reflected in T156869. This had been a topic of internal discussion since the beginning of the project to build a new Wiki Replica cluster. In the end, I felt that the barriers to freely re-routing database query traffic were too large, and the benefits of that freedom too great, to recreate the prior un-replicated table situation on the new cluster. The blog post mentions many of these benefits.
We are still hoping to find a partial solution (https://phabricator.wikimedia.org/T173511) for replicating some non-canonical data to the new cluster. Work on that task has stagnated, but I hope to restart it soon. I think that Jaime has most of a solution in mind at this point which just needs the final details to be worked out before we can begin to implement it. This will not be a 100% solution for all tools, but it will provide some relief.
I know that my responses here will not fix broken tools. I know that tool maintainers experience some amount of fatigue and frustration caused by each new change added to the environment that they are using to build and deliver their solutions. I do hope, however, that they restore some measure of WP:AGF for the work of the Cloud Services team, the DBA team, and others who are trying every day to make Toolforge and Cloud Services a better place for developing and operating volunteer-created technology.
Bryan