Bryan,
thank you for your thoughtful response. It seems I have done you some injustice. I appreciate your due diligence such as the tool you wrote to list database usage.
However especially in light of this tool I have trouble understanding how the decision could have been made to break all the tools that depend on server-side joins with user tables. I'm sure there is a lot of personal opinion going into this, and obviously my perspective and priorities as a tool developer are very different from yours as an op. For me it genuinely looked like everything had been working fine before the change and now my biggest tool is destroyed. Hard to see the upside here.
I'm probably guilty of having my head stuck in the sand, as somehow I have been oblivious to your multiple communication channels. But in my defense surely you have a way to associate user accounts with emails?
So was there no way to leave a server up where the joins are still possible? It wouldn't have to have the same uptime guarantees...
Cheers,
Daniel
Idaho Falls, ID USA

On Sat, Dec 23, 2017 at 8:07 PM Bryan Davis <bd808@wikimedia.org> wrote:
On Sat, Dec 23, 2017 at 5:28 PM, Daniel Schwen <lists@schwen.de> wrote:
> I do appreciate that the ops team is working to improve reliability and
> performance of the database access. Unfortunately it seems to me that there
> is a disconnect between ops and tool devs. I wonder if the ops actually
> looked at how many user databases have been created and how frequently they
> got accessed (all that info should be readily available to them). The logs
> would also have told the ops which users relied on user DBs on the project
> DB servers. A direct email ahead of time would have gone a long way.

As noted previously in this thread, the breaking change was first
announced in the blog post about the new Wiki Replica servers
(<https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/>)
on 2017-09-25. The TL;DR and a link to the blog post were also sent to
labs-announce (now cloud-announce) at that time:
<https://lists.wikimedia.org/pipermail/labs-announce/2017-September/000256.html>

Following that "soft" announcement:
* I built a tool at <https://tools.wmflabs.org/tool-db-usage/> to show
all of the tool-owned databases that would be affected by the change.
* I created a page on wikitech describing the timeline and impact and
providing a link to the tool:
<https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown>
* The timeline was announced on the cloud-announce mailing list on
2017-10-19: <https://lists.wikimedia.org/pipermail/cloud-announce/2017-October/000005.html>
* MassMessage was used to notify the maintainers of tools that Nick
Wilson and I could identify via their wikitech talk pages. Example at
<https://wikitech.wikimedia.org/w/index.php?title=User_talk:Andrew_Bogott&diff=1775669&oldid=1773948>

I tried pretty hard here to make sure that tool maintainers who were
going to be affected had months of notice. Obviously this notice did
not reach everyone and for that I am sorry. Making announcements to
1500 users is difficult. The cloud-announce mailing list is really the
best way that we as administrators have to reach out to people about
sweeping changes like this. We can't force anyone to subscribe or to
read the messages, however.

> The phabricator post contains the same language I've heard many times
> before: The tools devs shouldn't have used the feature anyways. To that I
> say, well, we still did and it worked great.

I may be missing it, but I do not see anywhere on
<https://phabricator.wikimedia.org/T156869> that any of the
participants chastised the tool developers for using the feature. If I
did say something that was taken that way, I apologize.

> Volunteer developers have a
> limited time budget with which they create tools that large numbers of users
> (editors and readers alike) rely on. That is just the reality of things, and
> it is not the ideal op fantasy, I know.

Tool developers use the features they are given to build incredible
things. They do this work as volunteers in time that is borrowed from
the rest of their lives (school, work, family, editing the wikis,
etc). The Cloud Services and DBA teams are *very* aware of this and
very grateful for the good works that come from these precious
investments. I have spent the last two years of my employment at the
Foundation seeking to raise awareness of these good works and to find
more resources to help the people who are doing them.

> The ops seem to be in an asymmetric
> position of power here. It sure sounds a lot like a take it or leave it
> situation to me.

Yes, there is an asymmetry. A very small number of us have to make
decisions that affect larger numbers. This is true with the Wiki
Replicas; it is true with Cloud Services more generally; it is true
with on-wiki content creators vs readers. In all of these cases the
few attempt to act in the broader best interest of the many. We try to
have consultations with representatives of the groups that we are
acting on behalf of. We try to use good judgment and past experience
to make better decisions tomorrow than we made yesterday. We hope that
the positive impacts of our works outweigh the negative impacts.
Whether we succeed or fail in these attempts can be a matter of
personal opinion. Not everyone will be pleased by every change; this
is unfortunate but true.

In this very specific case, I made the final call to cease looking for
a technological advance that would allow us to keep the feature of
user managed databases co-located with replicated data from the
production environment. I did this after much more extensive
consultation with my team and the Foundation's DBAs than is reflected
in T156869. This had been a topic of internal discussion since the
beginning of the project to build a new Wiki Replica cluster. In the
end, I felt that the barriers to freely re-routing database query
traffic were too large, and the benefits of that freedom too great, to
recreate the prior un-replicated table situation on the new cluster.
The blog post mentions many of these benefits.

We are still hoping to find a partial solution
(<https://phabricator.wikimedia.org/T173511>) for replicating some
non-canonical data to the new cluster. Work on that task has
stagnated, but I hope to restart it soon. I think that Jaime has most
of a solution in mind at this point which just needs the final details
to be worked out before we can begin to implement it. This will not be
a 100% solution for all tools, but it will provide some relief.


I know that my responses here will not fix broken tools. I know that
tool maintainers experience some amount of fatigue and frustration
caused by each new change added to the environment that they are using
to build and deliver their solutions. I do hope however that they
restore some measure of WP:AGF for the work of the Cloud Services
team, the DBA team, and others who are trying every day to make
Toolforge and Cloud Services a better place for developing and
operating volunteer created technology.

Bryan
--
Bryan Davis              Wikimedia Foundation    <bd808@wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services          Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855

_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud