On Wed, Oct 18, 2017 at 8:01 PM, Platonides <platonides(a)gmail.com> wrote:
> > It is not a happy thing for us to force anyone to change their
> > software, but as explained in the wiki page [0] we cannot find a
> > reliable method to ensure that the same user-created tables are
> > available on all three of the new backend servers.
>
> Why not provide a fourth host and have those three servers act as
> slaves of it? Writes go to the first one, but reads and joins can go
> to the replicas.
That scenario would make the single master host a single point of
failure: all writes to the co-located tables would have to go through
it, so taking that one host down for maintenance would halt all
ability to write to the tables.
There is actually an open discussion about specially "curated"
datasets that will likely work like this
(<https://phabricator.wikimedia.org/T173511#3556395>). The details are
still a work in progress, but the current rough idea is that
tables/databases to be replicated will need to meet certain structural
restrictions to enable robust replication, and will need well defined
owners who are responsible for assisting in responding to issues
related to their tables. Another likely restriction will be that the tables
must be populated by some sort of batch/bulk loading that can be
reproduced as needed if the master server has to be rebuilt from
scratch.
> I'm afraid that "make the JOINs in user space" may end up in some
> cases with the app fetching all the rows of the tool table or the
> wiki replica in order to perform a JOIN that used to be
> straightforward. And the tools aren't really the place to implement
> all kinds of partial-fetch implementations and query estimates
> (without EXPLAIN, even!). Unless you know of such an efficient user
> space JOIN implementation, perhaps?
I do not have a universal user space JOIN algorithm that I'm hiding,
but there are case-by-case efficient replacements for many direct JOIN
queries; it all comes down to the specific details of what data is
needed from each table. For a straight join the solution may be
interleaved fetches from two open cursors. For a join used to limit
results from another table the solution may be batched queries using a
WHERE x IN (...) constraint.
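To make those two patterns concrete, here is a minimal sketch of both
strategies. It uses two sqlite3 in-memory databases as stand-ins for the
tool database and the wiki replica (which can no longer be joined
server-side); the table and column names (watched, page, page_id) are
invented for illustration, not anything from the real schemas.

```python
import sqlite3

# Two separate connections simulate the physically separated servers.
tooldb = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")

tooldb.execute("CREATE TABLE watched (page_id INTEGER PRIMARY KEY, note TEXT)")
tooldb.executemany("INSERT INTO watched VALUES (?, ?)",
                   [(1, "check"), (3, "fix"), (7, "review")])

replica.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT)")
replica.executemany("INSERT INTO page VALUES (?, ?)",
                    [(i, "Page_%d" % i) for i in range(1, 10)])

def batched_in_join(batch_size=2):
    """Join by reading keys from one side in batches and querying the
    other side with a WHERE page_id IN (...) constraint, instead of a
    cross-server JOIN."""
    results = []
    cur = tooldb.execute("SELECT page_id, note FROM watched ORDER BY page_id")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        ids = [row[0] for row in batch]
        placeholders = ",".join("?" * len(ids))
        titles = dict(replica.execute(
            "SELECT page_id, title FROM page WHERE page_id IN (%s)"
            % placeholders, ids))
        for page_id, note in batch:
            if page_id in titles:
                results.append((page_id, titles[page_id], note))
    return results

def interleaved_merge_join():
    """Straight join via interleaved fetches from two open cursors,
    both ordered by the join key: advance whichever side is behind
    (a classic sort-merge join)."""
    left = tooldb.execute("SELECT page_id, note FROM watched ORDER BY page_id")
    right = replica.execute("SELECT page_id, title FROM page ORDER BY page_id")
    l, r = left.fetchone(), right.fetchone()
    out = []
    while l and r:
        if l[0] == r[0]:
            out.append((l[0], r[1], l[1]))
            l, r = left.fetchone(), right.fetchone()
        elif l[0] < r[0]:
            l = left.fetchone()
        else:
            r = right.fetchone()
    return out
```

Both functions return the same joined rows; the batched IN approach is
the one to reach for when the join is only limiting results from the
other table, while the merge approach suits a straight one-to-one join
over a shared ordered key.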
Working with physically sharded databases is not a completely new
constraint in computer science. I'm confident that reasonable
solutions can be found for most use cases. I would personally be happy
to help people try to find workable solutions on this mailing list, in
Phabricator tasks, or even on IRC. I think that there are others in
our community of developers who would be willing to help out as well.
Maybe we can all collaborate to document some common (and uncommon)
solutions on a wikitech help page.
Again, I want to acknowledge that the removal of this long-standing
feature is not the ideal outcome. I do not enjoy making decisions that
make things more difficult for our technical community. I don't want
to close the door on looking for better solutions, but the hard
reality is that labsdb1001 and labsdb1003 are dying. We need to
get all of the traffic off of them as soon as we can. If either of
them has a hardware failure we will not be able to restore them. This
rushed timeline is not ideal, but at the same time I do not want to
lock our new servers into unsustainable configurations to alleviate a
short term pain.
PS. I don't mean to "well actually" your EXPLAIN statement, but on
the MariaDB servers EXPLAIN is available in a slightly modified form
(<https://wikitech.wikimedia.org/wiki/Help:MySQL_queries#Optimizing_queries>)
that works with the intervening view layer and without requiring
elevated user privileges.
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Cloud Services Boise, ID USA
irc: bd808 v:415.839.6885 x6855