Hi all,
The proposed dates and times for the reboots are:
labsdb1001: Monday Oct 30 2017, 14:30 UTC
labsdb1003: Tuesday Nov 07 2017, 14:30 UTC
These servers are very old, and may run into hardware issues during the
reboots, so we are not defining an outage window. I'll keep the list
updated during and after the maintenance.
As always, feel free to reach out with any questions or concerns here or on
#wikimedia-cloud, or on task here -
.
Best,
On Thu, Oct 19, 2017 at 12:44 AM, Jaime Crespo <jcrespo(a)wikimedia.org>
wrote:
Why not
provide a fourth host and have those three servers act as slaves
of it?
While that is a reasonable way to do it- that is exactly how we do it in
production; that is almost impossible to do it for all users.
First, blind read-write split doesn't work; it needs to be curated for
each application, and we do not have control over when reads and writes
happen; so in the case there is lag, most applications will break. Now I
will explain why always there would be lag-
Second, it creates a SPOF on replication, for each individual application.
While 90% of the data services users do reasonable queries and data
structure; a minority do extremely fast imports, create tables on MyISAM
format (which corrupts on every crash) or do other things that would make
the replication lag or break. **That means, because a single user does
heavy writes, all users, including of simple production replica reads would
be affected**. Also, being able to create tables and perform writes on the
wikireplicas caused lots of complains from other users as they often locked
the wiki data, causing lag. As Bryan says, we are open to other methods
(e.g. we could pregenerate/import summary tables from production), but
those should be puppetized and automatized and checked for sanity (InnoDB,
no blocking/breakage of replication, etc.). Creating summary tables through
replication means people can share that data instead of it being replicated
multiple times. Those are not theoretical, we had to blacklist some
databases from replication on toolsdb because they were causing similar
replication issues:
https://phabricator.wikimedia.org/T127164
Another thing we could do is to experiment with FEDERATED/CONECTX tables
(make virtual tables from wikireplicas available from toolsdb), something
that was done in the past, but it has the same problem of userspace joins-
its performance is not perfect.
It is important to note that to take this decision, we checked existing
users of user tables on wikireplicas and saw that a) the number of users of
local databases was much lower than the total users b) in most cases, they
were just used for summary tables INSERT...SELECT (which by the way, was
the main cause of lag on the replica servers), and is trivial to migrate to
SELECT -> user space -> INSERT c) in those cases where actual JOINs are
used, which we believe to be very few, we are open for solutions like the
ones commented above.
Moving to tools db will mean more available resources in most cases, and a
better availability for both the replicas and the user dbs, as if the
master hardware crashes, there is a passive replica (unlike on local
tables).
On Thu, Oct 19, 2017 at 4:01 AM, Platonides <platonides(a)gmail.com> wrote:
It is not
a happy thing for us to force anyone to change their
software, but as explained in the wiki page [0] we can not find a
reliable method to ensure that the same user created tables are
available on all three of the new backend servers
Why not provide a fourth host and have those three servers act as slaves
of it?
Writes go to the first one, but reads and joins can go to the replicas.
I'm afraid that "make the JOINs in user space" may end up on some
cases with the app fetching all the rows of the tool table or the wiki
replica in order to perform a JOIN that used to be straightforward.
And the tools aren't really the place to implement all kinds of
partial-fetch implementations and query estimates (without EXPLAIN,
even!). Unless you know of such an efficient user space JOIN
implementation, perhaps?
Regards
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
--
Jaime Crespo
<http://wikimedia.org>
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud