On Sun, Sep 30, 2012 at 2:32 PM, Erik Moeller <erik(a)wikimedia.org> wrote:
> On Sun, Sep 30, 2012 at 10:24 AM, Krinkle
> <krinklemail(a)gmail.com> wrote:
>> I just wanted to make sure you
>> know that there are confirmed and scheduled plans for Wikimedia Labs to have a
>> live db replication arranged between the labs cluster and the wmf production
>> cluster.
> That's correct.
The main difficulty at the moment is that the plans for WMF Labs don't
seem to include database replication *in a form that makes it a
useful, direct replacement for toolserver*. This is a subtle point
that is easy for non-technical people to miss when they hear that
replication will be available. The toolserver database replicas are
useful because they are there, because they are easy to use, and
because users can directly join their own databases against them to
perform more complicated analysis and data gathering than would be
efficient otherwise.
Reading about the plans for labs in the past few days, I have seen the
following claims:
* "User databases will not be able to be joined against replicated
databases." The reasoning behind this seems to be a misunderstanding
of the role of "application logic". For MediaWiki itself, which works
with only a few articles at a time, joins can be efficiently simulated
within MediaWiki itself, or by making changes to the database schema
on the live servers. For a toolserver application that works with
millions of articles in a single query, it is silly to essentially
re-implement a SQL engine in the application logic - joins of this
size, which may require filesorts, should be done at the database
server level, not at the application level. That's why we use a
database server in the first place.
* "User databases will not be backed up directly, and users will have
to back them up manually". This is again a huge step backwards in
reliability, as most users don't have the time or experience to do
reliable backups of their own databases. The utterly predictable
outcome will be that one or more highly-used services will fail and
have no backup.
* "User home directories will somehow be deprecated." A key function
of the toolserver is "data analysis": users can simply run queries
against the replicated database, process the results, and use them to
plan things on the wikis. There is no "application" or "project" for
this - it is essentially ad hoc manual work. This kind of data
analysis could be done from a database dump, but then the data is far
out of date. It can be done using api.pm on the live wiki, but that is
prohibitively inefficient for queries that have to consider millions
of articles.
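To make the first point concrete, here is a minimal sketch of the
difference between a join done by the database server and the same
join simulated in application logic. It is only an illustration:
in-memory SQLite (with an attached second database) stands in for the
MySQL replica and a user database on the same server, and the table
names, column names, and data are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL replica server; all table and
# column names below are hypothetical, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS userdb")

# "Replicated" wiki table (lives in the main database).
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(1, "Alpha"), (2, "Beta"), (3, "Gamma")],
)

# User-maintained table (lives in the separate, attached database).
conn.execute("CREATE TABLE userdb.ratings (page_id INTEGER, rating TEXT)")
conn.executemany(
    "INSERT INTO userdb.ratings VALUES (?, ?)",
    [(1, "A"), (3, "B")],
)


def join_on_server(conn):
    """One query: the database engine performs the cross-database join."""
    return conn.execute(
        "SELECT p.page_title, r.rating "
        "FROM page AS p JOIN userdb.ratings AS r ON r.page_id = p.page_id "
        "ORDER BY p.page_title"
    ).fetchall()


def join_in_application(conn):
    """Two queries: the application pulls both tables and joins them by hand."""
    ratings = dict(conn.execute("SELECT page_id, rating FROM userdb.ratings"))
    pages = conn.execute("SELECT page_id, page_title FROM page").fetchall()
    rows = [(title, ratings[pid]) for pid, title in pages if pid in ratings]
    return sorted(rows)
```

Both functions return the same rows, but the second one has to
transfer every row of both tables to the client before it can discard
the ones that do not match. With three rows that is harmless; with
millions of articles it is the re-implemented SQL engine described
above.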
Looking at the discussions about labs in the past few days, it seems
clear to me that labs will be useful for some projects, particularly
for developing MediaWiki extensions. But the plans seem to make it
needlessly difficult for most of the things that the toolserver is
currently used for. The current plans seem to be intentionally
preventing toolserver users from simply migrating their tools to labs;
the result will be a great leap backwards when/if the toolserver is
taken offline. There is some ideological purity in that, but it will
result in a huge loss of functionality for the actual wikis, which
rely on the existing toolserver in many ways for normal, on-wiki
functionality. For example, there are many links in the interface on
enwiki to toolserver tools.
I do think it is silly for a WMF chapter to run the toolserver when it
is really a vital part of the infrastructure for the live wiki
projects. But the right solution is for the WMF to offer a convenient
replacement that will not require an unreasonable amount of effort for
migration.
- Carl