On Sun, Sep 30, 2012 at 2:32 PM, Erik Moeller <erik@wikimedia.org> wrote:
> On Sun, Sep 30, 2012 at 10:24 AM, Krinkle <krinklemail@gmail.com> wrote:
> > I just wanted to make sure you know that there are confirmed and scheduled plans for Wikimedia Labs to have a live db replication arranged between the labs cluster and the wmf production cluster.
> That's correct.
The main difficulty at the moment is that the plans for WMF Labs don't seem to include database replication *in a form that makes it a useful, direct replacement for toolserver*. This is a subtle point that is easy for non-technical people to miss when they hear that replication will be available. The toolserver database replicas are useful because they are there, because they are easy to use, and because users can directly join their own databases against them to perform more complicated analysis and data gathering than would be efficient otherwise.
Reading about the plans for labs in the past few days, I have seen the following claims:
* "User databases will not be able to be joined against replicated databases." The reasoning behind this seems to be a misunderstanding of the role of "application logic". For MediaWiki itself, which works with only a few articles at a time, joins can be efficiently simulated within MediaWiki itself, or avoided by making changes to the database schema on the live servers. For a toolserver application that works with millions of articles in a single query, it is silly to essentially re-implement a SQL engine in the application logic - joins of this size, which may require filesorts, should be done at the database server level, not at the application level. That's why we use a database server in the first place.
* "User databases will not be backed up directly, and users will have to back them up manually". This is again a huge step backwards in reliability, as most users don't have the time or experience to do reliable backups of their own databases. The utterly predictable outcome will be that one or more highly-used services will fail and have no backup.
* "User home directories will somehow be deprecated." A key function of the toolserver is "data analysis": users can simply run queries against the replicated database, process the results, and use them to plan things on the wikis. There is no "application" or "project" for this - it is essentially ad hoc manual work. This kind of data analysis could be done from a database dump, but then the data is far out of date. It can be done using api.pm on the live wiki, but that is prohibitively inefficient for queries that have to consider millions of articles.
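To make the first point about joins concrete, here is a minimal sketch in Python using an in-memory SQLite database. The schema is an assumption made for illustration only: a cut-down "page" table standing in for a replicated wiki table, and a hypothetical "user_ratings" table standing in for a user database. It is not the real replica layout and not anything taken from the labs plans.

```python
import sqlite3


def build_demo_db():
    """Create a toy database: 'page' plays the replica, 'user_ratings' the user table.

    Both table layouts are simplified assumptions, not the production schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT
        );
        CREATE TABLE user_ratings (
            page_id INTEGER PRIMARY KEY,
            quality TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [(1, 0, "Algebra"), (2, 0, "Geometry"), (3, 0, "Topology"), (4, 1, "Algebra")],
    )
    conn.executemany(
        "INSERT INTO user_ratings VALUES (?, ?)",
        [(1, "B"), (3, "Stub"), (4, "B")],
    )
    return conn


def join_in_database(conn):
    """Let the database server do the join, filter, and sort in one query."""
    return conn.execute(
        """
        SELECT p.page_title, r.quality
        FROM page AS p
        JOIN user_ratings AS r ON r.page_id = p.page_id
        WHERE p.page_namespace = 0
        ORDER BY p.page_title
        """
    ).fetchall()


def join_in_application(conn):
    """Simulate the same join in application code, as is needed when the
    two tables cannot be joined on the server."""
    # Pull the whole user table into memory as a lookup dictionary.
    ratings = dict(conn.execute("SELECT page_id, quality FROM user_ratings"))
    result = []
    # Stream every candidate row from the replica and match it by hand.
    for page_id, title in conn.execute(
        "SELECT page_id, page_title FROM page WHERE page_namespace = 0"
    ):
        if page_id in ratings:
            result.append((title, ratings[page_id]))
    result.sort()  # the application also has to do its own sorting
    return result


if __name__ == "__main__":
    conn = build_demo_db()
    print(join_in_database(conn))
    print(join_in_application(conn))
```

On a handful of rows both functions give the same answer. The difference is that the second one has to pull every candidate row across the connection, hold one side of the join in memory, and sort the result itself, which is exactly the work that becomes impractical at millions of rows.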
Looking at the discussions about labs in the past few days, it seems clear to me that labs will be useful for some projects, particularly for developing MediaWiki extensions. But the plans seem to make it needlessly difficult to do most of the things that the toolserver is currently used for. The current plans seem to be intentionally preventing toolserver users from simply migrating their tools to labs; the result will be a great leap backwards when/if the toolserver is taken offline. There is some ideological purity in that, but it will result in a huge loss of functionality for the actual wikis, which rely on the existing toolserver in many ways for normal, on-wiki functionality. For example, there are many links in the interface on enwiki to toolserver tools.
I do think it is silly for a WMF chapter to run the toolserver when it is really a vital part of the infrastructure for the live wiki projects. But the right solution is for the WMF to offer a convenient replacement that will not require an unreasonable amount of effort for migration.
- Carl