I had several emails in my inbox this morning because my tools were returning incorrect data.
It appears that about 6 hours ago enwiki, and only enwiki, stopped replicating. I now see that the Wikimedia developers have moved enwiki to its own cluster without warning or coordination. Not that they'd have anyone to warn: if I can't contact someone with authority, no doubt they are unable to as well.
Based on the prior track record I expect this to never be fixed, just as text replication was never fixed and the replication of the asia cluster wikis was never fixed.
Toolserver had become mostly useless for many of my projects without high-speed text access; now it is almost completely useless for all of my projects... and I'm tired of catching flak for the unreliability of the server. People have depended on the tools I provided, but are constantly let down by the unreliability of the service.
When I was granted access and when I spent many hours writing software I had an expectation that someone would be at least trying to maintain the system. I never expected that it would be ignored, that my work would go to waste, and that if I offered to do the work I too would be ignored. When I provided tools that allowed enwiki users to adjust their processes and work more effectively, I believed that they could rely on these tools working most of the time. I understand now that I was mistaken.
I am tired of wasting my time.
Because I can't even expect the nonexistent toolserver administration to perform the trivial action of turning off my account, I have deleted my ssh authorized key... thus my account is effectively disabled. So don't worry, you can go on doing nothing.
On 4/10/06, kate@zedler.knams.wikimedia.org wrote:
hello,
the account expiration date was originally scheduled for April 1st, but has been extended to May 1st. on this date, all accounts will expire (and no longer be usable) except those which have had the expiration date extended.
if you have an account, and you would like to keep it:
- if you have one or more working projects, please describe these (preferably with examples, URLs, etc.)
- if you do not yet have anything ready (particularly if you're a new user), please describe what you intend to work on. a rough estimate of when you expect it to be ready would be useful. if some issue is holding you up (e.g. lack of text access), please mention that.
if you no longer wish to use your account, please say so.
this information should be mailed to dab@daniel.baur4.info and cc'd to zedler-admins@wikimedia.org. (there's no particular deadline, but if you wait until one day before the expiration, you might find that your account expires because no-one managed to look at it yet...)
Gregory Maxwell wrote:
Because I can't even expect the nonexistent toolserver administration to perform the trivial action of turning off my account, I have deleted my ssh authorized key... thus my account is effectively disabled. So don't worry, you can go on doing nothing.
I'm in favour of giving you root access on the toolserver and a pppuser account for the main cluster, then you can do what you can with it. If you accepted the role and the others in control of the toolserver consented, I'd wish you luck, but I'm not sure how much you'd be able to do. The site is growing rapidly, and so is the data size and write load. We currently have somewhere around 500-600GB of InnoDB data across all our clusters. Zedler has 625GB total space in its main partition. If the data fits now, it won't leave much room, and it won't fit for long.
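As a rough back-of-the-envelope check of those figures, here is a minimal Python sketch. The 625GB capacity and the 500-600GB data range come from the paragraph above; the function names and the monthly growth rate used in the usage note are hypothetical, since no growth figure is given here.

```python
import math


def headroom_gb(capacity_gb, data_gb):
    """Free space left on the partition after loading the data set."""
    return capacity_gb - data_gb


def months_until_full(capacity_gb, data_gb, growth_gb_per_month):
    """Whole months before the data outgrows the partition.

    growth_gb_per_month is an assumed, constant growth rate; real growth
    is unlikely to be linear, so treat the result as a rough bound.
    """
    if growth_gb_per_month <= 0:
        raise ValueError("growth rate must be positive")
    free = headroom_gb(capacity_gb, data_gb)
    if free <= 0:
        return 0
    return math.floor(free / growth_gb_per_month)


CAPACITY_GB = 625  # zedler's main partition, from the message above

# Low and high ends of the quoted InnoDB data size.
for data_gb in (500, 600):
    print(data_gb, "GB of data leaves", headroom_gb(CAPACITY_GB, data_gb), "GB free")
```

For example, `months_until_full(625, 600, 10)` assumes a purely illustrative growth of 10GB per month; substitute a measured rate before drawing any conclusion from it.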
There's a reason we are splitting our databases into lots of different independently replicating clusters: our servers were having increasing difficulty keeping up with the write load while still serving a useful amount of read load. Zedler is experiencing that problem too; there are apparently times of the day or week when it can't keep up with the write rate, given its existing read rate.
Now, with careful optimisation of the read load, including halting all non-critical reads when lag is over a certain value, it may be possible to keep the lag low, even when replicating the entire multi-master data set. But that will mean limiting the number of expensive queries. And how long will it last, anyway?
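The lag-based throttling described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the `Query` type, the `may_run` function, and both thresholds are hypothetical, and the lag value is assumed to come from something like the `Seconds_Behind_Master` column of MySQL's `SHOW SLAVE STATUS`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical description of a read query submitted by a tool."""
    name: str
    critical: bool   # must run even when the replica is lagging
    expensive: bool  # known to put heavy read load on the server


def may_run(query, lag_seconds, max_lag=300, expensive_lag=60):
    """Decide whether a read query is admitted at the current lag.

    max_lag and expensive_lag are assumed thresholds in seconds:
    - above max_lag, all non-critical reads are halted;
    - above expensive_lag, expensive non-critical reads are halted.
    """
    if query.critical:
        return True
    if lag_seconds > max_lag:
        return False
    if query.expensive and lag_seconds > expensive_lag:
        return False
    return True


# Usage: an expensive report is held back once lag passes one minute.
report = Query(name="category report", critical=False, expensive=True)
print(may_run(report, lag_seconds=30))
print(may_run(report, lag_seconds=120))
```

Keeping the decision in a pure function makes the policy easy to tune, but it does nothing about queries that are already running when the lag climbs.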
The point I'm getting at is that maybe it's not reasonable to expect a server of zedler's type to replicate every transaction that occurs anywhere in our site. If we want a complete, up-to-date copy of all data at knams, we probably need to dedicate more hardware to it.
In any case, I think giving Gregory root access would be a positive step forward. He is obviously very motivated towards the toolserver project (or at least was), and it's my understanding that he's capable as well. Could Kate and/or the e.V. give some comment on this?
-- Tim Starling
Tim Starling wrote:
There's a reason we are splitting our databases into lots of different independently replicating clusters: our servers were having increasing difficulty keeping up with the write load while still serving a useful amount of read load. Zedler is experiencing that problem too; there are apparently times of the day or week when it can't keep up with the write rate, given its existing read rate.
I don't think Gregory disputed the need to keep the site running, and if making an en cluster is what it takes to do this, so be it. His point is the frustration from hearing "Hey, we made a new cluster. And by the way, the toolserver won't work with that, so everyone who invested months of work into the tools there is s*****d. So what." Taking the wheels and engine from a car and then giving someone the keys isn't really helpful in such a situation.
Magnus (who keeps trying no matter the treatment ;-)
Magnus Manske wrote:
I don't think Gregory disputed the need to keep the site running, and if making an en cluster is what it takes to do this, so be it. His point is the frustration from hearing "Hey, we made a new cluster. And by the way, the toolserver won't work with that, so everyone who invested months of work into the tools there is s*****d. So what." Taking the wheels and engine from a car and then giving someone the keys isn't really helpful in such a situation.
The toolserver will work with it, just like it will work with external storage. I know that Gregory knows this because we were discussing it on toolserver-l on March 29:
Gregory Maxwell wrote:
On 3/29/06, Tim Starling <tstarling at wikimedia.org> wrote:
[...]
It should be possible to set up 5 MySQL instances and have each of them replicating from a different master. Is anyone volunteering to set up those instances? Maybe we need to give root access to someone who actually cares about this stuff.
I, in effect, volunteered to do it when I previously pointed out on this list that it would solve that issue. I've set up replication with other RDBMSes, never MySQL... but there is already a configured instance to work from. I don't see how this could even be considered a challenge... it's the ongoing maintenance that carries the real burden.
wikitech-l@lists.wikimedia.org