Aude:
I have some idea about how the toolserver works, and I'm aware that there are issues with replication and that Yarrow (one of the servers) was down. What is the status now? It sounds like things are not 100% back to normal. How long (estimated) will it take to be back to normal?
all servers were down for about 1-2 days due to Wikimedia (who host the toolserver) moving the Amsterdam facility from one datacentre to another. as a result of some undetermined problem during the move, the disk array for one of the database servers (yarrow) didn't come up after the move, but the problem wasn't discovered until there was no one in the DC to fix it. Mark was able to visit a couple of days later and fixed the problem.
at this point, the database lag was about 6 days, i.e. 6 days' worth of changes had yet to be applied to the toolserver's copy of the database. unfortunately, before the s3 database was able to catch up, a Wikimedia sysadmin deleted the log files that contained these changes (which is commonly done when disk space runs short on the master database servers). without these logs, the missing changes cannot be replayed, so the s3 cluster is not replicating.
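to illustrate why the deleted logs stop replication, here is a minimal sketch in python. it is a toy model only, not the actual database software or the toolserver's setup: the `Master` and `Replica` classes, their methods and the position numbers are all hypothetical, and one log entry stands in for one day of changes.

```python
# toy model of log-based replication; every name here is hypothetical

class Master:
    def __init__(self):
        self.first_pos = 0   # position of the oldest log entry still on disk
        self.log = []        # retained log entries, oldest first

    @property
    def end_pos(self):
        # position just past the newest log entry
        return self.first_pos + len(self.log)

    def write(self, change):
        self.log.append(change)

    def purge_before(self, pos):
        # delete log entries older than pos, e.g. to free disk space
        pos = max(self.first_pos, min(pos, self.end_pos))
        del self.log[:pos - self.first_pos]
        self.first_pos = pos

    def read_from(self, pos):
        # a replica can only resume if its next entry is still on disk
        if pos < self.first_pos:
            raise LookupError("log entries before %d were purged" % self.first_pos)
        return self.log[pos - self.first_pos:]


class Replica:
    def __init__(self):
        self.pos = 0         # position of the next entry to apply
        self.data = []       # changes applied so far

    def lag(self, master):
        # number of changes not yet applied to this copy
        return master.end_pos - self.pos

    def catch_up(self, master):
        changes = master.read_from(self.pos)
        self.data.extend(changes)
        self.pos += len(changes)
```

in this model, a replica that is 6 entries behind when the master purges its older logs gets a `LookupError` from `catch_up()` and stays where it was, while a replica that had already caught up carries on as normal. the lagging copy would then have to be rebuilt from a fresh copy of the data rather than from the logs.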
as i mentioned in a previous mail [0], we're not sure when this will be repaired, but at this point, i would say it's likely to be after the new servers arrive.
s1 and s2 (which is on a different server) weren't affected.
There are a number of tools linked from the featured article candidates pages. Are these all working 100%? What other tools are affected? http://en.wikipedia.org/wiki/Wikipedia:Featured_article_candidates/Typhoon_R...
only queries on the s3 cluster, which does not include en.wikipedia.org, are affected.
- river.
[0] http://lists.wikimedia.org/pipermail/toolserver-announce/2009-January/000059...