-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Aude:
I have some idea about how toolserver works, aware that there are issues with
replication and that Yarrow (one of the servers) was down. What is the status
now? It sounds like things are not 100% back to normal. How long (estimated)
will it take to be back to normal?
all servers were down for about 1-2 days due to Wikimedia (who host the
toolserver) moving the Amsterdam facility from one datacentre to another. as a
result of some undetermined problem during the move, the disk array for one of
the database servers (yarrow) didn't come up after the move, but the problem
wasn't discovered until there was no one in the DC to fix it. Mark was able to
visit a couple of days later and fixed the problem.
at this point, the database lag was about 6 days, i.e. 6 days' worth of changes
had yet to be applied to the toolserver's copy of the database. unfortunately,
before the s3 database was able to catch up, a Wikimedia sysadmin deleted the
log files that contained these changes (which is commonly done due to shortness
of disk space on the master database servers). without these logs, replication
cannot happen, so the s3 cluster is not replicating.
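as a rough illustration of why the missing logs stop replication, here is a toy
model in Python. it is only a sketch: the function name, the numbered log
positions and the figures are all made up for the example, and it is not how
MySQL or the toolserver actually tracks replication.

```python
# toy model: a replica can only resume if the master still holds
# every log entry after the replica's last applied position.
# all names and numbers here are hypothetical, for illustration only.

def can_resume(replica_position, oldest_log_on_master):
    """Return True if the next entry the replica needs still exists."""
    next_needed = replica_position + 1
    return next_needed >= oldest_log_on_master


# the master has written entries 1..100; the replica stopped at 40.
replica_position = 40

# while entries 1..100 are all kept, the replica can catch up.
print(can_resume(replica_position, oldest_log_on_master=1))

# once entries up to 60 are deleted to free disk space, entries
# 41..60 are gone and the replica cannot continue from 40.
print(can_resume(replica_position, oldest_log_on_master=61))
```

the second case corresponds to what happened to s3: the changes it still needed
were in the deleted logs.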
as i mentioned in a previous mail [0], we're not sure when this will be
repaired, but at this point, i would say it's likely to be after the new
servers arrive.
s1 and s2 (which is on a different server) weren't affected.
There are a number of tools linked from the featured article candidates
pages. Are these all working 100%? What other tools are affected?
http://en.wikipedia.org/wiki/Wikipedia:Featured_article_candidates/Typhoon_…
only queries on the s3 cluster, which does not include en.wikipedia.org, are
affected.
- river.
[0]
http://lists.wikimedia.org/pipermail/toolserver-announce/2009-January/00005…
-----BEGIN PGP SIGNATURE-----
iD8DBQFJaFPEIXd7fCuc5vIRAuEwAJ9S/NJ7hMO9iUZsJTdbJff3xFgWiACfT5NF
N7sMwKSsYDEn1nd+Y4WS6nY=
=x2pX
-----END PGP SIGNATURE-----