I am preparing the Technology Report for this week's edition of the Signpost on enwiki (see http://en.wikipedia.org/wiki/User:Aude/Technology_report).
I have some idea of how the toolserver works, and I'm aware that there are issues with replication and that Yarrow (one of the servers) was down. What is the status now? It sounds like things are not 100% back to normal. How long (estimated) will it take to get back to normal?
There are a number of tools linked from the featured article candidates pages. Are these all working 100%? What other tools are affected? (See http://en.wikipedia.org/wiki/Wikipedia:Featured_article_candidates/Typhoon_R...)
Any information for the Signpost would be most appreciated.
Cheers, -Aude
Aude:
> I have some idea of how the toolserver works, and I'm aware that there are issues with replication and that Yarrow (one of the servers) was down. What is the status now? It sounds like things are not 100% back to normal. How long (estimated) will it take to get back to normal?
all servers were down for about 1-2 days due to Wikimedia (who host the toolserver) moving the Amsterdam facility from one datacentre to another. as a result of some undetermined problem during the move, the disk array for one of the database servers (yarrow) didn't come up after the move, but the problem wasn't discovered until there was no one in the DC to fix it. Mark was able to visit a couple of days later and fixed the problem.
at this point, the database lag was about 6 days, i.e. 6 days worth of changes had yet to be applied to the toolserver's copy of the database. unfortunately, before the s3 database was able to catch up, a Wikimedia sysadmin deleted the log files that contained these changes (which is commonly done due to shortness of disk space on the master database servers). without these logs, replication cannot happen, so the s3 cluster is not replicating.
as i mentioned in a previous mail [0], we're not sure when this will be repaired, but at this point, i would say it's likely to be after the new servers arrive.
s1 and s2 (which is on a different server) weren't affected.
> There are a number of tools linked from the featured article candidates pages. Are these all working 100%? What other tools are affected? (See http://en.wikipedia.org/wiki/Wikipedia:Featured_article_candidates/Typhoon_R...)
only queries on the s3 cluster, which does not include en.wikipedia.org, are affected.
- river.
[0] http://lists.wikimedia.org/pipermail/toolserver-announce/2009-January/000059...
River Tarnell:
> a Wikimedia sysadmin deleted the log files that contained these changes (which is commonly done due to shortness of disk space on the master database servers).
forgot to mention: i explained the problems this causes us to Wikimedia, and provided a way for them to see which binlogs can be safely deleted, so this hopefully won't happen again.
- river.
Thank you for the quick response. To make sure I understand correctly, this means that tools are running off of old copies of the Wikipedia database (and those for other Wikimedia projects).
The tools seem to work, but they won't be up to date until this is resolved. I also tried the Contributors tool, which seems up to date for enwiki, so enwiki is still being replicated. But I made an edit to my Arabic Wikipedia user page, and it's not showing up in the Contributors tool. http://toolserver.org/~daniel/WikiSense/Contributors.php
So it seems the impact is minimal for English Wikipedia users (please correct me if I'm wrong). I can still mention something, since the toolserver was down for a few days and people may have wondered about that.
-Aude
(In reply to River Tarnell <river@loreley.flyingparchment.org.uk>, Sat, Jan 10, 2009, 2:56 AM)
Toolserver-l mailing list Toolserver-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Aude:
> Thank you for the quick response. To make sure I understand correctly, this means that tools are running off of old copies of the Wikipedia database (and those for other Wikimedia projects).
yes, this is correct, for databases in the s3 cluster.
- river.
I added the following to the Technology Report, trying to relate what this means for the general user.
The toolserver was down for a couple of days around New Year's, because Wikimedia was moving its Amsterdam facility, which hosts the toolserver, from one data center to another. After the move, the disk array for one of the database servers (yarrow) did not come back up, and it took a few days to fix. Before the database could catch up, the log files it needed were deleted, so replication on one of the clusters (s3) is not working. This means the toolserver's copies of some Wikipedia and other Wikimedia project databases are currently out of date. However, the replication problem affects only some of the smaller projects, not English Wikipedia. Some interwiki tools such as CheckUsage (for images) are not fully functional at this time and produce outdated results.
Again, thanks for the quick reply and let me know if what I wrote needs any adjustments.
Regards, Aude
(In reply to River Tarnell <river@loreley.flyingparchment.org.uk>, Sat, Jan 10, 2009, 3:13 AM)
I already found some typos :(
-Aude
(In reply to Aude <audevivere@gmail.com>, Sat, Jan 10, 2009, 3:24 AM)