On Wed, Jun 8, 2016 at 4:48 PM, Antoine Musso hashar+wmf@free.fr wrote:
On 08/06/16 18:47, Antoine Musso wrote:
On 08/06/2016 at 15:02, Antoine Musso wrote:
The operations team worked hard this European morning to back up files, investigate the RAID issue, and set up a new host.
We are in the process of reinstalling everything on the new host and bringing Jenkins and Zuul back up on it.
No ETA yet: a five-year-old box is bound to have hidden issues, which makes it hard to estimate how long a full recovery will take.
A status update:
Ops (Jaime, Faidon, Mark, Chris) had a disk replaced and the RAID array is rebuilding right now. It should take roughly an hour from now. If the disk and RAID array are confirmed to be fine, we will bring back Jenkins and Zuul.
A new server, contint1001, has been installed. Jenkins data is being copied there. We will need to adjust a few network rules and update IP addresses in configuration files, then attempt to switch to that new setup.
Main task is: https://phabricator.wikimedia.org/T137265
The CI service has been back since 19:00 UTC, after a disk was replaced and the RAID array rebuilt successfully.
Thanks hashar and everyone who helped out.
Cheers, Katie
The issue might well occur again, so we will move the various services off the server (gallium).
-- Antoine Musso
QA mailing list QA@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/qa
wikitech-l@lists.wikimedia.org