Hi!

TL;DR: The Analytics team added automatic failover for the HDFS NameNode. A new daemon called hadoop-hdfs-zkfc (port 8019) is now running on the analytics1001/1002 hosts; it is responsible for talking to ZooKeeper and running periodic health checks. Monitoring and Ferm rules have been added.

More info:

1) https://phabricator.wikimedia.org/T129838
2) https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Automatic_Failover
3) Monitoring and Ferm rules added: https://gerrit.wikimedia.org/r/#/c/279408/
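
For anyone who wants a quick sanity check that hadoop-hdfs-zkfc is listening, here is a minimal Python sketch. It only tests whether TCP port 8019 accepts connections and says nothing about the failover logic itself. The helper name port_is_open and the short hostnames are illustrative assumptions, so adjust them to the fully qualified names you use.

```python
import socket

ZKFC_PORT = 8019  # port used by hadoop-hdfs-zkfc, as stated in this mail


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Short hostnames are an assumption; replace with FQDNs as needed.
    for host in ("analytics1001", "analytics1002"):
        state = "open" if port_is_open(host, ZKFC_PORT) else "unreachable"
        print(f"{host}:{ZKFC_PORT} {state}")
```

Note that an "unreachable" result may simply mean the Ferm rules do not allow connections from the host you run this on, not that the daemon is down.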

Let me know if you have any questions!

Luca