A few notes about the switch, in no particular order. I probably missed a few points, feel free to add your own.
Timeline:
09:35: deploy sizing of HHVM curl named pools
12:22: activating HTTPS + connection pooling, but staying on eqiad
12:37: stop using HTTPS+pooling for labswiki
12:46: point the codfw label back to the codfw cluster
13:22: Fix TTMServer elastic config
13:50: switch CirrusSearch traffic to codfw
13:53: rollback
14:35: switch mw1017 to codfw for CirrusSearch
14:47: switch all CirrusSearch traffic to codfw
Issues found and fixed:
* labswiki does not run on HHVM >= 3.9.0 but on Zend, which broke search. Fixed by an exception in configuration (https://gerrit.wikimedia.org/r/#/c/282145/). Making the CirrusSearch\Elastica\PooledHttp class more robust would be nice, but might not be possible (https://phabricator.wikimedia.org/T132075).
* All traffic (including updates) was sent to eqiad: copy/paste error in wmf-config/CirrusSearch-production.php, fixed by https://gerrit.wikimedia.org/r/#/c/282147/. We lost some updates; re-indexing is in progress.
* TTM configuration was broken by the change in CirrusSearch config. Fixed by https://gerrit.wikimedia.org/r/#/c/282154/.
* All wikis were in error for ~5 minutes, caused by an issue in the handling of an array in the CirrusSearch configuration. Fixed by https://gerrit.wikimedia.org/r/#/c/282163/1.
Issues discovered but not fixed:
* TTM does not handle multi DC
  - writes are done only to eqiad
  - which implies ttmserver index does not exist in codfw
  - in particular, saves do not work, which makes fixing this a blocker for the switch
  - dcausse and Nikerabbit seem to have a quick fix in mind
  - phab task created: https://phabricator.wikimedia.org/T132076
General lessons:
* Elasticsearch as measured from MediaWiki has a response time increase of ~50 ms for all query types, except MoreLike queries (which were already sent to codfw). MoreLike queries have a response time decrease of ~15 ms. Those differences seem to be fairly constant (similar across all percentiles). In a completely biased and unscientific experiment of using Wikipedia myself, I feel those differences the most on the autocompletion of the search box.
* Unit testing configuration is hard; testing it outside of prod is mostly impossible.
* Testing first by deploying manually on our test servers (mw1017, ...) definitely makes sense for all non-trivial changes.
* labswiki is running on Zend, not HHVM. I need to remember that and try to understand why.
* I now remember why I like strongly statically typed languages.
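
On the "unit testing configuration is hard" point: a rough sketch of the kind of sanity check that might catch a copy/paste error or a badly shaped array before it reaches prod. This is Python rather than the PHP in wmf-config, and everything in it is an assumption for illustration: the function name `check_cluster_map`, the shape of the map (datacenter label to list of host names), and the host names are all made up and do not reflect the real CirrusSearch configuration.

```python
"""Sanity checks for a per-datacenter search cluster map (hypothetical shape)."""


def check_cluster_map(clusters):
    """Return a list of human-readable problems found in `clusters`.

    `clusters` is assumed to map a datacenter label to a list of host names.
    An empty list means no problem was detected.
    """
    if not isinstance(clusters, dict):
        return [f"cluster map must be a dict, got {type(clusters).__name__}"]

    problems = []
    seen = {}  # frozenset of hosts -> first label that used that set
    for label, hosts in clusters.items():
        # Catch a bare string (or anything else) where a list is expected.
        if not isinstance(hosts, list) or not hosts:
            problems.append(f"{label}: expected a non-empty list of hosts, got {hosts!r}")
            continue
        bad = [h for h in hosts if not isinstance(h, str) or not h]
        if bad:
            problems.append(f"{label}: non-string or empty host entries {bad!r}")
            continue
        # Two labels pointing at the same hosts is a likely copy/paste error.
        key = frozenset(hosts)
        if key in seen:
            problems.append(f"{label}: same hosts as {seen[key]} (copy/paste error?)")
        else:
            seen[key] = label
    return problems


if __name__ == "__main__":
    # Made-up host names, for illustration only.
    examples = {
        "good": {"eqiad": ["search.eqiad.example"], "codfw": ["search.codfw.example"]},
        "copied": {"eqiad": ["search.eqiad.example"], "codfw": ["search.eqiad.example"]},
        "flattened": {"eqiad": "search.eqiad.example", "codfw": ["search.codfw.example"]},
    }
    for name, cfg in examples.items():
        print(name, check_cluster_map(cfg))
```

A check like this only covers the shape of the data, not whether the hosts are reachable, so it would complement rather than replace testing on mw1017 first.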