A few notes about the switch, in no particular order. I probably
missed a few points; feel free to add your own.
Timeline:
09:35: deploy sizing of HHVM curl named pools
12:22: activating HTTPS + connection pooling, but staying on eqiad
12:37: stop using HTTPS+pooling for labswiki
12:46: point the codfw label back to the codfw cluster
13:22: Fix TTMServer elastic config
13:50: switch CirrusSearch traffic to codfw
13:53: rollback
14:35: switch mw1017 to codfw for CirrusSearch
14:47: switch all CirrusSearch traffic to codfw
Issues found and fixed:
* labswiki does not run on HHVM >= 3.9.0 but on Zend, so search was
broken there. Fixed by an exception in configuration
(https://gerrit.wikimedia.org/r/#/c/282145/). Making the
CirrusSearch\Elastica\PooledHttp class more robust would be nice, but
might not be possible (https://phabricator.wikimedia.org/T132075).
* All traffic (including updates) was sent to eqiad: copy/paste error
in wmf-config/CirrusSearch-production.php, fixed by
https://gerrit.wikimedia.org/r/#/c/282147/. We lost some updates;
re-indexing is in progress.
* TTM configuration was broken by the change in CirrusSearch config.
Fixed by https://gerrit.wikimedia.org/r/#/c/282154/.
* All wikis were in error for ~5 minutes, caused by an issue in the
handling of an array in the CirrusSearch configuration. Fixed by
https://gerrit.wikimedia.org/r/#/c/282163/1.
Issues discovered but not fixed:
* TTM does not handle multiple DCs
- writes are done only to eqiad
- which implies the ttmserver index does not exist in codfw
- in particular, saves do not work, which makes fixing this a
blocker for the switch
- dcausse and Nikerabbit seem to have a quick fix in mind
- phab task created:
https://phabricator.wikimedia.org/T132076
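To make the multi-DC gap above concrete, here is a rough Python
sketch of fanning a single write out to every configured cluster.
This is not the TTMServer or CirrusSearch code, and it is not the
quick fix mentioned above; the function name, the writer callables
and the cluster names used with it are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical writer type: takes a document and stores it in one cluster.
Writer = Callable[[dict], None]


def write_to_all_clusters(doc: dict, writers: Dict[str, Writer]) -> List[str]:
    """Send one write to every cluster and return the clusters that failed."""
    failed = []
    for name, write in writers.items():
        try:
            write(doc)
        except Exception:
            # A real implementation would probably queue a retry;
            # this sketch only records which cluster missed the write.
            failed.append(name)
    return failed
```

With one writer per DC (say "eqiad" and "codfw"), an empty return
value means both indices received the write, and a non-empty one
names the clusters that need catching up.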
General lessons:
* Elasticsearch response time, as measured from MediaWiki, increased
by ~50 ms for all query types except MoreLike queries (which were
already sent to codfw). MoreLike queries have a response time
decrease of ~15 ms. Those differences seem to be fairly constant
(similar across all percentiles). With a completely biased and
unscientific experiment of using Wikipedia myself, I feel those
differences the most on the autocompletion of the search box.
* unit testing configuration is hard, testing it outside of prod is
mostly impossible
* testing first by deploying manually on our test servers (mw1017,
...) definitely makes sense for all non-trivial changes
* labswiki is running on Zend, not HHVM; I need to remember that and
try to understand why
* I now remember why I like strongly statically typed languages
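
On the configuration-testing and typing points above, a small shape
check run before deployment could catch array-handling mistakes of
the kind described earlier. The Python sketch below is only an
illustration: the config layout (cluster name mapped to a list of
host strings) and the function name are assumptions, not the actual
wmf-config format.

```python
from typing import Any, Dict, List


def validate_cluster_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for cluster, hosts in config.items():
        if not isinstance(hosts, list):
            # Typical copy/paste mistake: a bare string where a list is expected.
            problems.append(
                f"{cluster}: expected a list of hosts, got {type(hosts).__name__}"
            )
            continue
        for host in hosts:
            if not isinstance(host, str) or not host:
                problems.append(f"{cluster}: invalid host entry {host!r}")
    return problems
```

A check like this only validates shape, not behaviour, so it
complements rather than replaces testing on mw1017 first.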
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation