<quote name="Risker" date="2015-05-28" time="09:53:31 -0400">
This is strictly a question from an uninvolved observer. Does this
schedule provide for sufficient time and real-time/hands-on testing
before changes hit the big projects?
</quote>
Yes. We still have the Beta Cluster (a production-like environment),
which runs all code within 10 minutes of it being merged into master.
<quote name="Risker" date="2015-05-28" time="09:53:31 -0400">
An IRC discussion I was following last evening suggested to me that the
first deploy (to test wikis and mw.org) probably did not get sufficient
hands-on testing/utilization to surface many issues that would be
significant on production wikis, which means only 24 hours on smaller
non-Wikipedia wikis, hoping that any problems will pop up before it's
applied to dewiki, frwiki and enwiki.
</quote>
Honestly, that's the wrong perspective to take on that incident
yesterday[0]. The issue is one that is hard to identify at low traffic
levels (one that only really manifests itself at Wikipedia-scale with
Wikipedia-scale caching). There will always be issues like this,
unfortunately. The way to mitigate them better is by changing how we
bucket requests to new or old versions of the software on production.
Currently we bucket by domain name/project site. This doesn't give us a
lot of flexibility in testing new versions at scales that can surface
issues but are not "everyone". We would need to be able to deploy new
versions based on a percentage of overall requests (i.e., 5% of all
users to the new version, then 10%, then everyone).
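To make that concrete, here's a minimal sketch of what percentage-based
bucketing could look like, hashing a stable request identifier so the
same client always lands in the same bucket. This is illustrative only,
not our actual deployment tooling; the function name, the choice of
identifier, and the hashing scheme are all assumptions:

    import hashlib

    def bucket_for_request(request_id: str, new_version_percent: float) -> str:
        # Hash a stable identifier (e.g. session ID) so the same client
        # is routed consistently across requests.
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        # Map the first 8 bytes of the digest to a value in [0.0, 100.0).
        value = int.from_bytes(digest[:8], "big") % 10000 / 100.0
        return "new" if value < new_version_percent else "old"

    # Ramp up: 5% of requests, then 10%, then everyone.
    for percent in (5.0, 10.0, 100.0):
        hits = sum(bucket_for_request("user-%d" % i, percent) == "new"
                   for i in range(100000))
        print("%5.1f%% target -> %.1f%% observed" % (percent, hits / 1000.0))

The point of hashing rather than sampling randomly is that a given user
stays on one version for the whole ramp-up, which is what lets
Wikipedia-scale caching effects actually show up in the new-version
bucket before everyone is on it.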
Best,
Greg
[0] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150527-Cookie
--
| Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |