<div dir="ltr"><div class="gmail_quote"><div>Hi! Replies inline...</div><div dir="ltr"><br></div><div dir="ltr">On Wed, Jan 25, 2017 at 10:59 AM Stas Malyshev <<a href="mailto:smalyshev@wikimedia.org">smalyshev@wikimedia.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi!<br class="gmail_msg">
>
> Yesterday we had search broken on mediawiki.org for some time, which
> was seemingly caused by a security patch not being applied correctly to
> production code. I am not sure what chain of events led to it (and
> please feel free to correct me if I misunderstood anything) but it makes
> me a bit worried, for the following reasons:
>
> 1. We had broken code which is not in gerrit or actually anywhere in
> repos running in production.

Half-true. The code is not in gerrit, but it *is* committed locally to the
deployment branches on the repos. They're always rebased on top, so they're
immediately visible in the git log.
> 2. As far as I know, none of the people who know the actual code that was
> broken knew there were problems with applying the patch (even though it
> has .failed as a filename in the filesystem, so I wonder what happened there)
> or that this particular piece of code was substantially different in
> production than in the repos. Of course we figured it out, but it was not
> immediately known AFAIK.

Yes, we did know there were problems applying these patches. This current
set of security patches has been particularly problematic to rebase each
week for the train. As discussed on IRC last night, the problem comes in
trying to sort out the merge conflicts. Most of the time we get it right and
nobody notices, but the context here wasn't great and we made a mistake.

The *.failed file was just a lack of cleanup after resolving the conflicts...
the patches *were* applied.
> 3. We knew about the problem only after stumbling on it by chance on
> mediawiki.org. It could just as well have been less visible but much more
> critical, and would have been deployed further by the production train
> before we noticed it.

Testing in beta/labs is great, but security patches by definition won't be
there.

That's the reason for group0 deployments on Tuesday and wider deployments
on Wed/Thurs. There is a group0 dashboard in Logstash that *everyone*
conducting the train should be watching. New things always need triage,
if not resolution, prior to rolling out further.

Had we not gotten a fix in place yesterday, I would've halted the train before
wider deployment.
> 4. We have out-of-repo code living in our production long term, while the
> code underneath it changes, and the developers of the code being patched
> are not aware of this going on. These changes may as well have invalidated
> the security patch instead of producing a visible error, and we wouldn't know.

Yep, this has always been a risk and has put a ton of cognitive load on
the people running the train.
> I understand the challenges with providing and applying security patches
> for code that is both public and in production under heavy use. But the
> failure scenario we've had is rather worrying for me and I wonder if we
> could do better in handling security patches to lower the probability of
> this happening in the future.

Yes, there are major problems with how we're doing this now.
> I don't know enough about the security patch handling process to propose
> anything, but I hope that people who do would help with this, so I'd
> like to initiate the discussion about it.

So, the security patch process is basically this:

0) Security patches are stored in /srv/patches/$VERSION/ on the deployment
   masters (tin/mira)
1) When the new branch is cut and checked out to the deployment master
   on Tuesday, patches are copied from the previous version in /srv/patches/
2) Apply them and pray they don't conflict
3) Rebase patches if needed, save the updated versions to /srv/patches/
4) Sync the code live

This falls apart at (3), which requires the deployer to patch code with which
they are very likely unfamiliar. We (mostly Mukunda) have been working on
scap functionality to automate steps 1-2 (a rough sketch of what that looks
like is in the P.S. below), but step 3 is always going to happen.

To help the situation, I think a few things are necessary:

- We (me, mostly) need to *drastically* cut down the time it takes to get
  security releases public. Carrying non-master fixes in production should
  be the exception, not the norm. This also helps us keep non-production
  wikis safer (I'm looking at you, beta).
- Using merges/shared git history between deployment branches instead of
  patchfiles would probably simplify a lot of this, though it needs further
  thinking through.
- A *clear* owner for any security patches in production (without having
  to dig through Phabricator). This person must be consulted on any
  non-trivial conflict, imho.

We need to do better...

-Chad
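
P.S. For the curious, here's a rough sketch of what the step 1-2 automation
amounts to. This is *not* the actual scap code: the /srv/patches/$VERSION
layout is as described above, but the function names, version strings, and
staging checkout path are made up for illustration, and the real tooling
handles plenty of cases this doesn't.

#!/usr/bin/env python3
# Illustrative sketch only: carry security patches forward to a new
# deployment branch and try to apply them, flagging any conflicts for a
# human (ideally the patch owner) to resolve by hand, i.e. step 3 above.
import shutil
import subprocess
from pathlib import Path

PATCH_ROOT = Path('/srv/patches')

def copy_patches(prev_version, new_version):
    """Step 1: copy patches over from the previous version's directory."""
    src = PATCH_ROOT / prev_version
    dst = PATCH_ROOT / new_version
    if not dst.exists():
        shutil.copytree(str(src), str(dst))
    return dst

def apply_patches(checkout, patch_dir):
    """Step 2: apply each patch with `git am`; return the ones that
    conflict so they can be rebased manually (step 3)."""
    failed = []
    for patch in sorted(patch_dir.glob('*.patch')):
        result = subprocess.run(['git', 'am', '--3way', str(patch)],
                                cwd=str(checkout))
        if result.returncode != 0:
            # Leave the working tree clean and flag the patch.
            subprocess.run(['git', 'am', '--abort'], cwd=str(checkout))
            failed.append(patch)
    return failed

if __name__ == '__main__':
    # Version strings and checkout path are placeholders.
    patch_dir = copy_patches('1.29.0-wmf.8', '1.29.0-wmf.9')
    staging = Path('/srv/mediawiki-staging/php-1.29.0-wmf.9')
    for patch in apply_patches(staging, patch_dir):
        print('NEEDS MANUAL REBASE:', patch)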