On Sat, Aug 17, 2013 at 8:40 PM, Ken Snider ksnider@wikimedia.org wrote:
On Aug 17, 2013, at 1:33 PM, rupert THURNER rupert.thurner@gmail.com wrote:
hi faidon, i do not think you personally, or WMF in general, are particularly helpful in accepting contributions, because you:
- do not communicate openly the problems
- do not report upstream publicly
- do not ask for help, and even if it gets offered you just ignore it, with quite some arrogance
Rupert, please don't call out or attack specific people. We're all on the same team, and I can ...
... let me change the title, as this is not site hardening any more.
Further, Ops in general, and Faidon in particular, routinely report issues upstream. Our recent bug reports or patches to Varnish and Ceph are two examples that easily come to mind. Faidon was (rightly) attempting to restore service first ...
yes ken, you are right, let's stick to the issues at hand: (1) by when will you finally decide to invest the 10 minutes and properly trace the gitblit application? you have the commands in the ticket: https://bugzilla.wikimedia.org/show_bug.cgi?id=51769
(2) by when will you adjust your operating guidelines, so it is clear to faidon, ariel and others that 10 minutes of tracing the application and getting a holistic view is mandatory _before_ restoring the service, if it goes down so often, and for days every time. the extra 10 minutes will not be noticed when the service is down for more than a day.
(3) how will you handle offers of help from the community in future? like in the gitblit case, i offered to help trace the problem while the service was down. max semenik now reported that gitblit should set rel="nofollow".
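for readers who have not seen it, rel="nofollow" is nothing more than an attribute on the link markup itself, roughly like this (the url here is purely illustrative, not gitblit's actual link layout):

<!-- illustrative only: an archive download link marked so that well-behaved crawlers are told not to follow it -->
<a href="/zip/?r=example.git&amp;h=master" rel="nofollow">download as zip</a>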
best regards, rupert.
What information are you hoping to get from a trace that isn't currently known?
I'm not involved with the issue, and don't know specifics, but reading the bug it sounds like there isn't any information we need from a trace at the moment. (It sounds like there was a point in the debugging stage where that would be useful, but that's in the past)
If you want people to trace something, you should justify why it will help you fix the issue/discover what's going on/etc. I would consider it inappropriate for ops to wait 10 minutes before restarting something in order to get a stack trace, if we didn't need the info in the trace.
--bawolff
On Sat, Aug 17, 2013 at 10:48 PM, bawolff bawolff+wn@gmail.com wrote:
What information are you hoping to get from a trace that isn't currently known?
if a web application dies or stops responding, this can be (1) caused by too many requests for the hardware it runs on, which can be influenced from outside the app by robots.txt, caching, etc., and inside the app by links, e.g. using "nofollow". but it can also be (2) caused by the application itself. a java application uses more or fewer operating system resources depending on how it is written. one might find this out just by reading the code, but having a trace helps a lot here: a trace may reveal locking problems in the case of multi-threading, string operations causing OS calls for every character, excessive object creation and garbage collection, and hundreds of other things. it is not necessary to wait until it stalls again to get the trace; many things can be seen during normal operation as well.
so i hope to get (2). (1) was handled ok in my opinion.
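to make it concrete, here is a minimal java sketch (illustrative only, these are not the commands from the bugzilla ticket) of the kind of snapshot i mean, using the jvm's standard ThreadMXBean to dump every thread's stack and check for deadlocked threads while the service keeps running:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dump a summary of all live threads in the current JVM and report any
// threads the JVM considers deadlocked. This is the in-process equivalent
// of taking a thread dump while the service is still up.
public class ThreadSnapshot {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        // Print name, state and (truncated) stack for every live thread.
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            System.out.print(info);
        }

        // Threads stuck waiting on each other's monitors or ownable synchronizers.
        long[] deadlocked = bean.findDeadlockedThreads();
        if (deadlocked == null) {
            System.out.println("no deadlocked threads found");
        } else {
            for (ThreadInfo info : bean.getThreadInfo(deadlocked, true, true)) {
                System.out.print(info);
            }
        }
    }
}

the same snapshot can of course be taken from outside the process with the standard jdk tools, e.g. jstack or jcmd Thread.print against the running pid.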
rupert
On Sun, Aug 18, 2013 at 5:37 AM, rupert THURNER rupert.thurner@gmail.com wrote:
On Sat, Aug 17, 2013 at 10:48 PM, bawolff bawolff+wn@gmail.com wrote:
What information are you hoping to get from a trace that isn't currently known?
if a web application dies or stops responding, this can be (1) caused by too many requests for the hardware it runs on ... but it can also be (2) caused by the application itself ... so i hope to get (2). (1) was handled ok in my opinion.
As you've been told numerous times in the past, we already determined what this specific issue is. It's generating zip files for a large number of spider requests. What is the point of tracing that? Are you going to do optimizations on zip generation? Spiders don't need to index the zips, so we disallowed it. There's no point in wasting time debugging a problem we have no plans to solve.
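For illustration, the disallow rule amounts to something along these lines (the /zip/ path is only a guess at the archive download URL pattern, not necessarily what is actually deployed):

# hypothetical robots.txt sketch; the path below is a placeholder for wherever gitblit serves its zip/archive downloads
User-agent: *
Disallow: /zip/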
If you want to reproduce this, set up gitblit, clone a number of repos, then crawl your mirror. After doing so, generate your own stacktrace. We're not going to waste our time with this, so drop it.
That said, if we have an issue in the future not related to the zips, maybe it's worth generating a stacktrace for that.
- Ryan
On 18/08/13 07:37, rupert THURNER wrote:
a trace may reveal locking problems in case of multi threading
CPU usage was high, so there were no locking problems.
, string operations causing OS calls for every character,
System CPU was negligible.
creating and garbage collecting objects, and 100s of others.
Profiling of relevant requests, with the site up, could answer this and hundreds of similar questions, more effectively than a single stack trace collected from a process with high CPU. It will not answer the question "why was Google downloading these files with such high concurrency", which apparently is the relevant question here.
it is not necessary to wait until it stalls again to get the trace.
There was no stall -- it was a CPU overload, which is a different thing. Collecting stack traces certainly can help with stall diagnosis.
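To make the distinction concrete, here is a rough sketch (not something that was actually run against gitblit) of how per-thread CPU time from the JVM's ThreadMXBean separates the two cases: threads accumulating CPU time point to an overload, threads sitting blocked or waiting point to a stall.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sample per-thread CPU time over one second: threads that burn CPU
// suggest an overload, threads that are blocked/waiting suggest a stall.
public class OverloadVsStall {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported()) {
            System.out.println("per-thread CPU time not supported on this JVM");
            return;
        }
        long[] ids = bean.getAllThreadIds();

        long[] before = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            before[i] = bean.getThreadCpuTime(ids[i]); // nanoseconds, -1 if unavailable
        }
        Thread.sleep(1000);

        for (int i = 0; i < ids.length; i++) {
            ThreadInfo info = bean.getThreadInfo(ids[i]);
            if (info == null) continue; // thread exited during the sample
            long delta = bean.getThreadCpuTime(ids[i]) - before[i];
            if (delta > 100_000_000L) { // more than 100 ms of CPU in the last second
                System.out.println("busy (overload suspect): " + info.getThreadName());
            } else if (info.getThreadState() == Thread.State.BLOCKED
                    || info.getThreadState() == Thread.State.WAITING) {
                System.out.println("blocked (stall suspect): " + info.getThreadName());
            }
        }
    }
}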
-- Tim Starling
On Sat, Aug 17, 2013 at 10:19:10PM +0200, rupert THURNER wrote:
(2) by when will you adjust your operating guidelines, so it is clear to faidon, ariel and others that 10 minutes of tracing the application and getting a holistic view is mandatory _before_ restoring the service, if it goes down so often, and for days every time. the extra 10 minutes will not be noticed when the service is down for more than a day.
I think you're making several incorrect assumptions and mistakes here.
First, this is, respectfully, a wrong approach to how we should react to emergencies. git.wm.org as a service has owners, and I'm not one of them. My reaction was to an outage of a service I know next to nothing about, on a Sunday. The service was down, and the first priority was to restore it and make sure it wouldn't die again when I wasn't looking at my screen. An overload caused by Googlebot was deemed to be the cause, and I decided that Google indexing could be temporarily suspended until the situation could be properly assessed, by the right people, with a clear head. I still think it was a fair compromise to make.
Second, you're assuming that the reason for the outage was a software bug and that a stack trace was going to be helpful. I haven't followed up, but I don't think that's correct here -- and even if it were, it's wrong to assume it always will be; that's what post-mortems are for. Preliminary investigation showed the website being crawled for every single patch by every single author, plus stats going back random amounts of time, for all Wikimedia git projects. All software will break under the right amount of load, and there's nothing stack traces can help with here. (rel=nofollow would help, but a stack trace wouldn't tell you this.)
Third, you're assuming that others are not working with gitblit upstream. Chad is already collaborating with gitblit upstream and had done so before this happened. He's already engaging with them on other issues potentially relevant to this outage[1]. He also has an excellent track record from (at least) his previous collaboration with Gerrit upstream. Finally, Ori contributed a patch, since merged, addressing one of the root causes of gitblit outages[2].
Fourth, you're extrapolating my personal "pushing upstream" attitude from a single incident and a single interaction with me, from there extrapolating the team's and the foundation's attitude, and finally reaching the conclusion that we won't collaborate with upstreams for HTTPS. These are all incorrect extrapolations and wild guesses.
I can tell you for a fact that you couldn't be farther from the truth. Both I and others work closely with a large number of upstreams regularly. I've worked with all kinds of upstreams for years, long before I joined the foundation, and I'm not planning to stop anytime soon -- it's also part of my job description and of the organizational mandate, so it's much more than my personal modus operandi.
Finally, specifically for the cases you mention: for HTTPS, ironically, I sent a couple of patches to httpd-dev last week for bugs that I found while playing with Apache 2.4 for Wikimedia's SSL cluster. One of them came after discussions with Ryan on SSL sessions security and potential attacks[3]. As for DNS, I worked closely with upstream[4], Brandon, for gdnsd when I was still evaluating it for use by Wikimedia (and I even maintain it in Debian nowadays[5]); Brandon was hired by the foundation a few months later -- it's hard to get better relations with upstream than having them on the team, don't you think?
I hope these address your concerns. If not, I'd be happy to provide more information and take feedback, but please, let's keep it civil, let's keep it technical and let's keep it on-topic and in perspective -- the issue at hand (security & protection from state surveillance) is far too important for us to be discussing the response to a minor gitblit outage in the same thread, IMHO.
Best, Faidon
1: https://code.google.com/p/gitblit/issues/detail?id=274
2: https://github.com/gitblit/gitblit/commit/da71511be5a4d205e571b548d6330ed7f7...
3: http://mail-archives.apache.org/mod_mbox/httpd-dev/201308.mbox/%3C2013080511...
4: https://github.com/blblack/gdnsd/issues/created_by/paravoid?state=closed
5: http://packages.debian.org/sid/gdnsd