On Tue, Nov 26, 2013 at 07:44:02PM +0200, Faidon Liambotis wrote:
That's "GET", space, backslash, newline, "ki/Random_article" ("Random_article" being just an example). This makes Varnish think the URL is "" and that "ki/Random_article HTTP/1.1" is some random malformed header, so it responds with a 503 (and not a 400 -- that's a bug of its own).
I debugged that last part, i.e. why we were emitting a 503 instead of a 400.
So, Varnish parses the request and makes the exact equivalent request to Apache -- it doesn't choke on the input but is, amazingly, perfectly happy with it. In turn, Apache sees a URL that neither starts with / nor is *, and correctly returns HTTP_BAD_REQUEST (400) internally (server/core.c, line 3560). However, since "HTTP/X.X" is missing from the first line, it interprets it as an HTTP/0.9 request and emits the Bad Request HTML directly, conversing in perfectly valid HTTP/0.9(!).
Only, HTTP/0.9 didn't have response headers at all, so that protocol conversation is:

  -> GET /
  <- <!DOCTYPE html>
  <- <html lang=en>
  <- [...]
Varnish, a piece of software first released in 2006, does not support HTTP/0.9. Therefore, it gets confused by the lack of response headers and interprets this response as a backend error, ultimately throwing a 503.
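For reference, the check that triggers the 400 looks roughly like this (paraphrased from memory from core_translate() in server/core.c; the exact line number and wording differ between httpd versions):

    /* Paraphrased, not a verbatim copy: a request URI must either
     * start with '/' or be exactly "*", otherwise the request is
     * rejected with a 400. */
    if (!r->uri || ((r->uri[0] != '/') && strcmp(r->uri, "*"))) {
        ap_log_rerror(APLOG_MARK, APLOG_ERR, 0, r,
                      "Invalid URI in request %s", r->the_request);
        return HTTP_BAD_REQUEST;
    }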
I checked the Apache source; there is unfortunately no way to increase the minimum supported protocol version to 1.0. I checked the Varnish source as well; the protocol detection seems to be a hardcoded sscanf(), but it shouldn't be too difficult to fix. Ideally, we should fix Varnish to drop HTTP/0.9 requests as malformed altogether, and obviously not add backend HTTP/0.9 support. Any takers? :)
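To illustrate the idea (just a sketch, not actual Varnish code -- the real parser is structured quite differently, and the function and variable names here are made up):

    #include <stdio.h>

    /* Sketch only: reject a request line whose protocol field is
     * missing or predates HTTP/1.0, instead of silently treating it
     * as HTTP/0.9 and passing it on to the backend. "proto" stands
     * for whatever follows the URL on the request line, if anything. */
    static int protocol_is_acceptable(const char *proto)
    {
        unsigned major, minor;

        if (proto == NULL)
            return 0;       /* no protocol field at all: HTTP/0.9 */
        if (sscanf(proto, "HTTP/%u.%u", &major, &minor) != 2)
            return 0;       /* malformed protocol field */
        return major >= 1;  /* require at least HTTP/1.0 */
    }

A caller in the request parser would then synthesize a 400 when this returns 0, rather than forwarding the request to the backend.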
I also tested adding an if (req.url == "") { error 400 "Bad request"; } to our VCL. It works as intended, but since these requests wouldn't be logged 1:1 in our log infrastructure, I opted not to add this snippet.
The first occurrence of such a request in our logs is 2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our logs, for all of November that I checked. Since then and until now we've had 83,010 such requests (about 1/3 of our total 5xx).
It seems this is gradually tapering off:

  5xx.tsv.log-20131125.gz:0
  5xx.tsv.log-20131126.gz:26837
  5xx.tsv.log-20131127.gz:77202
  5xx.tsv.log-20131128.gz:70318
  5xx.tsv.log-20131129.gz:13118
  5xx.tsv.log-20131130.gz:890
Whatever the cause for this was, breaking access to Wikipedia was probably noticed and fixed by whoever introduced that bug in the first place.
Regards, Faidon