Hi,
We've been having spikes in our 5xx error logs since yesterday. There are definitely multiple distinct causes for those, incl. esams network issues, random people trying to DoS us, MediaWiki bugs that got backported yesterday etc.
One of the most peculiar causes of errors, though, is requests of this form: GET \nki/Random_article HTTP/1.1 Host: en.wikipedia.org ...
That's GET space backslash newline ki/Random_article ("Random_article" being an example). This makes Varnish think the URL is "" and "ki/Random_article HTTP/1.1" some random malformed header and so it responds with a 503 (and not a 400 -- that's a bug of its own).
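To make the parsing problem concrete, here is a minimal Python sketch of a naive line-based request parser. It is not Varnish's parser and does not reproduce how Varnish arrives at an empty URL. It only illustrates why the first line lacks a usable target and protocol version, and why the rest ends up on a line that looks like a header without a colon. The function names are hypothetical, and the sample bytes assume "space backslash newline" means literally 0x20 0x5C 0x0A.

```python
def split_request_head(raw: bytes):
    """Naively split a request head into (request-line tokens, header lines)."""
    lines = raw.replace(b"\r\n", b"\n").split(b"\n")
    tokens = lines[0].split()                     # method, target, version
    headers = [line for line in lines[1:] if line]
    return tokens, headers


def looks_malformed(raw: bytes) -> bool:
    """Flag requests whose request line or header lines are not well-formed."""
    tokens, headers = split_request_head(raw)
    # A well-formed request line has exactly three tokens ending in HTTP/x.y.
    if len(tokens) != 3 or not tokens[2].startswith(b"HTTP/"):
        return True
    # Every header line needs a "name: value" colon.
    return any(b":" not in header for header in headers)


# Assumed wire form of the strange request (backslash followed by a bare LF).
bad = b"GET \\\nki/Random_article HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n"
good = b"GET /wiki/Random_article HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n"

if __name__ == "__main__":
    print(split_request_head(bad))
    print(looks_malformed(bad), looks_malformed(good))
```

A check like this could be used to pick such requests out of a capture; a server that detects the problem at this stage would normally answer 400 rather than 503.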
The first occurrence of such a request in our logs is 2013-11-25T12:03:45. Before that we had 0 (zero) such requests in the logs I checked, which cover all of November. Since then and until now we've had 83,010 such requests (about 1/3 of our total 5xx).
I've verified that those strange requests are coming directly to our frontends -- they are not passing through our SSL terminators or special proxies like Opera Mini. You can see e.g. a sample filtered pcap at fenari.wikimedia.org:~faidon/malformed-GET-20131126.pcap (this has private data, do not share). The packets' TCP checksum is obviously correct.
Those requests are always for en.wikipedia.org articles, no other languages or projects. They come from all user-agents & operating systems (so, probably not malware). They have all kinds of Referers, including internal links. About 3/4 are coming from Google, but this isn't irregular. Some of them have proper Cookies, including session tokens and such (so, probably not just spoofed UAs).
There are 83,010 requests, coming from 21,193 unique IPs in 121 different countries. The distribution by country is the most interesting part; the top 5 by unique IPs reads:
  18152 IN
    271 PH
    268 AE
    228 MY
    207 US
i.e. 85% come from India (but not from a particular ISP), in a >24h period.
The distribution of hits per datacenter is:
  78938 eqiad (incl. 72516 for India)
   4072 esams
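As a sanity check on the figures above, the shares can be recomputed with a few lines of Python. The numbers are copied from this mail; the variable names are only illustrative.

```python
# Figures quoted in this mail.
total_requests = 83010
unique_ips = 21193
top_ips = {"IN": 18152, "PH": 271, "AE": 268, "MY": 228, "US": 207}
hits = {"eqiad": 78938, "esams": 4072}
eqiad_india_hits = 72516

# Share of unique IPs that are in India (just over 85%).
india_ip_share = top_ips["IN"] / unique_ips
# Share of all malformed requests that hit eqiad.
eqiad_share = hits["eqiad"] / total_requests
# Share of the eqiad hits attributed to India.
india_share_of_eqiad = eqiad_india_hits / hits["eqiad"]

if __name__ == "__main__":
    # The per-datacenter counts should add up to the request total.
    print(sum(hits.values()) == total_requests)
    print(f"India share of unique IPs: {india_ip_share:.1%}")
    print(f"eqiad share of hits: {eqiad_share:.1%}")
    print(f"India share of eqiad hits: {india_share_of_eqiad:.1%}")
```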
I've been on this for some time and I'm currently out of ideas.
At this point, the only theory that I have is that some popular CPE device or, alternatively, a state surveillance device (e.g. BlueCoat) has gone haywire and is corrupting HTTP requests (paranoia about state surveillance was one of the reasons I kept digging). Some parts don't fit either theory: for state surveillance, the traffic is spread across both DCs & multiple countries; for CPEs, the requests are too targeted at enwiki.
Other thoughts? Am I missing something completely obvious?
Regards, Faidon