Hi,
We've been having spikes in our 5xx error logs since yesterday. There are definitely multiple distinct causes for those, incl. esams network issues, random people trying to DoS us, MediaWiki bugs that got backported yesterday etc.
One of the most peculiar causes of errors, though, is requests of this form:

  GET \
  ki/Random_article HTTP/1.1
  Host: en.wikipedia.org
  ...

That's GET, space, backslash, newline, ki/Random_article ("Random_article" being an example). This makes Varnish think the URL is "" and treat "ki/Random_article HTTP/1.1" as some random malformed header, so it responds with a 503 (and not a 400 -- that's a bug of its own).
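If you want to reproduce the 503 on a test frontend, a raw socket replay of that byte sequence looks roughly like this Python sketch (the hostname is a placeholder, obviously don't point it at production):

    # Sketch: send the malformed request as seen on the wire to a *test*
    # Varnish frontend and print whatever status line comes back.
    # "varnish-test.example.org" is a placeholder host, not production.
    import socket

    raw_request = (
        b"GET \\\n"                       # "GET", space, backslash, newline
        b"ki/Random_article HTTP/1.1\r\n"
        b"Host: en.wikipedia.org\r\n"
        b"Connection: close\r\n"
        b"\r\n"
    )

    with socket.create_connection(("varnish-test.example.org", 80), timeout=5) as s:
        s.sendall(raw_request)
        response = b""
        try:
            while chunk := s.recv(4096):
                response += chunk
        except socket.timeout:
            pass  # the server may leave the connection open; keep what we got

    # Expected first line, per the behaviour described above: "HTTP/1.1 503 ..."
    print(response.split(b"\r\n", 1)[0].decode(errors="replace"))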
The first occurrence of such a request in our logs is 2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our logs, for all of November that I checked. Since then and until now we've had 83,010 such requests (about 1/3 of our total 5xx).
I've verified that those strange requests are coming directly to our frontends -- they are not passing through our SSL terminators or special proxies like Opera Mini. You can see e.g. a sample filtered pcap at fenari.wikimedia.org:~faidon/malformed-GET-20131126.pcap (this has private data, do not share). The packets' TCP checksums are correct.
Those requests are always for en.wikipedia.org articles, no other languages or projects. They come from all user agents & operating systems (so, probably not malware). They have all kinds of Referers, including internal links. About 3/4 are coming from Google, but this isn't irregular. Some of them have proper Cookies, including session tokens and such (so, probably not just spoofed UAs).
There are 83,010 such requests, coming from 21,193 unique IPs in 121 different countries. The distribution by country is the most interesting part; the top 5 by unique IPs reads:

  18152 IN
    271 PH
    268 AE
    228 MY
    207 US

i.e. 85% of the IPs come from India (but not from a particular ISP), over a >24h period.
The distribution of hits per datacenter is:

  78938 eqiad (incl. 72516 for India)
   4072 esams
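For the curious, here's a rough sketch (not the exact thing I ran) of how one could tally unique client IPs per country from a 5xx TSV dump; the column index and the GeoIP database path are placeholders, not our actual log layout:

    # Sketch: count unique client IPs per country in a 5xx TSV dump.
    # IP_COLUMN and the .mmdb path are placeholders/assumptions.
    import csv
    from collections import Counter, defaultdict

    import geoip2.database  # pip install geoip2
    import geoip2.errors

    IP_COLUMN = 4  # placeholder: not necessarily the real 5xx.tsv layout
    reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-Country.mmdb")
    ips_per_country = defaultdict(set)

    with open("5xx.tsv") as f:
        for row in csv.reader(f, delimiter="\t"):
            ip = row[IP_COLUMN]
            try:
                country = reader.country(ip).country.iso_code or "??"
            except geoip2.errors.AddressNotFoundError:
                country = "??"
            ips_per_country[country].add(ip)
    reader.close()

    top5 = Counter({c: len(s) for c, s in ips_per_country.items()}).most_common(5)
    for country, count in top5:
        print(count, country)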
I've been on this for some time and I'm currently out of ideas.
At this point, the only theory I have is that some popular CPE device or, alternatively, some state surveillance device (e.g. BlueCoat) has gone haywire and is corrupting HTTP requests (paranoia about state surveillance was one of the reasons I kept digging). Some parts don't fit either theory: the traffic is distributed across both DCs & multiple countries, which doesn't fit state surveillance, and the requests are too targeted to enwiki for a CPE bug.
Other thoughts? Am I missing something completely obvious?
Regards, Faidon
On Tue, Nov 26, 2013 at 02:05:53PM -0500, Dan Andreescu wrote:
Other thoughts? Am I missing something completely obvious?
Could this be just a chain email that started in India, and has a malformed URL with a newline in it, like this:
http://google-junk-that-is-too-big-to-fit-on-one-line-for-the-author. ..?...target=http://en.wikipedia.org/wi ki/Random_article
No. A browser would never send a malformed HTTP request, no matter what you put in the URL bar. It could have been a browser bug, but we see this with multiple browsers, so that's not it.
So, to reiterate:
- Malformed HTTP from multiple UA strings -> not a browser bug
- TCP checksum right -> not a network-level corruption (routers/switches/LVS)
- Caught with tcpdump as it enters the server (see sketch below) -> not a Varnish bug, unlikely to be a kernel bug
- Happens on all Varnish caches even with different configuration -> not a kernel or firmware (e.g. NIC) heisenbug
- Only happens for Host: en.wikipedia.org -> suggests something smart enough to understand HTTP headers.
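For the tcpdump bullet, here's a rough scapy sketch for pulling the malformed request lines out of such a capture; the prefix check is an assumption about how the corruption looks on the wire, and the filename matches the sample pcap from my first mail:

    # Sketch: print TCP payloads in the sample capture that begin with the
    # corrupted "GET \" prefix, using scapy (pip install scapy).
    from scapy.all import rdpcap, Raw, TCP

    for pkt in rdpcap("malformed-GET-20131126.pcap"):
        if pkt.haslayer(TCP) and pkt.haslayer(Raw):
            payload = bytes(pkt[Raw].load)
            if payload.startswith(b"GET \\"):
                # show the first two request "lines" as a parser would see them
                print(payload.splitlines()[:2])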
So, it looks like it's something that sits between the browser and our caches, does deep packet inspection, understands HTTP and has a bug somewhere that corrupts requests. It could be malware[1], a buggy CPE, a content inspection/filtering appliance that updated itself, or a state surveillance system malfunction.
The impact is not huge but it's not so small either. It's just very... puzzling :)
Regards, Faidon
1: Malware seemed unlikely initially due to the diversity of User-Agent strings; however, further analysis showed all the non-Windows ones (even iPhone & Android) to be a tiny percentage of requests, so it's at the top of my list now.
FWIW, we had a botnet 2 years ago that picked Special:Random_article as its target. Other than that there is not much similarity.
It caused a doubling of all traffic to Wiktionaries and ~5% extra traffic overall. The botnet struck twice: once it hit the Portuguese Wiktionary only; a few months later it spread bogus requests over all Wiktionaries. Then it disappeared. Both peaks can still be seen at http://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthlyOriginal.htm
Erik
On Tue, Nov 26, 2013 at 07:44:02PM +0200, Faidon Liambotis wrote:
That's GET, space, backslash, newline, ki/Random_article ("Random_article" being an example). This makes Varnish think the URL is "" and treat "ki/Random_article HTTP/1.1" as some random malformed header, so it responds with a 503 (and not a 400 -- that's a bug of its own).
I debugged that last part, i.e. why we were emitting a 503 instead of a 400.
So, Varnish parses the request and makes an exactly equivalent one to the Apaches -- it doesn't choke on the input but is, amazingly, perfectly happy with it. In turn, Apache sees a URL that doesn't start with / and isn't *, and correctly returns an HTTP_BAD_REQUEST (400) internally (server/core.c, line 3560). However, since "HTTP/X.X" is missing from the first line, it interprets it as an HTTP/0.9 request and emits the Bad Request HTML directly, conversing in perfectly valid HTTP/0.9(!).
Only HTTP/0.9 didn't have response headers at all, so that protocol conversation is:

  -> GET /
  <- <!DOCTYPE html>
  <- <html lang=en>
  <- [...]
Varnish, a piece of software first released in 2006, does not support HTTP/0.9. Therefore, it gets confused by the lack of response headers and interprets this response as a backend error, ultimately throwing a 503.
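If you want to see the headerless exchange for yourself, something like the following Python sketch against a test Apache (placeholder hostname, not one of our backends) shows it: send a request line without an HTTP-Version and the very first bytes you get back are already the HTML body, with no status line or headers.

    # Sketch: talk HTTP/0.9 to a test Apache and show that the response has
    # no status line and no headers, which is what confuses Varnish.
    # "apache-test.example.org" is a placeholder, not one of our backends.
    import socket

    with socket.create_connection(("apache-test.example.org", 80), timeout=5) as s:
        s.sendall(b"GET /\r\n")   # no "HTTP/1.1" -> Apache treats it as HTTP/0.9
        response = b""
        while chunk := s.recv(4096):
            response += chunk

    # With HTTP/1.x you'd see "HTTP/1.1 200 OK" etc. first; with 0.9 the
    # response starts directly with the HTML body.
    print(response[:100].decode(errors="replace"))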
I checked the Apache source; there is unfortunately no way to raise the minimum supported protocol version to 1.0. I checked the Varnish source as well; the protocol detection seems to be a hardcoded sscanf(), but it shouldn't be too difficult to fix. Ideally, we should fix Varnish to drop HTTP/0.9 requests as malformed altogether, and obviously not add backend HTTP/0.9 support. Any takers? :)
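To make it concrete what "drop HTTP/0.9 requests as malformed" would mean at the parsing level, here's a toy request-line check in Python; it only illustrates the rule, it is not Varnish code (the real fix would be in Varnish's C parser):

    # Toy illustration of a strict request-line check: require
    # METHOD SP REQUEST-URI SP "HTTP/1.x" and nothing else.
    import re

    REQUEST_LINE = re.compile(rb"^[A-Z]+ \S+ HTTP/1\.[01]$")

    def is_valid_request_line(line: bytes) -> bool:
        return REQUEST_LINE.match(line) is not None

    assert is_valid_request_line(b"GET /wiki/Random_article HTTP/1.1")
    assert not is_valid_request_line(b"GET \\")                      # the corrupted first line
    assert not is_valid_request_line(b"ki/Random_article HTTP/1.1")  # no valid method
    assert not is_valid_request_line(b"GET /")                       # HTTP/0.9 style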
I also tested adding the following to our VCL:

  if (req.url == "") { error 400 "Bad request"; }

It works as intended, but since these wouldn't be logged 1:1 in our log infrastructure, I opted not to add this snippet.
The first occurrence of such a request in our logs is 2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our logs, for all of November that I checked. Since then and until now we've had 83,010 such requests (about 1/3 of our total 5xx).
It seems this is gradually going away:

  5xx.tsv.log-20131125.gz:0
  5xx.tsv.log-20131126.gz:26837
  5xx.tsv.log-20131127.gz:77202
  5xx.tsv.log-20131128.gz:70318
  5xx.tsv.log-20131129.gz:13118
  5xx.tsv.log-20131130.gz:890
Whatever the cause for this was, breaking access to Wikipedia was probably noticed and fixed by whoever introduced that bug in the first place.
Regards, Faidon