Hi,
We've been having spikes in our 5xx error logs since yesterday. There
are definitely multiple distinct causes for those, incl. esams network
issues, random people trying to DoS us, MediaWiki bugs that got
backported yesterday etc.
One of the most peculiar causes of errors, though, is requests of this
form:
GET \
ki/Random_article HTTP/1.1
Host: en.wikipedia.org
...
That's GET space backslash newline ki/Random_article ("Random_article"
being an example). This makes Varnish think the URL is "\" and
"ki/Random_article HTTP/1.1" some random malformed header and so it
responds with a 503 (and not a 400 -- that's a bug of its own).
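To illustrate the misparse, here is a minimal Python sketch of a naive line-based request parser. It is not Varnish's actual parser; the function name parse_request_head is made up for illustration, and I'm assuming the stray newline is a bare LF and that the rest of the request uses CRLF.

```python
def parse_request_head(raw: bytes):
    """Naive line-based parse of an HTTP request head.

    Illustration only: this is NOT how Varnish is implemented.
    """
    head = raw.split(b"\r\n\r\n", 1)[0]
    # Treat both CRLF and bare LF as line terminators (assumption).
    lines = head.replace(b"\r\n", b"\n").split(b"\n")

    # Request line: METHOD SP URL SP PROTOCOL
    parts = lines[0].split()
    method = parts[0].decode() if len(parts) > 0 else None
    url = parts[1].decode() if len(parts) > 1 else None
    proto = parts[2].decode() if len(parts) > 2 else None

    headers, malformed = {}, []
    for line in lines[1:]:
        if not line:
            continue
        name, sep, value = line.partition(b":")
        # A header needs a colon and a name without spaces.
        if sep and name.strip() and b" " not in name:
            headers[name.decode().strip().lower()] = value.decode().strip()
        else:
            malformed.append(line.decode())
    return method, url, proto, headers, malformed


# The request as described: GET, space, backslash, newline, rest of the URL.
sample = b"GET \\\nki/Random_article HTTP/1.1\r\nHost: en.wikipedia.org\r\n\r\n"
method, url, proto, headers, malformed = parse_request_head(sample)
print(method, repr(url), proto, headers, malformed)
```

With this input the request line ends at the stray newline, so the URL is a lone backslash, no protocol version is found, and the remainder lands in the list of malformed header lines.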
The first occurrence of such a request in our logs is
2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our
logs, for all of November that I checked. Since then and until now
we've had 83,010 such requests (about 1/3 of our total 5xx).
I've verified that those strange requests are coming directly to our frontends --
they are not passing through our SSL terminators or special proxies like
Opera Mini. You can see e.g. a sample filtered pcap at
fenari.wikimedia.org:~faidon/malformed-GET-20131126.pcap (this has
private data, do not share). The packets' TCP checksum is obviously
correct.
Those requests are always for en.wikipedia.org articles, no other
languages or projects. They come from all user-agents & operating
systems (so, probably not malware). They have all kinds of Referers,
including internal links. About 3/4 are coming from Google, but this
isn't irregular. Some of them have proper Cookies, including session
tokens and such (so, probably not just spoofed UAs).
There are 83,010 such requests, coming from 21,193 unique IPs in 121
different countries. The distribution by country is the most interesting
part; the top 5 by unique IPs reads:
18152 IN
271 PH
268 AE
228 MY
207 US
i.e. 85% of the unique IPs are in India (but not at a particular ISP),
over a >24h period.
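For anyone who wants to reproduce this kind of breakdown, a minimal Python sketch of the tally follows. It assumes the log lines have already been reduced to (client IP, country code) pairs, e.g. via a GeoIP lookup; the function name and the sample records are hypothetical and use documentation-only addresses, not real data.

```python
from collections import defaultdict


def unique_ips_by_country(records):
    """Count distinct client IPs per country.

    `records` is an iterable of (ip, country_code) pairs (assumed input
    shape). Returns (count, country) tuples, largest count first.
    """
    seen = defaultdict(set)
    for ip, country in records:
        seen[country].add(ip)
    return sorted(
        ((len(ips), cc) for cc, ips in seen.items()),
        key=lambda t: (-t[0], t[1]),
    )


# Toy input only: repeated hits from one IP count once.
toy = [
    ("192.0.2.1", "IN"),
    ("192.0.2.1", "IN"),
    ("192.0.2.2", "IN"),
    ("198.51.100.7", "US"),
]
for count, cc in unique_ips_by_country(toy):
    print(count, cc)
```

Counting distinct IPs rather than hits keeps a single noisy client from dominating the per-country picture.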
The distribution of hits per datacenter is:
78938 eqiad (incl. 72516 for India)
4072 esams
I've been on this for some time and I'm currently out of ideas.
At this point, the only theory that I have is that some popular CPE
device or, alternatively, a state surveillance device (e.g. BlueCoat) has
gone haywire and is corrupting HTTP requests (paranoia about state
surveillance was one of the reasons I kept digging). Some parts don't
fit either theory (traffic is distributed across both DCs & multiple
countries for state surveillance; requests are too targeted to enwiki
for CPEs).
Other thoughts? Am I missing something completely obvious?
Regards,
Faidon