Hi,
We've been having spikes in our 5xx error logs since yesterday. There are definitely multiple distinct causes for those, incl. esams network issues, random people trying to DoS us, MediaWiki bugs that got backported yesterday etc.
One of the most peculiar causes of errors, though, is requests of this form:

  GET \
  ki/Random_article HTTP/1.1
  Host: en.wikipedia.org
  ...

That's GET, space, backslash, newline, ki/Random_article ("Random_article" being an example). This makes Varnish think the URL is "" and treat "ki/Random_article HTTP/1.1" as some random malformed header, so it responds with a 503 (and not a 400 -- that's a bug of its own).
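If you want to reproduce the 503 on a test frontend, a raw socket replay of that byte sequence looks roughly like this Python sketch (the hostname is a placeholder, obviously don't point it at production):

    # Sketch: send the malformed request as seen on the wire to a *test*
    # Varnish frontend and print whatever status line comes back.
    # "varnish-test.example.org" is a placeholder host, not production.
    import socket

    raw_request = (
        b"GET \\\n"                       # "GET", space, backslash, newline
        b"ki/Random_article HTTP/1.1\r\n"
        b"Host: en.wikipedia.org\r\n"
        b"Connection: close\r\n"
        b"\r\n"
    )

    with socket.create_connection(("varnish-test.example.org", 80), timeout=5) as s:
        s.sendall(raw_request)
        response = b""
        try:
            while chunk := s.recv(4096):
                response += chunk
        except socket.timeout:
            pass  # the server may leave the connection open; keep what we got

    # Expected first line, per the behaviour described above: "HTTP/1.1 503 ..."
    print(response.split(b"\r\n", 1)[0].decode(errors="replace"))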
The first occurrence of such a request in our logs is 2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our logs, for all of November that I checked. Since then and until now we've had 83,010 such requests (about 1/3 of our total 5xx).
I've verified that those strange requests are coming directly to our frontends -- they are not passing through our SSL terminators or special proxies like Opera Mini. You can see e.g. a sample filtered pcap at fenari.wikimedia.org:~faidon/malformed-GET-20131126.pcap (this has private data, do not share). The packets' TCP checksums are correct.
Those requests are always for en.wikipedia.org articles, no other languages or projects. They come from all user agents & operating systems (so, probably not malware). They have all kinds of Referers, including internal links. About 3/4 are coming from Google, but this isn't irregular. Some of them have proper Cookies, including session tokens and such (so, probably not just spoofed UAs).
There are 83,010 such requests, coming from 21,193 unique IPs in 121 different countries. The distribution by country is the most interesting part; the top 5 by unique IPs reads:

  18152 IN
    271 PH
    268 AE
    228 MY
    207 US

i.e. 85% of the IPs come from India (but not from a particular ISP), over a >24h period.
The distribution of hits per datacenter is:

  78938 eqiad (incl. 72516 for India)
   4072 esams
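For the curious, here's a rough sketch (not the exact thing I ran) of how one could tally unique client IPs per country from a 5xx TSV dump; the column index and the GeoIP database path are placeholders, not our actual log layout:

    # Sketch: count unique client IPs per country in a 5xx TSV dump.
    # IP_COLUMN and the .mmdb path are placeholders/assumptions.
    import csv
    from collections import Counter, defaultdict

    import geoip2.database  # pip install geoip2
    import geoip2.errors

    IP_COLUMN = 4  # placeholder: not necessarily the real 5xx.tsv layout
    reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-Country.mmdb")
    ips_per_country = defaultdict(set)

    with open("5xx.tsv") as f:
        for row in csv.reader(f, delimiter="\t"):
            ip = row[IP_COLUMN]
            try:
                country = reader.country(ip).country.iso_code or "??"
            except geoip2.errors.AddressNotFoundError:
                country = "??"
            ips_per_country[country].add(ip)
    reader.close()

    top5 = Counter({c: len(s) for c, s in ips_per_country.items()}).most_common(5)
    for country, count in top5:
        print(count, country)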
I've been on this for some time and I'm currently out of ideas.
At this point, the only theory I have is that some popular CPE device or, alternatively, some state surveillance device (e.g. BlueCoat) has gone haywire and is corrupting HTTP requests (paranoia about state surveillance was one of the reasons I kept digging). Some parts don't fit either theory: the traffic is distributed across both DCs & multiple countries, which doesn't fit state surveillance, and the requests are too targeted to enwiki for a CPE bug.
Other thoughts? Am I missing something completely obvious?
Regards, Faidon
On Tue, Nov 26, 2013 at 02:05:53PM -0500, Dan Andreescu wrote:
Other thoughts? Am I missing something completely obvious?
Could this be just a chain email that started in India, and has a malformed URL with a newline in it, like this:
http://google-junk-that-is-too-big-to-fit-on-one-line-for-the-author. ..?...target=http://en.wikipedia.org/wi ki/Random_article
No. A browser would never send a malformed HTTP request, no matter what you put in the URL bar. It could have been a browser bug, but we see this with multiple browsers, so that's not it.
So, to reiterate:
- Malformed HTTP from multiple UA strings -> not a browser bug
- TCP checksum right -> not a network-level corruption (routers/switches/LVS)
- Caught with tcpdump as it enters the server (see sketch below) -> not a Varnish bug, unlikely to be a kernel bug
- Happens on all Varnish caches even with different configuration -> not a kernel or firmware (e.g. NIC) heisenbug
- Only happens for Host: en.wikipedia.org -> suggests something smart enough to understand HTTP headers.
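For the tcpdump bullet, here's a rough scapy sketch for pulling the malformed request lines out of such a capture; the prefix check is an assumption about how the corruption looks on the wire, and the filename matches the sample pcap from my first mail:

    # Sketch: print TCP payloads in the sample capture that begin with the
    # corrupted "GET \" prefix, using scapy (pip install scapy).
    from scapy.all import rdpcap, Raw, TCP

    for pkt in rdpcap("malformed-GET-20131126.pcap"):
        if pkt.haslayer(TCP) and pkt.haslayer(Raw):
            payload = bytes(pkt[Raw].load)
            if payload.startswith(b"GET \\"):
                # show the first two request "lines" as a parser would see them
                print(payload.splitlines()[:2])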
So, it looks like it's something that sits between the browser and our caches, does deep packet inspection, understands HTTP and has a bug somewhere that corrupts requests. It could be malware[1], a buggy CPE, a content inspection/filtering appliance that updated itself, or a state surveillance system malfunction.
The impact is not huge but it's not so small either. It's just very... puzzling :)
Regards, Faidon
1: Malware seemed unlikely initially due to the diversity of User-Agent strings; however, further analysis showed all the non-Windows ones (even iPhone & Android) to be a tiny percentage of requests, so it's at the top of my list now.
FWIW, we had a botnet 2 years ago that picked Special:Random_article as its target. Other than that there is not much similarity.
It caused a doubling of all traffic to Wiktionaries and ~5% extra traffic overall. The botnet struck twice: once it hit the Portuguese Wiktionary only; a few months later it spread bogus requests over all Wiktionaries. Then it disappeared. Both peaks can still be seen at http://stats.wikimedia.org/wiktionary/EN/TablesPageViewsMonthlyOriginal.htm
Erik
On Tue, Nov 26, 2013 at 07:44:02PM +0200, Faidon Liambotis wrote:
That's GET, space, backslash, newline, ki/Random_article ("Random_article" being an example). This makes Varnish think the URL is "" and treat "ki/Random_article HTTP/1.1" as some random malformed header, so it responds with a 503 (and not a 400 -- that's a bug of its own).
I debugged that last part, i.e. why we were emitting a 503 instead of a 400.
So, Varnish parses the request and makes an exactly equivalent one to the Apaches -- it doesn't choke on the input but is, amazingly, perfectly happy with it. In turn, Apache sees a URL that doesn't start with / and isn't *, and correctly returns an HTTP_BAD_REQUEST (400) internally (server/core.c, line 3560). However, since "HTTP/X.X" is missing from the first line, it interprets it as an HTTP/0.9 request and emits the Bad Request HTML directly, conversing in perfectly valid HTTP/0.9(!).
Only HTTP/0.9 didn't have response headers at all, so that protocol conversation is:

  -> GET /
  <- <!DOCTYPE html>
  <- <html lang=en>
  <- [...]
Varnish, a piece of software first released in 2006, does not support HTTP/0.9. Therefore, it gets confused by the lack of response headers and interprets this response as a backend error, ultimately throwing a 503.
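If you want to see the headerless exchange for yourself, something like the following Python sketch against a test Apache (placeholder hostname, not one of our backends) shows it: send a request line without an HTTP-Version and the very first bytes you get back are already the HTML body, with no status line or headers.

    # Sketch: talk HTTP/0.9 to a test Apache and show that the response has
    # no status line and no headers, which is what confuses Varnish.
    # "apache-test.example.org" is a placeholder, not one of our backends.
    import socket

    with socket.create_connection(("apache-test.example.org", 80), timeout=5) as s:
        s.sendall(b"GET /\r\n")   # no "HTTP/1.1" -> Apache treats it as HTTP/0.9
        response = b""
        while chunk := s.recv(4096):
            response += chunk

    # With HTTP/1.x you'd see "HTTP/1.1 200 OK" etc. first; with 0.9 the
    # response starts directly with the HTML body.
    print(response[:100].decode(errors="replace"))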
I checked the Apache source; there is unfortunately no way to raise the minimum supported protocol version to 1.0. I checked the Varnish source as well; the protocol detection seems to be a hardcoded sscanf(), but it shouldn't be too difficult to fix. Ideally, we should fix Varnish to drop HTTP/0.9 requests as malformed altogether, and obviously not add backend HTTP/0.9 support. Any takers? :)
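To make it concrete what "drop HTTP/0.9 requests as malformed" would mean at the parsing level, here's a toy request-line check in Python; it only illustrates the rule, it is not Varnish code (the real fix would be in Varnish's C parser):

    # Toy illustration of a strict request-line check: require
    # METHOD SP REQUEST-URI SP "HTTP/1.x" and nothing else.
    import re

    REQUEST_LINE = re.compile(rb"^[A-Z]+ \S+ HTTP/1\.[01]$")

    def is_valid_request_line(line: bytes) -> bool:
        return REQUEST_LINE.match(line) is not None

    assert is_valid_request_line(b"GET /wiki/Random_article HTTP/1.1")
    assert not is_valid_request_line(b"GET \\")                      # the corrupted first line
    assert not is_valid_request_line(b"ki/Random_article HTTP/1.1")  # no valid method
    assert not is_valid_request_line(b"GET /")                       # HTTP/0.9 style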
I also tested adding the following to our VCL:

  if (req.url == "") { error 400 "Bad request"; }

It works as intended, but since these wouldn't be logged 1:1 in our log infrastructure, I opted not to add this snippet.
The first occurrence of such a request in our logs is 2013-11-25T12:03:45. Before that we had 0 (zero) such requests in our logs, for all of November that I checked. Since then and until now we've had 83,010 such requests (about 1/3 of our total 5xx).
It seems this is gradually going away:

  5xx.tsv.log-20131125.gz:0
  5xx.tsv.log-20131126.gz:26837
  5xx.tsv.log-20131127.gz:77202
  5xx.tsv.log-20131128.gz:70318
  5xx.tsv.log-20131129.gz:13118
  5xx.tsv.log-20131130.gz:890
Whatever the cause for this was, breaking access to Wikipedia was probably noticed and fixed by whoever introduced that bug in the first place.
Regards, Faidon