On 06/05/2014 18:56, Rob Lanphier wrote:
Yup, thanks folks!
This was probably enough of an outage that we should write up a brief
postmortem for this. Who feels like they understand the situation
well enough to do this?
Ccing Faidon who provided the fix.
This is rather long and is public since the mediawiki-core list is publicly
available. We might want a shorter postmortem on the wiki.
Too long, didn't read: our MediaWiki configuration did not trust the
X-Forwarded-For headers sent by the media servers and ended up throttling
them instead of the end users.
== Timeline of events ==
On Saturday May 3rd at 15:00 UTC, community member Yann Forget reopened [bug
64622] (although unrelated), mentioning that thumbnail rendering fails
with the error message:
"Error generating thumbnail
As an anti-spam measure, you are limited from performing this action too
many times in a short space of time, and you have exceeded this limit.
Please try again in a few minutes."
Nemo Bis confirmed a few hours later that the error is widespread and
impacts Commons as well. That is not surprising since the thumbnail
rendering infrastructure is shared among wiki projects.
On Sunday May 4th, Nemo mentions that some rate limiting configuration has
been added for the "renderfile-nonstandard" throttle and deployed on
April 21st:
https://gerrit.wikimedia.org/r/#/c/127643/
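For context, such a throttle is configured through MediaWiki's $wgRateLimits
setting. A minimal sketch of what an entry looks like (the actual limits are
in the gerrit change above; the numbers and group keys here are illustrative):

// Sketch of a $wgRateLimits entry; values are illustrative only,
// the real ones live in the gerrit change linked above.
$wgRateLimits['renderfile-nonstandard'] = array(
    // at most N hits per M seconds, keyed on the client IP address
    'ip' => array( 70, 30 ),
    // same limit for logged-in users, keyed per user
    'user' => array( 70, 30 ),
);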
On Tuesday May 5th at 3:00 UTC, Greg sends an email to the MediaWiki-core list
escalating the bug report.
Around 8:00 UTC I read the email and do some preliminary investigation.
Since Nemo mentioned the 'renderfile-nonstandard' throttle, I grep our
limiter.log file to find out what is hitting the limit:
$ fgrep renderfile limiter.log |cut -d\: -f4|sort|uniq -c|sort -n
378 10.64.0.168 tripped! mediawiki
405 10.64.0.167 tripped! mediawiki
476 10.64.32.92 tripped! mediawiki
498 10.64.16.150 tripped! mediawiki
$
The IPs are the media server frontends ms-fe1001 to ms-fe1004, which relay
thumbnail generation requests on behalf of users. We want to throttle
per end user client IP or username, NOT per proxy.
My assumption is that the media servers do not relay the end user's
X-Forwarded-For header, which contains the end user's IP.
I look at the Swift proxy rewrite.py script and find out that
X-Forwarded-For seems to be handled in there. I ping Faidon and he
immediately confirms that the X-Forwarded-For header is properly handled
by the Swift proxies via rewrite.py and suggests that the media servers are
not listed in the MediaWiki configuration as trusted proxies.
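In other words, the decision is roughly the following (a simplified sketch,
not the actual MediaWiki code, and assuming the trusted list is
$wgSquidServersNoPurge as in our wmf-config):

// Simplified sketch: the IP that pingLimiter() ends up keying on is only
// taken from X-Forwarded-For when the connecting address is a trusted proxy.
$remoteAddr = $_SERVER['REMOTE_ADDR'];                 // here: an ms-fe100x frontend
$xff = isset( $_SERVER['HTTP_X_FORWARDED_FOR'] )
    ? $_SERVER['HTTP_X_FORWARDED_FOR'] : '';           // here: the end user IP

if ( $xff !== '' && in_array( $remoteAddr, $wgSquidServersNoPurge ) ) {
    // Trusted proxy: use the forwarded address (the real code walks the
    // whole XFF chain rather than blindly taking the first entry).
    $forwarded = explode( ',', $xff );
    $clientIp = trim( $forwarded[0] );
} else {
    // Untrusted: fall back to the connecting address, i.e. the proxy itself.
    $clientIp = $remoteAddr;
}
// Before the fix the ms-fe100x hosts were not in the trusted list, so all
// thumbnail requests shared the same four $clientIp buckets and tripped the
// renderfile-nonstandard limit almost immediately.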
On May 5th at 9:00 UTC, we add the media servers to the list of trusted proxies:
https://gerrit.wikimedia.org/r/#/c/131670/2/wmf-config/squid.php,unified
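The change boils down to adding the frontend IPs to that trusted list,
roughly along these lines (see the gerrit change for the actual entries and
variable used):

// Rough shape of the fix in wmf-config/squid.php; IPs taken from the
// limiter.log excerpt above, actual entries are in the gerrit change.
$wgSquidServersNoPurge[] = '10.64.0.167';
$wgSquidServersNoPurge[] = '10.64.0.168';
$wgSquidServersNoPurge[] = '10.64.16.150';
$wgSquidServersNoPurge[] = '10.64.32.92';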
Additionally, we add a comment in the Puppet configuration to remind folks to
update the trusted proxy list whenever a new proxy is added:
https://gerrit.wikimedia.org/r/#/c/131671/1/manifests/site.pp,unified
The media servers then disappear from limiter.log.
== How it happened ==
The first occurrence of a media server being tripped by
renderfile-nonstandard occurs on 2014-04-29 18:04:05
Nemo detected that the throttle configuration had been added to
mediawiki-config on April 21st and deployed that same day:
https://gerrit.wikimedia.org/r/#/c/127643/
The throttle was merged in the mediawiki/core master branch on April
21st as well:
https://gerrit.wikimedia.org/r/126052 (sha1: 4f2324fc)
The MediaWiki change landed in 1.24wmf2, as shown by:
$ git branch --remote --contains 4f2324fc
gerrit/master
gerrit/wmf/1.24wmf2
gerrit/wmf/1.24wmf3
1.24wmf2 was deployed, carrying the change, on April 30th at 18:03:
18:03 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized
wikiversions files: Non wikipedias to 1.24wmf2
"Instantly" causing the media server to be throttled.
== What can be improved ==
The throttling is handled by User->pingLimiter(), which has a wfProfileIn()
profiling call at its entry point. Hence it generates a count metric in
Graphite as MediaWiki.User.pingLimiter.count
I have generated a graph from January 1st 2014 till today and uploaded
it on wikitech:
https://wikitech.wikimedia.org/wiki/File:Wmfcluster-user_pinglimiter-201401…
It shows an increase on March 26th, which is unrelated, and a second
increase on April 30th.
1) We might want to generate finer metrics by adding the pingLimiter()
action to the wfProfileIn() call (a sketch follows after this list).
2) A graph in gdash and a monitoring alarm could be added to flag whenever
the rate changes significantly.
3) It took us too long (3 days) to get informed about the outage, though
as soon as the proper folks were made aware of it, it got promptly
solved.
4) The misconfiguration was hard to spot until we actually got hit by it :-/
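For 1), a minimal sketch of what that could look like inside
User::pingLimiter() (hypothetical, not an actual patch; the resulting metric
name is a guess based on how the current counter is produced):

// Hypothetical sketch: including the action in the profiled section name
// would give one Graphite counter per throttle, e.g. something like
// MediaWiki.User.pingLimiter-renderfile-nonstandard.count instead of the
// single aggregated MediaWiki.User.pingLimiter.count.
public function pingLimiter( $action = 'edit' ) {
    wfProfileIn( __METHOD__ . '-' . $action );

    $triggered = false;
    // ... existing throttling logic, setting $triggered when a limit is hit ...

    wfProfileOut( __METHOD__ . '-' . $action );
    return $triggered;
}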
== References ==
[bug 64622]
https://bugzilla.wikimedia.org/64622
Trusting XFF from media servers:
https://gerrit.wikimedia.org/r/131670
--
Antoine "hashar" Musso