https://bugzilla.wikimedia.org/show_bug.cgi?id=64622
Error generating thumbnail: As an anti-spam measure, you are limited from performing this action too many times
Multimedia Team also knows about it (and I pinged them about it yesterday), but there hasn't been any movement.
Anyone want to get the ball rolling for them? Please.
Greg
Le 06/05/2014 05:16, Greg Grossmeier a écrit :
https://bugzilla.wikimedia.org/show_bug.cgi?id=64622
Error generating thumbnail: As an anti-spam measure, you are limited from performing this action too many times
Multimedia Team also knows about it (and I pinged them about it yesterday), but there hasn't been any movement.
Anyone want to get the ball rolling for them? Please.
Hello,
By looking at the udp2log limiter.log file, the renderfile-nonstandard limit is reached by:
$ fgrep renderfile limiter.log |cut -d: -f4|sort|uniq -c|sort -n
    378 10.64.0.168 tripped! mediawiki
    405 10.64.0.167 tripped! mediawiki
    476 10.64.32.92 tripped! mediawiki
    498 10.64.16.150 tripped! mediawiki
$
They are the media server frontends ms-fe1001 to ms-fe1004. We probably want to restrict the end user IP instead.
I suspect the media servers are not properly passing the X-Forwarded-For header down to the thumbnail renderer. The logic seems to live in operations/puppet.git, file ./files/swift/SwiftMedia/wmf/rewrite.py
Would need someone with more information about Swift/Thumb handling than me :-(
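For context, a relaying proxy is expected to append the IP it accepted the connection from to any incoming X-Forwarded-For header before passing the request downstream. A minimal sketch of that convention (the helper below is illustrative, not the actual rewrite.py code):

```python
def forwarded_for(incoming_xff, client_ip):
    """Build the X-Forwarded-For value a proxy should send downstream.

    The proxy appends the address it saw the request come from,
    preserving any chain already built by earlier proxies.
    """
    if incoming_xff:
        return incoming_xff + ", " + client_ip
    return client_ip

# A request arriving directly from a user:
forwarded_for(None, "198.51.100.7")           # "198.51.100.7"
# A request already relayed once (proxy IP is one of the ms-fe boxes):
forwarded_for("198.51.100.7", "10.64.0.168")  # "198.51.100.7, 10.64.0.168"
```

If a hop in the chain drops or fails to extend this header, everything downstream only ever sees that hop's own IP.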
Le 06/05/2014 10:09, Antoine Musso a écrit :
Would need someone with more information about Swift/Thumb handling than me :-(
Fixed up by Faidon by adding the ms-fe box to the list of trusted XFF proxies: https://gerrit.wikimedia.org/r/131670
<quote name="Antoine Musso" date="2014-05-06" time="11:03:15 +0200">
Le 06/05/2014 10:09, Antoine Musso a écrit :
Would need someone with more information about Swift/Thumb handling than me :-(
Fixed up by Faidon by adding the ms-fe box to the list of trusted XFF proxies: https://gerrit.wikimedia.org/r/131670
Thanks a ton Antoine (and Faidon)!
Greg
On Tue, May 6, 2014 at 7:56 AM, Greg Grossmeier greg@wikimedia.org wrote:
<quote name="Antoine Musso" date="2014-05-06" time="11:03:15 +0200">
> Le 06/05/2014 10:09, Antoine Musso a écrit :
> > Would need someone with more information about Swift/Thumb handling
> > than me :-(
>
> Fixed up by Faidon by adding the ms-fe box to the list of trusted XFF
> proxies: https://gerrit.wikimedia.org/r/131670
Thanks a ton Antoine (and Faidon)!
Yup, thanks folks!
This was probably enough of an outage that we should write up a brief postmortem for this. Who feels like they understand the situation well enough to do this?
Rob
Le 06/05/2014 18:56, Rob Lanphier a écrit :
Yup, thanks folks!
This was probably enough of an outage that we should write up a brief postmortem for this. Who feels like they understand the situation well enough to do this?
Ccing Faidon who provided the fix.
This is rather long and is public since mediawiki-core list is publicly available. We might want a shorter postmortem on the wiki.
Too long; didn't read: our MediaWiki configuration did not trust the X-Forwarded-For headers sent by the media servers and ended up throttling them instead of the end users.
== Timeline of events ==
Saturday May 3rd 15:00 UTC: community member Yann Forget reopened [bug 64622] (originally about an unrelated issue), mentioning that thumbnail rendering fails with the error message:
"Error generating thumbnail
As an anti-spam measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes."
Nemo Bis confirmed a few hours later that the error was widespread and impacted Commons as well. That is not surprising, since the thumbnail rendering infrastructure is shared among wiki projects.
On Sunday May 4th, Nemo mentioned that some rate-limiting configuration had been added for the "renderfile-nonstandard" throttle and deployed on April 21st ( https://gerrit.wikimedia.org/r/#/c/127643/ )
Tuesday May 6th 3:00 UTC: Greg sends an email to the mediawiki-core list escalating the bug report.
Around 8:00 UTC I read the email and do some preliminary investigation. Since Nemo mentioned the 'renderfile-nonstandard' throttle I grep our limiter.log file to find out what is hitting the limit:
$ fgrep renderfile limiter.log |cut -d: -f4|sort|uniq -c|sort -n
    378 10.64.0.168 tripped! mediawiki
    405 10.64.0.167 tripped! mediawiki
    476 10.64.32.92 tripped! mediawiki
    498 10.64.16.150 tripped! mediawiki
$
The IPs are the media server frontends ms-fe1001 to ms-fe1004, which relay thumbnail generation requests on behalf of users. We want to throttle per end-user client IP or username, NOT the proxies.
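To illustrate why that matters, here is a minimal fixed-window limiter keyed by IP (a sketch for illustration, not MediaWiki's actual pingLimiter implementation): when four proxies front all traffic, every request counts against one of only four keys, so the shared keys trip the limit almost immediately.

```python
import time
from collections import defaultdict

class PingLimiter:
    """Sliding-window rate limiter: at most `limit` hits per `window` seconds per key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(list)  # key -> timestamps of recent hits

    def tripped(self, key, now=None):
        """Record a hit for `key` and report whether it exceeds the limit."""
        now = time.time() if now is None else now
        recent = [t for t in self.hits[key] if now - t < self.window]
        recent.append(now)
        self.hits[key] = recent
        return len(recent) > self.limit

# If the limiter sees the proxy's IP instead of the end user's, thousands
# of distinct users share four keys and the limit trips immediately:
limiter = PingLimiter(limit=70, window=30)
proxy_ip = "10.64.0.168"
results = [limiter.tripped(proxy_ip, now=i * 0.1) for i in range(100)]
# the first 70 hits pass; every hit after that is throttled
```

The limit and window values above are made up; the point is only that the key must identify the end user, not the relay.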
My assumption is that the media servers do not relay the end user's X-Forwarded-For header, which contains the end-user IP.
I look at the Swift proxy rewrite.py script and find that X-Forwarded-For seems to be handled in there. I ping Faidon and he immediately confirms that the X-Forwarded-For header is properly handled by the Swift proxies via rewrite.py, and suggests that the media servers are not listed in the MediaWiki configuration as trusted proxies.
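Faidon's suggestion makes sense once you see how a server typically derives the client IP from X-Forwarded-For: it walks the chain from the nearest hop backwards, skipping every address on its trusted-proxy list. If the ms-fe boxes are not on that list, the walk stops at the proxy's own IP. A sketch with made-up addresses (MediaWiki's real logic and proxy list live in wmf-config/squid.php, not in this helper):

```python
def client_ip(remote_addr, xff, trusted):
    """Resolve the end-user IP from the connection peer and the XFF chain.

    Walk the chain from the nearest hop backwards, skipping every
    address we trust as a proxy; the first untrusted hop is the client.
    """
    chain = ([ip.strip() for ip in xff.split(",")] if xff else []) + [remote_addr]
    for ip in reversed(chain):
        if ip not in trusted:
            return ip
    return chain[0]  # everything was trusted: fall back to the leftmost entry

trusted = {"10.64.0.167", "10.64.0.168", "10.64.32.92", "10.64.16.150"}

# With the media servers trusted, the real user IP is recovered:
client_ip("10.64.0.168", "198.51.100.7", trusted)  # "198.51.100.7"
# Without them in the list, every request resolves to the proxy itself:
client_ip("10.64.0.168", "198.51.100.7", set())    # "10.64.0.168"
```

Walking from the right and requiring each skipped hop to be trusted is what keeps end users from spoofing arbitrary IPs by sending their own X-Forwarded-For header.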
On May 6th 9:00 UTC we add the media servers to the list of trusted proxies: https://gerrit.wikimedia.org/r/#/c/131670/2/wmf-config/squid.php,unified
Additionally, we add a comment to the Puppet configuration to remind folks to update the trusted proxy list whenever a new proxy is added:
https://gerrit.wikimedia.org/r/#/c/131671/1/manifests/site.pp,unified
The media servers disappear from limiter.log.
== How it happened ==
The first occurrence of a media server tripping renderfile-nonstandard was at 2014-04-29 18:04:05.
Nemo found that the throttle configuration was added to mediawiki-config on April 21st and deployed that day: https://gerrit.wikimedia.org/r/#/c/127643/
The throttle was merged into the mediawiki/core master branch on April 21st as well: https://gerrit.wikimedia.org/r/126052 (sha1: 4f2324fc)
The MediaWiki change landed in 1.24wmf2:
$ git branch --remote --contains 4f2324fc
  gerrit/master
  gerrit/wmf/1.24wmf2
  gerrit/wmf/1.24wmf3
1.24wmf2 was deployed on April 30th at 18:03:
18:03 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf2
"Instantly" causing the media server to be throttled.
== What can be improved ==
The throttling is handled by User->pingLimiter(), which has a profiling call at its entry point. Hence it generates a count metric in Graphite as MediaWiki.User.pingLimiter.count
I have generated a graph from January 1st 2014 until today and uploaded it to wikitech:
https://wikitech.wikimedia.org/wiki/File:Wmfcluster-user_pinglimiter-2014010...
It shows an increase on March 26th, which is unrelated, and a second increase on April 30th.
1) we might want to generate finer metrics by adding the pingLimiter() action to the wfProfileIn() call.
2) a graph in gdash and a monitoring alarm could be added to fire whenever the rate changes significantly.
3) it took us too long (3 days) to be informed about the outage, though as soon as the proper folks were made aware of it, it was promptly solved.
4) the misconfiguration was hard to spot until we actually got hit by it :-/
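For improvement 1), the idea is to include the throttle action in the emitted metric key, so each action gets its own Graphite series instead of everything landing in one counter. A sketch of the key construction (the per-action naming scheme is hypothetical, not what MediaWiki emits today):

```python
def ping_limiter_metric(action=None):
    """Build the Graphite key for a pingLimiter hit.

    Without an action, everything lands in one aggregate counter;
    including the action (e.g. 'renderfile-nonstandard') yields one
    series per throttle, which is what a gdash graph or alarm needs.
    """
    base = "MediaWiki.User.pingLimiter"
    if action:
        return "%s.%s.count" % (base, action)
    return base + ".count"

ping_limiter_metric()
# "MediaWiki.User.pingLimiter.count"
ping_limiter_metric("renderfile-nonstandard")
# "MediaWiki.User.pingLimiter.renderfile-nonstandard.count"
```

With per-action series in place, improvement 2) reduces to alarming on a significant rate change of a single series.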
== References ==
[bug 64622] https://bugzilla.wikimedia.org/64622
Trusting XFF from media servers: https://gerrit.wikimedia.org/r/131670
<quote name="Antoine Musso" date="2014-05-07" time="11:17:48 +0200">
Le 06/05/2014 18:56, Rob Lanphier a écrit :
Yup, thanks folks!
This was probably enough of an outage that we should write up a brief postmortem for this. Who feels like they understand the situation well enough to do this?
Ccing Faidon who provided the fix.
This is rather long and is public since mediawiki-core list is publicly available. We might want a shorter postmortem on the wiki.
Decided to put the whole thing on-wiki: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140503-Thumbnai...
Can someone take a stab at creating bugs/RT tickets for the things in "What can be improved" section? I'll try to later but either Antoine or Faidon might be able to give better context.
Le 07/05/2014 17:52, Greg Grossmeier a écrit :
<quote name="Antoine Musso" date="2014-05-07" time="11:17:48 +0200">
> Le 06/05/2014 18:56, Rob Lanphier a écrit :
>> Yup, thanks folks!
>>
>> This was probably enough of an outage that we should write up a brief
>> postmortem for this. Who feels like they understand the situation
>> well enough to do this?
>
> Ccing Faidon who provided the fix.
>
> This is rather long and is public since mediawiki-core list is publicly
> available. We might want a shorter postmortem on the wiki.
Decided to put the whole thing on-wiki: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140503-Thumbnai...
Can someone take a stab at creating bugs/RT tickets for the things in "What can be improved" section? I'll try to later but either Antoine or Faidon might be able to give better context.
Hello,
I have filed two bugs in Bugzilla:
Bug 65477 - User::pingLimiter should have per action profiling https://bugzilla.wikimedia.org/show_bug.cgi?id=65477
Trivially solved by https://gerrit.wikimedia.org/r/134067
Bug 65478 - Graph User::pingLimiter() actions in gdash https://bugzilla.wikimedia.org/show_bug.cgi?id=65478
I have no real clue how to set up a gdash dashboard :-(
I still have to figure out how to emit alarms when a specific action is being throttled beyond its normal rate.
cheers,