Everyone, I apologize for the bug.
I'll look for ways to guard better against this risk in the future, which will be important as we look to expand coverage of Wikipedia Zero to sister projects and the desktop form factor.
Thanks to everyone for resolving the issue so quickly. You guys rule.
And Roan, thanks for not flipping over my desk, despite the bug making RL go haywire on Wikidata AND holding up your lightning deployment. It's true - you are a gentleman and a scholar.
-Adam
On Wed, Oct 30, 2013 at 5:57 PM, Yuri Astrakhan <yastrakhan@wikimedia.org> wrote:
== Background ==
ZeroRatedMobileAccess has always depended on MobileFrontend and used it liberally, including calls to its classes. Previously, however, those calls happened only in hooks called by MF, so Zero simply stopped working in the absence of MF. This changed in [1], where Zero started using a ResourceLoader module from MF.
== What happened ==
At 23:02 UTC, after Zero extension updates were deployed, the fatal monitor was flooded with:
-- Fatal error: Class 'MFResourceLoaderModule' not found in /usr/local/apache/common-local/php-1.23wmf1/includes/resourceloader/ResourceLoader.php on line 408
The issue was tracked down to Wikidata having MobileFrontend disabled while ZeroRatedMobileAccess was enabled. Page views were not directly affected; however, every load.php call that requested the startup module caused a fatal, because ResourceLoader attempted to instantiate the MFResourceLoaderModule class and couldn't find it. As a consequence, people might have seen pages without styles or scripts.
A number of people (MaxSem, Reedy, Roan, Greg, and possibly others) gave great assistance in tracking down the issue and rapidly disabling the ZeroRatedMobileAccess extension on Wikidata. Furthermore, a mobile configuration change [2] will add an additional guard against calling ZeroRatedMobileAccess.php unless it's explicitly within the context of MF.
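To make the failure mode and the guard concrete, here is a toy model in Python. It is not MediaWiki code: the `ModuleRegistry` class, the `register_zero_modules` function, and the `zero.example` module name are all hypothetical, and only the class name `MFResourceLoaderModule` comes from the error message above. It sketches why lazy registration let the problem go unnoticed until a startup request, and how checking for the class first avoids it.

```python
# Toy model only; every name except MFResourceLoaderModule is hypothetical.

class ModuleRegistry:
    """Maps module names to the name of the class that implements them."""

    def __init__(self, known_classes):
        self.known_classes = set(known_classes)
        self.modules = {}

    def register(self, name, class_name):
        # Registration is lazy: nothing verifies that class_name exists yet.
        self.modules[name] = class_name

    def instantiate_all(self):
        # Stands in for a startup request, which touches every registered module.
        for name, class_name in self.modules.items():
            if class_name not in self.known_classes:
                raise LookupError(f"Class '{class_name}' not found (module {name})")
        return sorted(self.modules)


def register_zero_modules(registry, guarded):
    """Register Zero's MF-based module; with the guard, skip it when MF is absent."""
    if guarded and "MFResourceLoaderModule" not in registry.known_classes:
        return False
    registry.register("zero.example", "MFResourceLoaderModule")
    return True


if __name__ == "__main__":
    # A wiki without MobileFrontend: the guarded path registers nothing and stays healthy.
    wiki = ModuleRegistry(known_classes=["ResourceLoaderFileModule"])
    print(register_zero_modules(wiki, guarded=True))  # False
    print(wiki.instantiate_all())                     # []
```

Without the guard, registration itself succeeds and the error only surfaces later in `instantiate_all`, which roughly matches how the fatals appeared on load.php rather than at deploy time.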
Thank you to everyone!!!
== Timeline ==
All times in UTC
- 22:48 Zero 1.22wmf22 deployed, no errors
- 23:02 Zero 1.23wmf1 deployed, first errors appear - initially unnoticed
- 23:08 A small MobileFrontend change deployed
- 23:09 Errors noticed, initially linked with MobileFrontend push
- 23:17 Max reverts his MobileFrontend changes, errors don't go away
- 23:22 Problem narrowed down
- 23:27 Fix deployed
== Recommendations ==
- Allow a bit more time between deployments and observe fatalmonitor before and after
- Ensure the Zero extension checks that the Mobile extension is loaded before enabling itself if it relies on MFResourceLoaderModule.
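The first recommendation can be sketched as a before/after comparison of fatal counts around a deployment. This is only an illustration in Python: the `(timestamp, message)` entry format, the function names, and the five-minute window are assumptions, and it does not reflect how fatalmonitor actually works.

```python
from datetime import datetime, timedelta


def count_fatals(entries, start, end):
    """Count entries in [start, end) whose message mentions a fatal error."""
    return sum(1 for ts, msg in entries if start <= ts < end and "Fatal error" in msg)


def fatals_around_deploy(entries, deploy_time, window=timedelta(minutes=5)):
    """Return (fatals before, fatals after) within `window` of the deployment."""
    before = count_fatals(entries, deploy_time - window, deploy_time)
    after = count_fatals(entries, deploy_time, deploy_time + window)
    return before, after


if __name__ == "__main__":
    # Made-up sample entries, not real log data.
    deploy = datetime(2013, 10, 30, 23, 2)
    sample = [
        (datetime(2013, 10, 30, 23, 0), "Notice: something harmless"),
        (datetime(2013, 10, 30, 23, 3), "Fatal error: Class 'MFResourceLoaderModule' not found"),
        (datetime(2013, 10, 30, 23, 4), "Fatal error: Class 'MFResourceLoaderModule' not found"),
    ]
    print(fatals_around_deploy(sample, deploy))  # (0, 2)
```

A jump from the first number to the second right after a push is the signal to pause before the next deployment goes out.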
[1] https://gerrit.wikimedia.org/r/#/c/83133
[2] https://gerrit.wikimedia.org/r/#/c/92811
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
With all my prep work completed ahead of time, I can get a CentralNotice LD out to both production branches in about 15 minutes (waiting on the Jenkins merge is the longest bit of that). I watch both the fatal and exception logs whilst doing it and then quickly run through the patches to make sure it's all working.
I've felt pressured in the LD to get stuff out and myself out of the way when there have been more than two people in it -- which does correlate with my 15 minute estimate for the fastest I feel I can safely deploy.
~Matt Walker
Wikimedia Foundation
Fundraising Technology Team
Speaking of the exception log, I personally use https://gerrit.wikimedia.org/r/38252 to monitor it. Unfortunately, it's still not reviewed for everyone to use :P
On 31/10/13 18:10, Max Semenik wrote:
Speaking of exception log, I personally use https://gerrit.wikimedia.org/r/38252 to monitor it, unfortunately it's still not reviewed for everyone to use :P
Max, rebase that change and let's get it reviewed/merged in :-]
<quote name="Matthew Walker" date="2013-10-31" time="10:08:05 -0700">
I've felt pressured in the LD to get stuff out and myself out of the way when there have been more than two people in it -- which does correlate with my 15 minute estimate for the fastest I feel I can safely deploy.
Thanks for that perspective, Matt.
I think that is a reasonable cutoff (15 minutes per LD participant, thus 2 per LD window) that will still allow people to use the LDs but also keep our sanity (and site) safe.
edited LD page: https://wikitech.wikimedia.org/wiki/Lightning_deployments
Greg
I think we should change 2 things about LDs:
1) Move them at least 1 hour earlier so that we never end up in a situation where someone deploys a change and goes home. This should be easier now that more teams are jumping on the train, thus giving up their windows.
2) Extend lightning deployments to 1 full hour. This will vastly reduce any possibility of rush, thus reducing possible errors. This will also allow 3-4 different people to deploy safely instead of the proposed 2.
On Thu, Oct 31, 2013 at 2:48 PM, Max Semenik msemenik@wikimedia.org wrote:
- Move them at least 1 hour earlier so that we never end in the situation
when someone deploys a change and goes home.
With people in all different timezones, I don't think "never having someone deploy then go home" is going to be possible. For myself, for example, even if we move the LD an hour earlier I'll still normally have been "home" for an hour before LD starts.
On 31/10/13 21:19, Brad Jorsch (Anomie) wrote:
With people in all different timezones, I don't think "never having someone deploy then go home" is going to be possible. For myself, for example, even if we move the LD an hour earlier I'll still normally have been "home" for an hour before LD starts.
I remember doing changes after my lunch when Chad (then on the east coast) was enjoying his morning coffee.
So what about opening a second LD slot earlier in the day? European afternoons and US east coast mornings overlap nicely.