Hi,
We're currently in the process of upgrading the MediaWiki servers to Debian Buster and expect a performance regression to come with it.
The cause appears to be better Spectre[1] mitigations in the Buster 4.19 kernel, which we can't disable. Most of the effect is seen in code that ends up making syscalls, e.g. through PHP functions like filemtime, file_get_contents, etc.
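For anyone who wants a rough feel for per-syscall overhead on a given host, here is a minimal sketch. It is not the method used in the investigation. It is written in Python rather than PHP purely for convenience, with os.stat standing in for filemtime and a plain file read standing in for file_get_contents. The function names (time_per_call, read_file, run_benchmark) and the iteration count are made up for illustration.

```python
import os
import tempfile
import time


def time_per_call(fn, iterations=20000):
    """Return the mean wall-clock time of fn() in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6


def read_file(path):
    """Open, read and close a file (roughly what file_get_contents does)."""
    with open(path, "rb") as handle:
        return handle.read()


def run_benchmark(iterations=20000):
    """Time stat() and a small file read against a scratch file."""
    # Use a throwaway file so the benchmark does not depend on any real path.
    with tempfile.NamedTemporaryFile(delete=False) as scratch:
        scratch.write(b"x" * 1024)
        path = scratch.name
    try:
        return {
            "stat_us": time_per_call(lambda: os.stat(path), iterations),
            "read_us": time_per_call(lambda: read_file(path), iterations),
        }
    finally:
        os.unlink(path)


if __name__ == "__main__":
    for name, micros in run_benchmark().items():
        print(f"{name}: {micros:.2f} us per call")
```

Running the same script on a Stretch host and a Buster host with comparable hardware gives only a crude comparison. It says nothing about how often real MediaWiki requests hit these paths.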
I posted some numbers and charts on the Phabricator investigation ticket[2]. For normal requests it looks like ~5% worse for p50/p75 and ~13% worse for p95/p99. API requests look much worse, at 10% for p50 and 22% for p75.
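In case it helps to read the figures above: each one is the relative change in a latency percentile between the two distros. A small sketch of that arithmetic follows. It assumes you have two lists of request latencies and uses the nearest-rank percentile definition. The actual numbers on the ticket may have been computed differently, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def regression_percent(baseline, candidate, pct):
    """Relative change (in %) of the given percentile, candidate vs. baseline."""
    base = percentile(baseline, pct)
    new = percentile(candidate, pct)
    return (new - base) / base * 100
```

For example, regression_percent(stretch_latencies, buster_latencies, 95) would give the p95 change, where both arguments are lists you collected yourself.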
What now? We're going to continue with the upgrade as planned, but we also need help to try and make some performance improvements to reduce the impact of the regression.
The PHP profiling flamegraphs[3] are a great tool to use to identify potentially slow spots. We now also have flamegraphs that only contain Buster requests. I created a set of differential flamegraphs[4] that compare Stretch vs Buster so you can see what specific areas slowed down.
You can also use WikimediaDebug/XHGui[5] to profile a specific request. mwdebug1001/mwdebug1002 are Stretch and mwdebug1003 is Buster.
If you have questions or suggestions please ask or let us know. Thanks to everyone who helped with the investigation and those who've started working on improvements already.
[1] https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)
[2] https://phabricator.wikimedia.org/T273312#6802330
[3] https://performance.wikimedia.org/php-profiling/
[4] https://people.wikimedia.org/~legoktm/T273312/data/clean/images/flamegraphs/
[5] https://wikitech.wikimedia.org/wiki/WikimediaDebug#Request_profiling
-- Kunal
Anyone who would like to follow the upgrade status, or wants to know exactly which server is currently on which distro version, is welcome to do so at:
https://docs.google.com/spreadsheets/d/1Ris18-joRFfd3OHjGJIraVUk-bpmIRORsPom...
It also tells you which servers have the special roles of scap proxy, mcrouter proxy, canary, and which are VMs (just mwdebug).
There is currently one debug server on buster (mwdebug1003) but we are going to provide the full set soon (https://phabricator.wikimedia.org/T274023).
For canary servers we are aiming to have both distros for the transitional period, and the situation is currently as follows:

mw1261.eqiad.wmnet stretch
mw1262.eqiad.wmnet stretch
mw1263.eqiad.wmnet buster
mw1264.eqiad.wmnet buster
mw1265.eqiad.wmnet buster
Additionally one appserver (mw1403) and one API server (mw1402) on new hardware have been designated to stay on stretch until the end to allow for comparisons.
We appreciate reports of any issues that show up only on buster servers, of any type (app, API, jobrunner/videoscaler).
Hi,
On 2/3/21 5:35 PM, Kunal Mehta wrote:
What now? We're going to continue with the upgrade as planned, but we also need help to try and make some performance improvements to reduce the impact of the regression.
A week later I'd like to highlight and recognize some of the performance improvements that have been made:
* Upgrading utfnormal to use native mbstring functions instead of PHP implementations https://phabricator.wikimedia.org/T273338 (MaxSem, James F, Reedy and myself)
* Optimizations to ApiResult https://gerrit.wikimedia.org/r/q/hashtag:%2522faster-apiresult%2522 (Daimona, Thiemo, Krinkle and James F)
* Using PCRE for faster UTF-8 validation in Parsoid https://gerrit.wikimedia.org/r/656596 (Skizzerz and cscott)
* Reducing the size of the ExtensionRegistry cache in APCU https://gerrit.wikimedia.org/r/q/hashtag:%2522smaller-extension-cache%2522 (Krinkle and myself)
* Reduce impact of HookContainer loading 500+ interfaces https://phabricator.wikimedia.org/T274041 (Skizzerz, myself, Tim Starling and Ori)
If I missed any other improvements people have been working on, my apologies, please share them! I've been using the Gerrit hashtag "faster-mw-plz" https://gerrit.wikimedia.org/r/q/hashtag:faster-mw-plz to try and track these.
-- Kunal
P.S. reimaging to Buster is 70% complete now.
These are amazing, thanks for sharing. /me bookmarks patches for bedtime reading
On Fri, Feb 12, 2021 at 04:25 Kunal Mehta legoktm@member.fsf.org wrote:
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi all, one final follow-up,
99% of appservers have been on buster for a while now, but we kept one special case in each role on stretch so that people could make stretch vs. buster comparisons, as some people had asked for.
They are: mw1307 jobrunner/videoscaler, mw1402 API server, mw1403 appserver.
I'm now planning to finally upgrade them to buster as well tomorrow, to make that 99% a 100%.
Please stop me if you still see a reason for having any stretch appserver.
Additionally, I would also delete mwdebug1003, the ganeti VM on stretch that was there just for the special stretch/buster comparison use case. Would anyone miss it?
mwdebug1001/1002 are on buster all this time and won't be changing.
On Thu, Apr 15, 2021 at 2:58 PM Daniel Zahn dzahn@wikimedia.org wrote: