TLDR: Those with shell access at WMF can now run maintenance scripts on mwdebug hosts, and use the *--profiler=text* option to produce a report detailing how long the script spent in each MediaWiki component, class, and function.
== *What* ==
The MediaWiki platform at WMF consists of broadly four different deployments: app servers, api_app servers, job runners, and maint servers (diagram https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF#/media/File:MediaWiki_infrastructure_2022.png). Each can be thought of as its own service for a specific purpose, composed of a subset of components from the MediaWiki codebase with fast local access to each. The largest of these is the appservers cluster, which is served by 150 servers of dedicated hardware across the two application data centers https://wikitech.wikimedia.org/wiki/Data_centers, and is responsible for responding to index.php (i.e. page views) and load.php (CSS, JS, and localisation via ResourceLoader).
Today, we focus on the smallest of these: mwmaint servers https://wikitech.wikimedia.org/wiki/Maintenance_server. This is backed by two heavy-duty servers, one in each data center, that autonomously run essential tasks at a predefined schedule (i.e. not in direct response to a user action). Each of these ~50 different tasks is implemented as a MediaWiki maintenance script. Important examples include: sending email notifications (Echo extension), timely pruning of sensitive PII (CheckUser extension), computing mentee and link recommendation data (GrowthExperiments), and reclaiming disk space for expired caches (core/ParserCache).
== *Why ==*
We have detailed debug performance profiling in production for web requests via the WikimediaDebug extension, and we have detailed profiling in local development for both web requests and maintenance scripts (Docker recipe https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/Profiling).
What was missing is a way to profile maintenance scripts in production. This is important as maintenance scripts tend take many minutes or hours to process vast amounts of production data. While generally easy to debug locally for functional analysis, the performance bottlenecks individual teams care about are likely specific to the size of the data and the performance of other production components.
Thanks also to Ahmon Dancy (RelEng), Giuseppe Lavagetto (SRE), and Aaron Schulz (Performance Team) for making this work possible, and Niklas Laxström (LangEng) for coming up with the idea.
== *What's New* ==
Documentation: https://wikitech.wikimedia.org/wiki/WikimediaDebug#Plaintext_CLI_profile
To profile a Maintenance script, run the script from the shell with *mwscript* as you normally would, but instead of connecting your terminal to mwmaint1002.eqiad.wmnet, connect to one of the *mwdebug* hosts (such as mwdebug1001.eqiad.wmnet). Then pass the *--profiler=text* option to generate a report with the performance analysis, which will be printed after the task is finished. Like so:
$ mwscript showSiteStats.php --wiki=nlwiki --profiler=text Number of articles: 2122688 Number of users : 1276507
<!-- 100.00% 114.964 1 - main() … 22.42% 25.776 1 - ShowSiteStats::execute 16.61% 19.096 2 - Wikimedia\Rdbms\LoadBalancer::getServerConnection 4.80% 5.522 1 - Maintenance::shutdown 4.41% 5.065 1 - Wikimedia\Rdbms\Database::initConnection 3.07% 3.530 1 - DeferredUpdates::doUpdates 2.66% 3.061 1 - Wikimedia\Rdbms\Database::select 2.38% 2.739 1 - Wikimedia\Rdbms\Database::query 1.95% 2.240 1 - section.SELECT * FROM `site_stats` LIMIT N 1.48% 1.700 1 - Wikimedia\Rdbms\DatabaseMysqli::mysqlConnect … -->
== *A peak behind the curtain* ==
Read on if you'd like to learn what hurdles we had to overcome for this to "simply" work in production, like it did for local development. The journey started when Niklas (WMF LangEng) proposed the idea at https://phabricator.wikimedia.org/T253547.
*Firstly*, the profiler engine. In 2019 (blogpost https://techblog.wikimedia.org/2019/12/16/wikimediadebug-v2-is-here/), after we migrated from HHVM to PHP 7, we had to look for a new profiler engine for backend performance. We adopted the open source php-tideways package, and this has powered our browser-facing profiler since. Naturally, this was already installed on the mwdebug servers for that purpose. However, the package, and the accompanying *rdtsc* setting, were only set for php-fpm (web server), it was not yet enabled for php-cli (command-line).
*Secondly*, the Profiler component in MediaWiki core had gotten out of sync with the needs of the Skin and Maintenance components. Over years of refactoring and more parts of each component gaining an active owner, the parts that lacked an owner eroded and stopped working, including the integration between these components. We decided to take active ownership over the remaining parts of MediaWiki-Core-Profiler and fix the disconnect. The meta work for that included identifying and re-triaging open issues under a new #mediawiki-core-profiler https://phabricator.wikimedia.org/tag/mediawiki-core-profiler/ tag, automating discovery to our team inbox via Phabricator Herald rule, enlisting on Maintainers https://www.mediawiki.org/wiki/Developers/Maintainers, and automatic discovery of changesets to our code review dashboard. https://gerrit.wikimedia.org/r/p/wikimedia/+/dashboard/teams:performance
The "output" step of the profiler was originally the responsibility of a wfLogProfilingData function that both webpages (Skin) and CLI (Maintenance) called upon toward the end of the response. After this function became deprecated, and further refactoring in the Skin and Maintenance component took on (some of) its responsibilities directly, the "output" step got lost. Adding this back in was non-trivial because by now, the code in question had gained an unintentional dependency on the WebRequest object, which is not valid in a CLI context. The code in question was simplified and decoupled in change 725152 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/725152/ and change 725440 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/725440, when then made it possible to add back the "output" step in change 838884 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/838884/. We also removed various options https://gerrit.wikimedia.org/r/q/message:profiler+project:mediawiki/core+branch:master+is:merged (such as CMDLINE_CALLBACK https://phabricator.wikimedia.org/T305422) that we choose to not support and maintain going forward (undocumented, unused, broken, or without known use case after research).
*Thirdly*, in parallel with our work above, another team refactored MediaWiki's SettingsLoader and MaintenanceRunner components. The Profiler dates back more than a decade and still relied on the fact that settings could be changed by CLI script options such as *--profiler=text*. The new SettingsLoader and the refactored MaintenanceRunner streamlined the order of operations during process startup, which had the side effect of initialising the profiler slightly earlier than it used to. This meant it would initialise before the --profiler=text option was applied, and thus the profiler that was initialised was unconditionally the "null" profiler. This did not produce an error since that is a valid configuration, the configuration of effectively all traffic besides ad-hoc debugging. I remedied this https://gerrit.wikimedia.org/r/c/mediawiki/core/+/875396/ by recognising the CLI option as its own setting (separate from the main one), and passing it down via dependency injection. Thus not relying on the subtle order of operations.
The Profiler component is now significantly leaner than it was before, and its requirements are either explicitly coded through dependency injection requirements, or simplified/refactored such that they do not place requirements on other components.
*Fourthly*, the wmf-config repository, where we control how and when MediaWiki Profiler can be enabled, had been changed by us the year before, to feature a new sampling profiler to produce detailed flame graphs over production traffic (blog post https://techblog.wikimedia.org/2021/03/03/profiling-php-in-production-at-scale/). I realised that this configuration assumed a web context. Another tweak over there https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874419/ made this safe to enabled in CLI/Maintenance script contexts.
*Fifthly*, in order to actually enable the wmf-config/Profiler code in CLI, we needed to set the same *php.ini* configuration for php-cli as we already did years earlier for php-fpm (web server). In doing so, we ran into the problem that one server in particular (deploy1002, from where the MediaWiki train runs via Scap) had two mutually exclusive copies of the MediaWiki deployment (/srv/mediawiki and /srv/mediawiki-staging). This meant that preloading a profiler for either of them from php-cli would break the other. I identified this long-standing technical debt to Ahmon Dancy (RelEng), who then went on a whole journey of his own to radically simplify our deployment servers to not have these two separate installations (T329857 https://phabricator.wikimedia.org/T329857).
*Finally*, with a one-line change in Puppet https://gerrit.wikimedia.org/r/c/operations/puppet/+/910882/ config, the CLI profiler is now enabled in production (Thanks Giuseppe Lavagetto, from SRE). All in all, this made the different parts of codebase, and different parts of our platform less divergent and more unified than before.
Thanks again to Aaron Schulz, Ahmon Dancy, Giuseppe Lavagetto, and Niklas Laxström for their help!
-- Timo Tijhof, Principal Engineer, Performance Team, Wikimedia Foundation.
wikitech-l@lists.wikimedia.org