Re-introducing MediaWiki CLI profiler for maintenance scripts - Wikitech-l

17 May 2023

TLDR: Those with shell access at WMF can now run maintenance scripts on mwdebug hosts, and
use the *--profiler=text* option to produce a report detailing how long the script spent
in each MediaWiki component, class, and function.

== *What* ==

The MediaWiki platform at WMF consists of broadly four different deployments: app servers,
api_app servers, job runners, and maint servers (diagram
<https://wikitech.wikimedia.org/wiki/MediaWiki_at_WMF#/media/File:MediaWiki_infrastructure_2022.png>).
Each can be thought of as its own service for a specific purpose, composed of a subset of
components from the MediaWiki codebase with fast local access to each. The largest of
these is the appservers cluster, which is served by 150 servers of dedicated hardware
across the two application data centers
<https://wikitech.wikimedia.org/wiki/Data_centers>, and is responsible for
responding to index.php (i.e. page views) and load.php (CSS, JS, and localisation via
ResourceLoader).

Today, we focus on the smallest of these: mwmaint servers
<https://wikitech.wikimedia.org/wiki/Maintenance_server>. This is backed by two
heavy-duty servers, one in each data center, that autonomously run essential tasks at a
predefined schedule (i.e. not in direct response to a user action). Each of these ~50
different tasks is implemented as a MediaWiki maintenance script. Important examples
include: sending email notifications (Echo extension), timely pruning of sensitive PII
(CheckUser extension), computing mentee and link recommendation data (GrowthExperiments),
and reclaiming disk space for expired caches (core/ParserCache).

== *Why ==*

We have detailed debug performance profiling in production for web requests via the
WikimediaDebug extension, and we have detailed profiling in local development for both web
requests and maintenance scripts (Docker recipe
<https://www.mediawiki.org/wiki/MediaWiki-Docker/Configuration_recipes/Profiling>).

What was missing is a way to profile maintenance scripts in production. This is important
as maintenance scripts tend take many minutes or hours to process vast amounts of
production data. While generally easy to debug locally for functional analysis, the
performance bottlenecks individual teams care about are likely specific to the size of the
data and the performance of other production components.

Thanks also to Ahmon Dancy (RelEng), Giuseppe Lavagetto (SRE), and Aaron Schulz
(Performance Team) for making this work possible, and Niklas Laxström (LangEng) for coming
up with the idea.

== *What's New* ==

Documentation: https://wikitech.wikimedia.org/wiki/WikimediaDebug#Plaintext_CLI_profile

To profile a Maintenance script, run the script from the shell with *mwscript* as you
normally would, but instead of connecting your terminal to mwmaint1002.eqiad.wmnet,
connect to one of the *mwdebug* hosts (such as mwdebug1001.eqiad.wmnet). Then pass the
*--profiler=text* option to generate a report with the performance analysis, which will be
printed after the task is finished. Like so:

...
  $ mwscript showSiteStats.php --wiki=nlwiki
--profiler=text
 Number of articles:  2122688
 Number of users   :  1276507

 <!--
  100.00% 114.964   1 - main()
  …
  22.42% 25.776     1 - ShowSiteStats::execute
  16.61% 19.096     2 - Wikimedia\Rdbms\LoadBalancer::getServerConnection
   4.80% 5.522      1 - Maintenance::shutdown
   4.41% 5.065      1 - Wikimedia\Rdbms\Database::initConnection
   3.07% 3.530      1 - DeferredUpdates::doUpdates
   2.66% 3.061      1 - Wikimedia\Rdbms\Database::select
   2.38% 2.739      1 - Wikimedia\Rdbms\Database::query
   1.95% 2.240      1 - section.SELECT * FROM `site_stats` LIMIT N 
   1.48% 1.700      1 - Wikimedia\Rdbms\DatabaseMysqli::mysqlConnect
   …
 --> 
== *A peak behind the curtain* ==

Read on if you'd like to learn what hurdles we had to overcome for this to
"simply" work in production, like it did for local development. The journey
started when Niklas (WMF LangEng) proposed the idea at
https://phabricator.wikimedia.org/T253547.

*Firstly*, the profiler engine. In 2019 (blogpost
<https://techblog.wikimedia.org/2019/12/16/wikimediadebug-v2-is-here/>), after we
migrated from HHVM to PHP 7, we had to look for a new profiler engine for backend
performance. We adopted the open source php-tideways package, and this has powered our
browser-facing profiler since. Naturally, this was already installed on the mwdebug
servers for that purpose. However, the package, and the accompanying *rdtsc* setting, were
only set for php-fpm (web server), it was not yet enabled for php-cli (command-line).

*Secondly*, the Profiler component in MediaWiki core had gotten out of sync with the needs
of the Skin and Maintenance components. Over years of refactoring and more parts of each
component gaining an active owner, the parts that lacked an owner eroded and stopped
working, including the integration between these components. We decided to take active
ownership over the remaining parts of MediaWiki-Core-Profiler and fix the disconnect. The
meta work for that included identifying and re-triaging open issues under a new
#mediawiki-core-profiler
<https://phabricator.wikimedia.org/tag/mediawiki-core-profiler/> tag, automating
discovery to our team inbox via Phabricator Herald rule, enlisting on Maintainers
<https://www.mediawiki.org/wiki/Developers/Maintainers>, and automatic discovery of
changesets to our code review dashboard.
<https://gerrit.wikimedia.org/r/p/wikimedia/+/dashboard/teams:performance>

The "output" step of the profiler was originally the responsibility of a
wfLogProfilingData function that both webpages (Skin) and CLI (Maintenance) called upon
toward the end of the response. After this function became deprecated, and further
refactoring in the Skin and Maintenance component took on (some of) its responsibilities
directly, the "output" step got lost. Adding this back in was non-trivial
because by now, the code in question had gained an unintentional dependency on the
WebRequest object, which is not valid in a CLI context. The code in question was
simplified and decoupled in change 725152
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/725152/> and change 725440
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/725440>, when then made it
possible to add back the "output" step in change 838884
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/838884/>. We also removed various
options
<https://gerrit.wikimedia.org/r/q/message:profiler+project:mediawiki/core+branch:master+is:merged>
(such as CMDLINE_CALLBACK <https://phabricator.wikimedia.org/T305422>) that we
choose to not support and maintain going forward (undocumented, unused, broken, or without
known use case after research).

*Thirdly*, in parallel with our work above, another team refactored MediaWiki's
SettingsLoader and MaintenanceRunner components. The Profiler dates back more than a
decade and still relied on the fact that settings could be changed by CLI script options
such as *--profiler=text*. The new SettingsLoader and the refactored MaintenanceRunner
streamlined the order of operations during process startup, which had the side effect of
initialising the profiler slightly earlier than it used to. This meant it would initialise
before the --profiler=text option was applied, and thus the profiler that was initialised
was unconditionally the "null" profiler. This did not produce an error since
that is a valid configuration, the configuration of effectively all traffic besides ad-hoc
debugging. I remedied this
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/875396/> by recognising the CLI
option as its own setting (separate from the main one), and passing it down via dependency
injection. Thus not relying on the subtle order of operations. 

The Profiler component is now significantly leaner than it was before, and its
requirements are either explicitly coded through dependency injection requirements, or
simplified/refactored such that they do not place requirements on other components.

*Fourthly*, the wmf-config repository, where we control how and when MediaWiki Profiler
can be enabled, had been changed by us the year before, to feature a new sampling profiler
to produce detailed flame graphs over production traffic (blog post
<https://techblog.wikimedia.org/2021/03/03/profiling-php-in-production-at-scale/>).
I realised that this configuration assumed a web context. Another tweak over there
<https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/874419/> made this
safe to enabled in CLI/Maintenance script contexts.

*Fifthly*, in order to actually enable the wmf-config/Profiler code in CLI, we needed to
set the same *php.ini* configuration for php-cli as we already did years earlier for
php-fpm (web server). In doing so, we ran into the problem that one server in particular
(deploy1002, from where the MediaWiki train runs via Scap) had two mutually exclusive
copies of the MediaWiki deployment (/srv/mediawiki and /srv/mediawiki-staging). This meant
that preloading a profiler for either of them from php-cli would break the other. I
identified this long-standing technical debt to Ahmon Dancy (RelEng), who then went on a
whole journey of his own to radically simplify our deployment servers to not have these
two separate installations (T329857 <https://phabricator.wikimedia.org/T329857>).

*Finally*, with a one-line change in Puppet
<https://gerrit.wikimedia.org/r/c/operations/puppet/+/910882/> config, the CLI
profiler is now enabled in production (Thanks Giuseppe Lavagetto, from SRE). All in all,
this made the different parts of codebase, and different parts of our platform less
divergent and more unified than before.

Thanks again to Aaron Schulz, Ahmon Dancy, Giuseppe Lavagetto, and Niklas Laxström for
their help!

--
Timo Tijhof,
Principal Engineer,
Performance Team,
Wikimedia Foundation.