After many years of work, I'm happy to announce a milestone in addressing
one of our major areas of tech debt in database infrastructure: we have
eliminated all schema drifts between MediaWiki core and production.
It all started six years ago when users in English Wikipedia reported that
checking history of some pages is quite slow *at random*. More in-depth
analysis showed the revision table in English Wikipedia was missing an
important index in some of the replicas. An audit of the schema of the
revision table revealed much bigger drifts in the revision table of that
Wiki. You can read more in its ticket: T132416
Lack of schema parity between expectation and reality is quite dangerous.
Trying to force an index in code assuming it would exist in production
(under the same name) would cause fatal error every time it’s attempted.
Trying to write to a field that doesn’t exist is similar. Such changes
easily pass tests and work well in our test setups (such as beta cluster)
just to cause an outage in production.
If only one table in one Wiki had this many drifts, looking at all Wikis
and all tables became of vital importance. We have around ~1,000 wikis,
~200 hosts (each one hosting on average ~100 Wikis), and each Wiki has
around ~130 tables (half of them being tables from MediaWiki core) and each
table can have multiple drifts.
We slowly started looking for and addressing schema drifts five years ago
and later automated the discovery by utilizing abstract schema (before
that, the tool had to parse SQL) and discovered an overwhelming number of
drifts. You can look at the history of the work in T104459
Around fifty tickets addressing the drifts have been completed and they are
collected in T312538 <https://phabricator.wikimedia.org/T312538>. I suggest
checking some of them to see the scale of the work done. Each one of these
tickets took days to months of work to finish. Large number of them also
existed in primary databases, requiring a primary switchover and read-only
time for one or more Wikis. Each drift was different, in some cases, you
needed to change the code and not production so it needed a thorough
Why do such drifts happen? The most common reason was when a schema change
happened in code but it was never requested to be applied in production.
For example, a schema change in code in 2007 led to having any wiki created
before that date to have a different schema than wikis created after it. We
and tooling to make sure this doesn’t happen anymore in 2015 but we still
needed to address previous drifts. The second common reason was when a host
didn’t get the schema change for various reasons (was out of rotation when
the schema was being applied, a shortcoming of the manual process). By
automating <https://wikitech.wikimedia.org/wiki/Auto_schema> most of the
schema change operational work we reduced the chance of such drifts from
happening as well.
After finishing core, we now need to look at WMF-deployed extensions,
starting with FlaggedRevs <https://phabricator.wikimedia.org/T313253> that,
while being deployed to only 50 wikis and having only 8 tables, has ~7,000
drifts. Thankfully, most other extensions are in a healthier state.
I would like to personally thank Manuel Arostegui and Jaime Crespo for
their monumental dedication to fix these issues in the past years. Also a
big thank you to several of our amazing developers, Umherirrender, James
Forrester and Sam Reed who helped on reporting, going through the history
of MediaWiki to figure out why these drifts happened, and helping build the
*Amir Sarabadani (he/him)*
Staff Database Architect
Wikimedia Foundation <https://wikimediafoundation.org/>