Hello,

After many years of work, I'm happy to announce a milestone in addressing one of our major areas of tech debt in database infrastructure: we have eliminated all schema drifts between MediaWiki core and production.


It all started six years ago when users on English Wikipedia reported that checking the history of some pages was quite slow, *at random*. A more in-depth analysis showed that the revision table of English Wikipedia was missing an important index on some of the replicas. A full audit of that table's schema then revealed much bigger drifts on that wiki. You can read more in the ticket: T132416


Lack of schema parity between expectation and reality is quite dangerous. Forcing an index in code, assuming it exists in production under the same name, causes a fatal error every time the query runs. Writing to a field that doesn't exist fails in the same way. Such changes easily pass tests and work well in our test setups (such as the beta cluster), only to cause an outage in production.
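
To make the failure mode concrete, here is a minimal, self-contained Python sketch. It uses SQLite's INDEXED BY as a stand-in for MariaDB-style index hints (USE INDEX / FORCE INDEX); the table and index names are illustrative, not the real schema:

    import sqlite3

    # Illustrative sketch: SQLite's INDEXED BY acts like a strict index
    # hint and errors out when the named index does not exist.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revision (rev_page INTEGER, rev_timestamp TEXT)")

    # A replica that has the index answers the query normally:
    conn.execute(
        "CREATE INDEX rev_page_timestamp ON revision (rev_page, rev_timestamp)"
    )
    conn.execute(
        "SELECT rev_timestamp FROM revision INDEXED BY rev_page_timestamp"
        " WHERE rev_page = 1"
    ).fetchall()

    # On a drifted replica that is missing the index, the same query is a
    # hard error rather than merely a slow one:
    conn.execute("DROP INDEX rev_page_timestamp")
    try:
        conn.execute(
            "SELECT rev_timestamp FROM revision INDEXED BY rev_page_timestamp"
            " WHERE rev_page = 1"
        )
    except sqlite3.OperationalError as e:
        print(e)  # no such index: rev_page_timestamp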


If a single table on a single wiki had this many drifts, auditing all tables on all wikis became vitally important. We have around 1,000 wikis and around 200 database hosts (each one hosting on average around 100 wikis), each wiki has around 130 tables (half of them from MediaWiki core), and each table can have multiple drifts.
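
Multiplying those rough numbers out shows the scale of the problem (a back-of-the-envelope calculation; all figures are approximate):

    # Approximate figures from above; every physical table on every host
    # is a separate object that can drift on its own.
    hosts = 200
    wikis_per_host = 100
    tables_per_wiki = 130

    table_instances = hosts * wikis_per_host * tables_per_wiki
    print(f"{table_instances:,}")  # 2,600,000 table instances to audit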


We slowly started looking for and addressing schema drifts five years ago, and later automated the discovery by utilizing the abstract schema (before that, the tool had to parse SQL). The automation uncovered an overwhelming number of drifts. You can follow the history of the work in T104459.
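
Conceptually, the automated discovery boils down to diffing what the abstract schema declares against what a live replica actually reports. The simplified Python sketch below shows the idea; the data and the helper function are made up for illustration, while the real tool works off MediaWiki's abstract schema files and live database introspection:

    # Expected schema, as it might be derived from the abstract schema:
    expected = {
        "revision": {
            "columns": {"rev_id", "rev_page", "rev_timestamp"},
            "indexes": {"PRIMARY", "rev_page_timestamp"},
        },
    }

    # Actual schema, as introspected from one replica (e.g. via
    # information_schema); here the index is missing:
    actual = {
        "revision": {
            "columns": {"rev_id", "rev_page", "rev_timestamp"},
            "indexes": {"PRIMARY"},
        },
    }

    def find_drifts(expected, actual):
        """Yield (table, kind, name, problem) for every mismatch."""
        for table, spec in expected.items():
            live = actual.get(table)
            if live is None:
                yield (table, "table", table, "missing in production")
                continue
            for kind in ("columns", "indexes"):
                for name in spec[kind] - live[kind]:
                    yield (table, kind, name, "missing in production")
                for name in live[kind] - spec[kind]:
                    yield (table, kind, name, "not in the abstract schema")

    for drift in find_drifts(expected, actual):
        print(drift)
    # ('revision', 'indexes', 'rev_page_timestamp', 'missing in production')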


Around fifty tickets addressing the drifts have been completed; they are collected in T312538. I suggest checking some of them to see the scale of the work done. Each of these tickets took days to months of work to finish. A large number of the drifts also existed on primary databases, requiring a primary switchover and read-only time for one or more wikis. And each drift was different: in some cases the right fix was to change the code rather than production, so each one needed a thorough investigation.


Why do such drifts happen? The most common cause was a schema change that happened in code but was never requested to be applied in production. For example, a schema change in code in 2007 left any wiki created before that date with a different schema than wikis created after it. In 2015, we introduced processes and tooling to make sure this doesn't happen anymore, but we still needed to address the pre-existing drifts. The second most common cause was a host missing a schema change for various reasons (for example, being out of rotation while the change was being applied, a shortcoming of the manual process). By automating most of the operational work of schema changes, we have reduced the chance of such drifts as well.


After finishing core, we now need to look at WMF-deployed extensions, starting with FlaggedRevs, which, while deployed to only 50 wikis and having only eight tables, has ~7,000 drifts. Thankfully, most other extensions are in a healthier state.


I would like to personally thank Manuel Arostegui and Jaime Crespo for their monumental dedication to fixing these issues over the past years. Also, a big thank you to several of our amazing developers, Umherirrender, James Forrester, and Sam Reed, who helped with reporting drifts, going through the history of MediaWiki to figure out why they happened, and building the reporting tools.

Best
--
Amir Sarabadani (he/him)
Staff Database Architect
Wikimedia Foundation