(If you don’t work with links tables such as templatelinks, pagelinks and so on, feel free to ignore this message)
TLDR: The schema of links tables (starting with templatelinks) will change to have numeric id pointing to linktarget table instead of repeating namespace and title.
Hello,
The current schema and storage of most links tables are: page id (the source), namespace id of the target link and title of the target. For example, if a page with id of 1 uses Template:Foo, the row in the database would be 1, 6, and Foo (Template namespace has id of 6)
Repeating the target’s title is not sustainable, for example more than half of Wikimedia Commons database is just three links tables. The sheer size of these tables makes a considerable portion of all queries slower, backups and dumps taking longer and taking much more space than needed due to unnecessary duplication. In Wikimedia Commons, on average a title is duplicated around 100 times for templatelinks and around 20 times for pagelinks. The numbers for other wikis depend on the usage patterns.
Moving forward, these tables will be normalized, meaning a typical row will hold mapping of page id to linktarget id instead. Linktarget is a new table deployed in production and contains immutable records of namespace id and string. The major differences between page and linktarget tables are: 1- linktarget values won’t change (unlike page records that change with page move) 2- linktarget values can point to non-existent pages (=red links).
The first table being done is templatelinks, then pagelinks, imagelinks and categorylinks will follow. During the migration phase both values will be accessible but we will turn off writing to the old columns once the values are backfilled and switched to be read from the new schema. We will announce any major changes beforehand but this is to let you know these changes are coming.
While the normalization of all links tables will take several years to finish, templatelinks will finish in the next few months and is the most pressing one.
So if you:
-
… rely on the schema of these tables in cloud replicas, you will need to change your tools. -
… rely on dumps of these tables, you will need to change your scripts.
Currently, templatelinks writes to both data schemes for new rows in most wikis. This week we will start backfilling the data with the new schema but it will take months to finish in large wikis.
You can keep track of the general long-term work in https://phabricator.wikimedia.org/T300222 and the specific work for templatelinks in https://phabricator.wikimedia.org/T299417. You can also read more on the reasoning in https://phabricator.wikimedia.org/T222224.
Thanks