(If you don’t work with links tables such as templatelinks, pagelinks and so on, feel free to ignore this message)
TLDR: The schema of links tables (starting with templatelinks) will change to have numeric id pointing to linktarget table instead of repeating namespace and title.
Hello,
The current schema and storage of most links tables are: page id (the source), namespace id of the target link and title of the target. For example, if a page with id of 1 uses Template:Foo, the row in the database would be 1, 6, and Foo (Template namespace has id of 6)
Repeating the target’s title is not sustainable, for example more than half of Wikimedia Commons database is just three links tables. The sheer size of these tables makes a considerable portion of all queries slower, backups and dumps taking longer and taking much more space than needed due to unnecessary duplication. In Wikimedia Commons, on average a title is duplicated around 100 times for templatelinks and around 20 times for pagelinks. The numbers for other wikis depend on the usage patterns.
Moving forward, these tables will be normalized, meaning a typical row will hold mapping of page id to linktarget id instead. Linktarget is a new table deployed in production and contains immutable records of namespace id and string. The major differences between page and linktarget tables are: 1- linktarget values won’t change (unlike page records that change with page move) 2- linktarget values can point to non-existent pages (=red links).
The first table being done is templatelinks, then pagelinks, imagelinks and categorylinks will follow. During the migration phase both values will be accessible but we will turn off writing to the old columns once the values are backfilled and switched to be read from the new schema. We will announce any major changes beforehand but this is to let you know these changes are coming.
While the normalization of all links tables will take several years to finish, templatelinks will finish in the next few months and is the most pressing one.
So if you:
-
… rely on the schema of these tables in cloud replicas, you will need to change your tools. -
… rely on dumps of these tables, you will need to change your scripts.
Currently, templatelinks writes to both data schemes for new rows in most wikis. This week we will start backfilling the data with the new schema but it will take months to finish in large wikis.
You can keep track of the general long-term work in https://phabricator.wikimedia.org/T300222 and the specific work for templatelinks in https://phabricator.wikimedia.org/T299417. You can also read more on the reasoning in https://phabricator.wikimedia.org/T222224.
Thanks
An update on migration of templatelinks:
With the exception of English Wikipedia and Wikimedia Commons, where the migration is still ongoing, every other wiki is now fully migrated to the normalized schema and is being read using the linktarget table in production.
Please migrate your tools to use the new schema as we will slowly stop writing to tl_namespace and tl_title fields and drop them in the next month.
Best
Am Di., 29. März 2022 um 15:37 Uhr schrieb Amir Sarabadani < asarabadani@wikimedia.org>:
(If you don’t work with links tables such as templatelinks, pagelinks and so on, feel free to ignore this message)
TLDR: The schema of links tables (starting with templatelinks) will change to have numeric id pointing to linktarget table instead of repeating namespace and title.
Hello,
The current schema and storage of most links tables are: page id (the source), namespace id of the target link and title of the target. For example, if a page with id of 1 uses Template:Foo, the row in the database would be 1, 6, and Foo (Template namespace has id of 6)
Repeating the target’s title is not sustainable, for example more than half of Wikimedia Commons database is just three links tables. The sheer size of these tables makes a considerable portion of all queries slower, backups and dumps taking longer and taking much more space than needed due to unnecessary duplication. In Wikimedia Commons, on average a title is duplicated around 100 times for templatelinks and around 20 times for pagelinks. The numbers for other wikis depend on the usage patterns.
Moving forward, these tables will be normalized, meaning a typical row will hold mapping of page id to linktarget id instead. Linktarget is a new table deployed in production and contains immutable records of namespace id and string. The major differences between page and linktarget tables are: 1- linktarget values won’t change (unlike page records that change with page move) 2- linktarget values can point to non-existent pages (=red links).
The first table being done is templatelinks, then pagelinks, imagelinks and categorylinks will follow. During the migration phase both values will be accessible but we will turn off writing to the old columns once the values are backfilled and switched to be read from the new schema. We will announce any major changes beforehand but this is to let you know these changes are coming.
While the normalization of all links tables will take several years to finish, templatelinks will finish in the next few months and is the most pressing one.
So if you:
… rely on the schema of these tables in cloud replicas, you will need to change your tools.
… rely on dumps of these tables, you will need to change your scripts.
Currently, templatelinks writes to both data schemes for new rows in most wikis. This week we will start backfilling the data with the new schema but it will take months to finish in large wikis.
You can keep track of the general long-term work in https://phabricator.wikimedia.org/T300222 and the specific work for templatelinks in https://phabricator.wikimedia.org/T299417. You can also read more on the reasoning in https://phabricator.wikimedia.org/T222224.
Thanks
*Amir Sarabadani (he/him)* Staff Database Architect Wikimedia Foundation https://wikimediafoundation.org/
Hello, Progress report: The data migration has been everywhere except on English Wikipedia. So if you use templatelinks for English Wikipedia, you still need to rely on the old schema. The data migration will finish in English Wikipedia in around two weeks.
We have stopped writing to the old columns in s5 and s2 this week (list of wikis on s5 https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist and list of wikis on s2 https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist) and we will drop those columns in the aforementioned sections next week (August 15-21). The rest of the sections (except s1, English Wikipedia) will follow the week after (August 22-28).
If you're relying on the old templatelinks schema, you are already getting incorrect data and soon your code or queries will start to error due querying missing fields.
We measured the impact of this normalization on the health of our databases. Here are the reports for s5 https://phabricator.wikimedia.org/T299417#8143953 and s2 https://phabricator.wikimedia.org/T314041#8146798.
Best
Am Di., 31. Mai 2022 um 15:42 Uhr schrieb Amir Sarabadani < asarabadani@wikimedia.org>:
An update on migration of templatelinks:
With the exception of English Wikipedia and Wikimedia Commons, where the migration is still ongoing, every other wiki is now fully migrated to the normalized schema and is being read using the linktarget table in production.
Please migrate your tools to use the new schema as we will slowly stop writing to tl_namespace and tl_title fields and drop them in the next month.
Best
Am Di., 29. März 2022 um 15:37 Uhr schrieb Amir Sarabadani < asarabadani@wikimedia.org>:
(If you don’t work with links tables such as templatelinks, pagelinks and so on, feel free to ignore this message)
TLDR: The schema of links tables (starting with templatelinks) will change to have numeric id pointing to linktarget table instead of repeating namespace and title.
Hello,
The current schema and storage of most links tables are: page id (the source), namespace id of the target link and title of the target. For example, if a page with id of 1 uses Template:Foo, the row in the database would be 1, 6, and Foo (Template namespace has id of 6)
Repeating the target’s title is not sustainable, for example more than half of Wikimedia Commons database is just three links tables. The sheer size of these tables makes a considerable portion of all queries slower, backups and dumps taking longer and taking much more space than needed due to unnecessary duplication. In Wikimedia Commons, on average a title is duplicated around 100 times for templatelinks and around 20 times for pagelinks. The numbers for other wikis depend on the usage patterns.
Moving forward, these tables will be normalized, meaning a typical row will hold mapping of page id to linktarget id instead. Linktarget is a new table deployed in production and contains immutable records of namespace id and string. The major differences between page and linktarget tables are: 1- linktarget values won’t change (unlike page records that change with page move) 2- linktarget values can point to non-existent pages (=red links).
The first table being done is templatelinks, then pagelinks, imagelinks and categorylinks will follow. During the migration phase both values will be accessible but we will turn off writing to the old columns once the values are backfilled and switched to be read from the new schema. We will announce any major changes beforehand but this is to let you know these changes are coming.
While the normalization of all links tables will take several years to finish, templatelinks will finish in the next few months and is the most pressing one.
So if you:
… rely on the schema of these tables in cloud replicas, you will need to change your tools.
… rely on dumps of these tables, you will need to change your scripts.
Currently, templatelinks writes to both data schemes for new rows in most wikis. This week we will start backfilling the data with the new schema but it will take months to finish in large wikis.
You can keep track of the general long-term work in https://phabricator.wikimedia.org/T300222 and the specific work for templatelinks in https://phabricator.wikimedia.org/T299417. You can also read more on the reasoning in https://phabricator.wikimedia.org/T222224.
Thanks
*Amir Sarabadani (he/him)* Staff Database Architect Wikimedia Foundation https://wikimediafoundation.org/