An update on migration of templatelinks:
With the exception of English Wikipedia and Wikimedia Commons, where the
migration is still ongoing, every other wiki is now fully migrated to the
normalized schema and is being read using the linktarget table in
production.
Please migrate your tools to use the new schema as we will slowly stop
writing to tl_namespace and tl_title fields and drop them in the next month.
Best
Am Di., 29. März 2022 um 15:37 Uhr schrieb Amir Sarabadani <
asarabadani(a)wikimedia.org>gt;:
(If you don’t work with links tables such as
templatelinks, pagelinks and
so on, feel free to ignore this message)
TLDR: The schema of links tables (starting with templatelinks) will
change to have numeric id pointing to linktarget table instead of repeating
namespace and title.
Hello,
The current schema and storage of most links tables are: page id (the
source), namespace id of the target link and title of the target. For
example, if a page with id of 1 uses Template:Foo, the row in the database
would be 1, 6, and Foo (Template namespace has id of 6)
Repeating the target’s title is not sustainable, for example more than
half of Wikimedia Commons database is just three links tables. The sheer
size of these tables makes a considerable portion of all queries slower,
backups and dumps taking longer and taking much more space than needed due
to unnecessary duplication. In Wikimedia Commons, on average a title is
duplicated around 100 times for templatelinks and around 20 times for
pagelinks. The numbers for other wikis depend on the usage patterns.
Moving forward, these tables will be normalized, meaning a typical row
will hold mapping of page id to linktarget id instead. Linktarget is a new
table deployed in production and contains immutable records of namespace id
and string. The major differences between page and linktarget tables are:
1- linktarget values won’t change (unlike page records that change with
page move) 2- linktarget values can point to non-existent pages (=red
links).
The first table being done is templatelinks, then pagelinks, imagelinks
and categorylinks will follow. During the migration phase both values will
be accessible but we will turn off writing to the old columns once the
values are backfilled and switched to be read from the new schema. We will
announce any major changes beforehand but this is to let you know these
changes are coming.
While the normalization of all links tables will take several years to
finish, templatelinks will finish in the next few months and is the most
pressing one.
So if you:
-
… rely on the schema of these tables in cloud replicas, you will need
to change your tools.
-
… rely on dumps of these tables, you will need to change your scripts.
Currently, templatelinks writes to both data schemes for new rows in most
wikis. This week we will start backfilling the data with the new schema but
it will take months to finish in large wikis.
You can keep track of the general long-term work in
https://phabricator.wikimedia.org/T300222 and the specific work for
templatelinks in
https://phabricator.wikimedia.org/T299417. You can also
read more on the reasoning in
https://phabricator.wikimedia.org/T222224.
Thanks
--
*Amir Sarabadani (he/him)*
Staff Database Architect
Wikimedia Foundation <https://wikimediafoundation.org/>