Hello,
We have communicated https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/JFMU43374T64BTJWI6WLZKLOJ4FL4PRP/ this change in August 2022 but here is a reminder that if you query externallinks table in wikireplicas, you will need to rework your queries.
Changes are: el_to, el_index and el_index_60 fields will be dropped and you need to query el_to_domain_index and el_to_path fields instead.
The data has been migrated in all wikis except English Wikipedia, Wikimedia Commons and Wikidata, you can keep track of the data migration in T326314 https://phabricator.wikimedia.org/T326314.
In late June, we will stop updating the old fields and will start dropping them
This is done to optimize storage of external links (drastically reducing its size by removing duplication) and enabling user requested features such as resolving two https://phabricator.wikimedia.org/T14810 fifteen-year-old tickets https://phabricator.wikimedia.org/T17218 or easier aggregating of external links domain [1], something that wasn’t possible until now.
[1] For example, If you want to get list of the top-linked websites, you can query `SELECT el_to_domain_index, count(*) from externallinks group by el_to_domain_index order by count(*) desc limit 50;`
Thank you and sorry for the inconvenience.
Hi, The data has been migrated everywhere for a while now. We are stopping to write to the old columns (el_to, el_index, and el_index_60) and have already dropped it in several places.
If you don't update your tools and queries, it'll break in the next couple of weeks.
Best
Am Mi., 7. Juni 2023 um 16:38 Uhr schrieb Amir Sarabadani < asarabadani@wikimedia.org>:
Hello,
We have communicated https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/JFMU43374T64BTJWI6WLZKLOJ4FL4PRP/ this change in August 2022 but here is a reminder that if you query externallinks table in wikireplicas, you will need to rework your queries.
Changes are: el_to, el_index and el_index_60 fields will be dropped and you need to query el_to_domain_index and el_to_path fields instead.
The data has been migrated in all wikis except English Wikipedia, Wikimedia Commons and Wikidata, you can keep track of the data migration in T326314 https://phabricator.wikimedia.org/T326314.
In late June, we will stop updating the old fields and will start dropping them
This is done to optimize storage of external links (drastically reducing its size by removing duplication) and enabling user requested features such as resolving two https://phabricator.wikimedia.org/T14810 fifteen-year-old tickets https://phabricator.wikimedia.org/T17218 or easier aggregating of external links domain [1], something that wasn’t possible until now.
[1] For example, If you want to get list of the top-linked websites, you can query `SELECT el_to_domain_index, count(*) from externallinks group by el_to_domain_index order by count(*) desc limit 50;`
Thank you and sorry for the inconvenience.
*Amir Sarabadani (he/him)* Staff Database Architect Wikimedia Foundation https://wikimediafoundation.org/
Hi Amir,
What is the recommended best practice for the (rather obvious) task of obtaining the actual external link URL from the new version of the externallinks table?
It seems that several users have grappled with this after the breaking change was deployed. [[User:Cryptic]] has a solution at https://quarry.wmcloud.org/query/77235 (yay for searchable Quarry queries!):
SELECT [...] CONCAT(REGEXP_REPLACE(el_to_domain_index,
'^(.*?://)(?:([^.]+)\.)([^.]+\.)?([^.]+\.)?([^.]+\.)?([^.]+\.)?([^.]+\.)?([^.]+\.)?([^.]+\.)$', '\1\9\8\7\6\5\4\3\2'), el_to_path) AS url FROM externallinks
This seems to work (thanks Cryptic!), but 1. is not very discoverable, 2. looks a bit hacky, and 3. might have introduced some performance disadvantages compared to the simpler "SELECT el_to" that was possible with the old version of the table.
I know that breaking changes are sometimes necessary and trust your judgement of the technical tradeoffs (the storage savings look impressive). But it's a bit surprising that the announcement touted various user-facing advantages without including advice about how to best mitigate this quite evident downside of the new format. A good place to do that now might be https://www.mediawiki.org/wiki/Manual:Externallinks_table . (I see that this page has already received some updates, largely by volunteers, in recent months *after* the breaking change was deployed. I would encourage WMF to be more proactive with this in the future. Mailing list announcements like this are great, but not a substitute for permanent documentation.)
Regards, Tilman
On Wed, Aug 9, 2023 at 3:02 AM Amir Sarabadani asarabadani@wikimedia.org wrote:
Hi, The data has been migrated everywhere for a while now. We are stopping to write to the old columns (el_to, el_index, and el_index_60) and have already dropped it in several places.
If you don't update your tools and queries, it'll break in the next couple of weeks.
Best
Am Mi., 7. Juni 2023 um 16:38 Uhr schrieb Amir Sarabadani < asarabadani@wikimedia.org>:
Hello,
We have communicated https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/JFMU43374T64BTJWI6WLZKLOJ4FL4PRP/ this change in August 2022 but here is a reminder that if you query externallinks table in wikireplicas, you will need to rework your queries.
Changes are: el_to, el_index and el_index_60 fields will be dropped and you need to query el_to_domain_index and el_to_path fields instead.
The data has been migrated in all wikis except English Wikipedia, Wikimedia Commons and Wikidata, you can keep track of the data migration in T326314 https://phabricator.wikimedia.org/T326314.
In late June, we will stop updating the old fields and will start dropping them
This is done to optimize storage of external links (drastically reducing its size by removing duplication) and enabling user requested features such as resolving two https://phabricator.wikimedia.org/T14810 fifteen-year-old tickets https://phabricator.wikimedia.org/T17218 or easier aggregating of external links domain [1], something that wasn’t possible until now.
[1] For example, If you want to get list of the top-linked websites, you can query `SELECT el_to_domain_index, count(*) from externallinks group by el_to_domain_index order by count(*) desc limit 50;`
Thank you and sorry for the inconvenience.
*Amir Sarabadani (he/him)* Staff Database Architect Wikimedia Foundation https://wikimediafoundation.org/
Cloud mailing list -- cloud@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
Hi Tilman, Sorry for the late reply.
Regarding finding the actual link from the row. The recommended way is to do processing in code afterwards. That's what MediaWiki does (in https://gerrit.wikimedia.org/g/mediawiki/core/+/80790ffc21a49fbe7709eaf5ce63...) and you can easily replicate the logic of LinkFilter::reverseIndexes() in your programming language of choice. Doing all of data processing in SQL is not recommended.
Indeed, we these changes are really necessary. For example with the current growth of Wikimedia Commons we will have to resort to more drastic actions if its database growth doesn't slow down (See https://phabricator.wikimedia.org/T343131 and https://phabricator.wikimedia.org/F37157040). Noting that database growth is not always about the wiki's growth, lots of times it's just high use of some features of mediawiki (in here, templates and external links).
We will keep in mind to update documentation for further work (and thank you for the feedback!). The next will be pagelinks.
Best
While I understand the motivation for the change, I think in this case the label reversing used is more complex than needed. It would have been simpler (and thus likely more efficient / less bug-prone) to reverse the whole hostnames, storing moc.elpmaxe.www instad of com.example.www
On Mon, 11 Dec 2023 at 15:34, Amir Sarabadani asarabadani@wikimedia.org wrote:
Hi Tilman, Sorry for the late reply.
Regarding finding the actual link from the row. The recommended way is to do processing in code afterwards. That's what MediaWiki does (in https://gerrit.wikimedia.org/g/mediawiki/core/+/80790ffc21a49fbe7709eaf5ce63...) and you can easily replicate the logic of LinkFilter::reverseIndexes() in your programming language of choice. Doing all of data processing in SQL is not recommended.
Indeed, we these changes are really necessary. For example with the current growth of Wikimedia Commons we will have to resort to more drastic actions if its database growth doesn't slow down (See https://phabricator.wikimedia.org/T343131 and https://phabricator.wikimedia.org/F37157040). Noting that database growth is not always about the wiki's growth, lots of times it's just high use of some features of mediawiki (in here, templates and external links).
We will keep in mind to update documentation for further work (and thank you for the feedback!). The next will be pagelinks.
Best _______________________________________________ Cloud mailing list -- cloud@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/