Hi Amir,

What is the recommended best practice for the (rather obvious) task of obtaining the actual external link URL from the new version of the externallinks table? 

It seems that several users have grappled with this after the breaking change was deployed. [[User:Cryptic]] has a solution at https://quarry.wmcloud.org/query/77235 (yay for searchable Quarry queries!):

SELECT [...] CONCAT(REGEXP_REPLACE(el_to_domain_index,
                             '^(.*?://)(?:([^.]+)\\.)([^.]+\\.)?([^.]+\\.)?([^.]+\\.)?([^.]+\\.)?([^.]+\\.)?([^.]+\\.)?([^.]+\\.)$',
                             '\\1\\9\\8\\7\\6\\5\\4\\3\\2'),
              el_to_path) AS url
FROM externallinks

This seems to work (thanks Cryptic!), but 1. is not very discoverable, 2. looks a bit hacky, and 3. might have introduced some performance disadvantages compared to the simpler "SELECT el_to" that was possible with the old version of the table. 

I know that breaking changes are sometimes necessary and trust your judgement of the technical tradeoffs (the storage savings look impressive). But it's a bit surprising that the announcement touted various user-facing advantages without including advice about how to best mitigate this quite evident downside of the new format. A good place to do that now might be https://www.mediawiki.org/wiki/Manual:Externallinks_table . (I see that this page has already received some updates, largely by volunteers, in recent months *after* the breaking change was deployed. I would encourage WMF to be more proactive with this in the future. Mailing list announcements like this are great, but not a substitute for permanent documentation.)

Regards, Tilman

On Wed, Aug 9, 2023 at 3:02 AM Amir Sarabadani <asarabadani@wikimedia.org> wrote:
Hi,
The data has been migrated everywhere for a while now. We are stopping to write to the old columns (el_to, el_index, and el_index_60) and have already dropped it in several places.

If you don't update your tools and queries, it'll break in the next couple of weeks.

Best

Am Mi., 7. Juni 2023 um 16:38 Uhr schrieb Amir Sarabadani <asarabadani@wikimedia.org>:

Hello,

We have communicated this change in August 2022 but here is a reminder that if you query externallinks table in wikireplicas, you will need to rework your queries.


Changes are: el_to, el_index and el_index_60 fields will be dropped and you need to query el_to_domain_index and el_to_path fields instead.


The data has been migrated in all wikis except English Wikipedia, Wikimedia Commons and Wikidata, you can keep track of the data migration in T326314.


In late June, we will stop updating the old fields and will start dropping them 


This is done to optimize storage of external links (drastically reducing its size by removing duplication) and enabling user requested features such as resolving two fifteen-year-old tickets or easier aggregating of external links domain [1], something that wasn’t possible until now.


[1] For example, If you want to get list of the top-linked websites, you can query `SELECT el_to_domain_index, count(*) from externallinks group by el_to_domain_index order by count(*) desc limit 50;`


Thank you and sorry for the inconvenience.

--
Amir Sarabadani (he/him)
Staff Database Architect
Wikimedia Foundation
_______________________________________________
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/