Hi!
As I mentioned previously, I've been cataloging input from developers and analyzing data from the random query logger that Brooke coded for the old cluster, to get a better idea of the kinds of cross-DB queries being performed.
The developer input and the random query logger analysis show which tools and which wiki DBs are being used together, but they are not exhaustive. If you know of high-impact tools that will suffer from the breaking changes, please let me know so that I can include them.
The most prominent cases this data shows are (with an illustrative query sketch after the list):
- Querying Commons and another DB (as highlighted in many Phabricator tasks and conversations, no surprise)
- Querying Wikidata and other DBs (Commons and Wikipedias, for example)
- Querying English Wikipedia and other DBs (other Wikipedias)
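To make the first case concrete, here is a rough sketch of the kind of cross-database query that will break once the wiki databases are served from separate hosts. The table and column names are simplified for illustration and are not taken from the logged queries.

```sql
-- Illustration only: a cross-database JOIN between Commons and English
-- Wikipedia, which stops working when commonswiki_p and enwiki_p are no
-- longer reachable through the same connection.
SELECT img.img_name
FROM commonswiki_p.image AS img
JOIN enwiki_p.imagelinks AS il
  ON il.il_to = img.img_name
WHERE img.img_timestamp > '20200101000000';
```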
Particularly noticeable is the appearance of arwiki in the analysis. It is hard to know whether there is some bias in the random sampling or whether tools for this project really do use the replicas heavily for their features. Something to look into.
Detailed analysis of the tables and fields that are joined or accessed will need to be done on a case-by-case basis, since the SQL queries can be very complex and do the same thing in different ways, so automated detection can be faulty. For example, on manual inspection I've seen queries that use subqueries on a different DB with `in` instead of joins (sketched below).
This is why the report looks at cross-DB queries rather than only cross-DB JOINs: it captures all cross-DB access, even if there can be some false positives that can be worked around in the new architecture (the tools.guc queries, for example, I think).
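As a sketch of that `in` pattern (again with simplified, illustrative names), the same cross-database access can be written without a JOIN, which is why a JOIN-only detector would undercount:

```sql
-- Illustration only: cross-database access expressed as an IN subquery
-- instead of a JOIN; it still touches two databases and will break the
-- same way, but would be missed if we only searched for JOINs.
SELECT il.il_from
FROM enwiki_p.imagelinks AS il
WHERE il.il_to IN (
  SELECT img.img_name
  FROM commonswiki_p.image AS img
  WHERE img.img_timestamp > '20200101000000'
);
```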
I have created T280152, "Mitigate breaking changes from the new Wiki Replicas architecture", which lists the relevant links and can be used to hang mitigation tasks from if needed; it sits between the OLAP task and the new Wiki Replicas task.
My hope is that this information can be used by both developers and WMF teams to figure out ways to mitigate the breaking changes for the tools that editors and other users rely on. If you think the published CSV is not approachable enough and certain information would be useful to have on the wiki page, please let me know and I'll process the data and post the results on the page for easier reading.
For example, I was wondering whether I should publish the "unique normalized/stripped" queries per tool so that the SQL can be reviewed on the wiki page itself. Thoughts?
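To give an idea of what I mean by normalized/stripped, here is a made-up example (not from the logs): literal values are replaced with placeholders so that queries differing only in their parameters collapse into a single entry.

```sql
-- Made-up example of normalization/stripping.
-- Raw query as logged:
SELECT page_title FROM enwiki_p.page WHERE page_id = 12345;
-- Normalized form that would be published:
SELECT page_title FROM enwiki_p.page WHERE page_id = ?;
```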
--
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation