Hi, Joaquin!
I was looking at the page the other day. I agree with your summary above, except I would add one more prominent case which somehow did not appear in the random query analysis but I think is critical:
- Querying CentralAuth and another DB
An example is this query https://quarry.wmflabs.org/query/53685.
In fact, I would argue that we should treat Commons+Wikidata+CentralAuth database as "shared" databases that need to be joined with local DBs for many use cases, and treat them as one group. The "En WP + other DB" use cases are a second group in my opinion, and may not call for all En WP data to be available elsewhere; only certain tables might be sufficient.
Thanks, Huji
On Tue, Apr 27, 2021 at 2:34 PM Joaquin Oltra Hernandez < jhernandez@wikimedia.org> wrote:
Hi!
Like I mentioned in the past, I've been cataloging input from developers and doing some analysis from the random query logger that Brooke coded in the old cluster to get a better idea of the kinds of cross DB queries being performed.
Wiki Replicas Cross-DB Query Data https://wikitech.wikimedia.org/wiki/News/Wikireplicas_2020_Redesign/Wiki_Replicas_Cross-DB_Query_Data
The input and random query logger analysis showcase the tools and which wiki DBs are being used together, but are not perfect, so please if you know of high impact tools that are going to suffer the breaking changes let me know so that I can include them.
The most prominent cases that this data shows are
- Querying Commons and another DB (as highlighted in many phab tasks
and conversations, no surprise)
- Querying Wikidata and other DBs (commons, and wikipedias, for
example)
- Querying En Wikipedia and other DBs (wikipedias)
Particularly noticeable is the appearance of arwiki in the analysis. It is hard to know if there is some bias in the random sampling or tools for this project do use the replicas a lot for their features. Something to look into.
Detailed analysis of the tables and fields joined / accessed on will need to be performed on an individual basis, since the SQL queries can be very complex and do things in different ways, so automated detection can be faulty. For example on manual inspection I've seen queries using subqueries on a different DB with `in`, instead of joins.
This is the reason why the report looks at the cross DB queries and not cross JOIN queries, to accurately look at all cross DB queries, even if there can be some false positives which can be worked around in the new architecture (I think the tools.guc queries for example).
I have created T280152 Mitigate breaking changes from the new Wiki Replicas architecture https://phabricator.wikimedia.org/T280152 which lists the links and can be used to hang tasks for mitigation work if needed, and sits between the OLAP and the new Wiki Replicas task.
My hope is this information can be used both by developers and WMF teams to figure out ways to mitigate the breaking changes for the tools that editors and other users rely on. If you think the published CSV is not approachable enough and you think certain information on the wiki page would be useful to have, please let me know and I'll process the data and post the results in the page for ease of reading.
For example, I was wondering if I should publish the "unique normalized/stripped" queries per tool to be able to look at the SQL on that wiki page itself. Thoughts? -- Joaquin Oltra Hernandez Developer Advocate - Wikimedia Foundation _______________________________________________ Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud