Hi, Joaquin!
I was looking at the page the other day. I agree with your summary above,
except I would add one more prominent case which somehow did not appear in
the random query analysis but which I think is critical:
- Querying CentralAuth and another DB
An example is this query <https://quarry.wmflabs.org/query/53685>.
In fact, I would argue that we should treat the Commons, Wikidata, and
CentralAuth databases as "shared" databases that need to be joined with
local DBs for many use cases, and treat them as one group. The "En WP +
other DB" use cases are a second group in my opinion, and may not call for
all En WP data to be available elsewhere; only certain tables might be
sufficient.
Thanks,
Huji
On Tue, Apr 27, 2021 at 2:34 PM Joaquin Oltra Hernandez <
jhernandez(a)wikimedia.org> wrote:
Hi!
As I mentioned in the past, I've been cataloging input from developers
and doing some analysis of the random query logger that Brooke coded on
the old cluster, to get a better idea of the kinds of cross-DB queries
being performed.
Wiki Replicas Cross-DB Query Data
<https://wikitech.wikimedia.org/wiki/News/Wikireplicas_2020_Redesign/Wiki_Replicas_Cross-DB_Query_Data>
The input and the random query logger analysis show which tools and which
wiki DBs are being used together, but they are not exhaustive, so if you
know of high-impact tools that will be hit by the breaking changes, please
let me know so that I can include them.
The most prominent cases that this data shows are
- Querying Commons and another DB (as highlighted in many phab tasks
and conversations, no surprise)
- Querying Wikidata and other DBs (commons, and wikipedias, for
example)
- Querying En Wikipedia and other DBs (wikipedias)
Particularly noticeable is the appearance of arwiki in the analysis. It is
hard to know whether this reflects some bias in the random sampling or
whether tools for this project really do use the replicas heavily for their
features. Something to look into.
Detailed analysis of the tables and fields joined or accessed will need
to be performed on an individual basis, since the SQL queries can be very
complex and do things in different ways, so automated detection can be
faulty. For example, on manual inspection I've seen queries that reference
a different DB through an `in` subquery instead of a join.
This is why the report looks at cross-DB queries rather than only cross-DB
JOINs: it captures all cross-DB access, even though there can be some false
positives, which can be worked around in the new architecture (the
tools.guc queries, for example, I think).
I have created T280152 "Mitigate breaking changes from the new Wiki
Replicas architecture" <https://phabricator.wikimedia.org/T280152>, which
lists the relevant links, can be used to hang mitigation tasks off of if
needed, and sits between the OLAP task and the new Wiki Replicas task.
My hope is that this information can be used both by developers and WMF
teams to figure out ways to mitigate the breaking changes for the tools
that editors and other users rely on. If the published CSV is not
approachable enough and certain information would be more useful on the
wiki page itself, please let me know and I'll process the data and post
the results on the page for ease of reading.
For example, I was wondering if I should publish the "unique
normalized/stripped" queries per tool, so that the SQL can be reviewed on
that wiki page itself. Thoughts?
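In case it helps the discussion, one possible way to produce such "normalized/stripped" queries (a sketch under my own assumptions; the report's actual normalization may differ) is to replace literals with placeholders and collapse whitespace, so that structurally identical queries dedupe to a single entry:

```python
import re

def normalize_query(sql: str) -> str:
    """Sketch of query normalization: strip literal values and extra
    whitespace so structurally identical queries compare equal. This is
    an illustrative assumption, not the report's exact method."""
    q = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)  # string literals -> ?
    q = re.sub(r"\b\d+\b", "?", q)              # numeric literals -> ?
    q = re.sub(r"\s+", " ", q).strip()          # collapse whitespace
    return q.lower()

queries = [
    "SELECT * FROM page WHERE page_id = 42",
    "select *  from page where page_id = 7",
]
unique = {normalize_query(q) for q in queries}
print(unique)  # both collapse to one normalized form
```

Deduplicating on the normalized form would keep the published list short while still showing each distinct query shape per tool.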
--
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud