Wiki replicas: Compilation of input and logged cross-DB queries - Cloud

27 Apr 2021

Hi!

Like I mentioned in the past, I've been cataloging input from developers
and doing some analysis from the random query logger that Brooke coded in
the old cluster to get a better idea of the kinds of cross DB queries being
performed.

Wiki Replicas Cross-DB Query Data
<https://wikitech.wikimedia.org/wiki/News/Wikireplicas_2020_Redesign/Wiki_Replicas_Cross-DB_Query_Data>

The input and random query logger analysis showcase the tools and which
wiki DBs are being used together, but are not perfect, so please if you
know of high impact tools that are going to suffer the breaking changes let
me know so that I can include them.

The most prominent cases that this data shows are

   - Querying Commons and another DB (as highlighted in many phab tasks and
   conversations, no surprise)
   - Querying Wikidata and other DBs (commons, and wikipedias, for example)
   - Querying En Wikipedia and other DBs (wikipedias)

Particularly noticeable is the appearance of arwiki in the analysis. It is
hard to know if there is some bias in the random sampling or tools for this
project do use the replicas a lot for their features. Something to look
into.

Detailed analysis of the tables and fields joined / accessed on will need
to be performed on an individual basis, since the SQL queries can be very
complex and do things in different ways, so automated detection can be
faulty. For example on manual inspection I've seen queries using subqueries
on a different DB with `in`, instead of joins.

This is the reason why the report looks at the cross DB queries and not
cross JOIN queries, to accurately look at all cross DB queries, even if
there can be some false positives which can be worked around in the new
architecture (I think the tools.guc queries for example).

I have created T280152 Mitigate breaking changes from the new Wiki Replicas
architecture <https://phabricator.wikimedia.org/T280152> which lists the
links and can be used to hang tasks for mitigation work if needed, and sits
between the OLAP and the new Wiki Replicas task.

My hope is this information can be used both by developers and WMF teams to
figure out ways to mitigate the breaking changes for the tools that editors
and other users rely on. If you think the published CSV is not approachable
enough and you think certain information on the wiki page would be useful
to have, please let me know and I'll process the data and post the results
in the page for ease of reading.

For example, I was wondering if I should publish the "unique
normalized/stripped" queries per tool to be able to look at the SQL on that
wiki page itself. Thoughts?
-- 
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation