So I think there is something here. Different people have different
needs. So far, the number one need for wikireplicas has come from
those who need underlying access to an "almost real time" copy of the
internal database structure, as is. This is based on the fact that
latency is the most common complaint regarding wikireplicas.
The issue is that there are 4 properties we can play with:
#1 Having a complete dataset
#2 Having data updated as soon as it is in production
#3 Continuing to use the same API and SQL syntax for backwards compatibility
#4 Being able to query everything at the same time (data lake)
With the current technology used by wikireplicas, and the growth
experienced in the last years, one has to sacrifice at least one of
the above. Since 2013, Wikidata and Commons popularity has exploded,
in addition to each edit carrying more features and data. The natural
decision is to keep #1, #2 and #3 and sacrifice #4, especially because
doing so will also reduce latency as a side effect.
That doesn't mean that #4 is impossible, but it would need one or
more of the following:
a) being precise about what subset of the data needs to be
consolidated (e.g. only some tables exposed)
b) loading static dumps that are not updated in real time (e.g. only once a month)
c) moving away from MySQL/InnoDB to an OLAP engine, such as
column-based storage or something more analytic-y
Keeping the current technology is the easiest path to achieve #1, #2,
and #3 short term, but the data size and load make #4 impossible: it
no longer "fits" on a single db with MySQL/MariaDB. But I think if
someone had a concrete proposal and provided feedback to achieve #4 on
a separate service, people would listen. For example, I have thought
about proposing an analytics engine loaded every week or every month
from backups with a subset of the data, but that would need people
providing feedback on what data would be useful to expose (e.g. the
previous email about fawiki and enwiki image usage).
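
To make the "subset loaded from backups" idea a bit more concrete,
here is a rough sketch, not a proposal for the actual implementation.
It uses in-memory SQLite databases as stand-ins for per-wiki backups
and for the consolidated analytics store. The table image_usage, the
helper names, and the sample rows are all made up for illustration,
and the imagelinks schema is heavily simplified.

```python
import sqlite3


def make_source(rows):
    """Hypothetical stand-in for one wiki's backup (simplified schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT)")
    db.executemany("INSERT INTO imagelinks VALUES (?, ?)", rows)
    return db


def load_subset(sources):
    """Copy only the agreed subset from each wiki into one shared store.

    This would run periodically (e.g. weekly or monthly), so the result
    is complete for the subset but not updated in real time.
    """
    lake = sqlite3.connect(":memory:")
    lake.execute(
        "CREATE TABLE image_usage (wiki TEXT, page_id INTEGER, file_name TEXT)"
    )
    for wiki, src in sources.items():
        rows = src.execute("SELECT il_from, il_to FROM imagelinks").fetchall()
        lake.executemany(
            "INSERT INTO image_usage VALUES (?, ?, ?)",
            [(wiki, page_id, file_name) for page_id, file_name in rows],
        )
    lake.commit()
    return lake


def files_used_on_all(lake, wikis):
    """Single cross-wiki query: files used on every one of the given wikis."""
    placeholders = ",".join("?" for _ in wikis)
    query = (
        "SELECT file_name FROM image_usage "
        f"WHERE wiki IN ({placeholders}) "
        "GROUP BY file_name "
        "HAVING COUNT(DISTINCT wiki) = ? "
        "ORDER BY file_name"
    )
    return [row[0] for row in lake.execute(query, (*wikis, len(wikis)))]


# Made-up sample data for two wikis.
sources = {
    "enwiki": make_source([(1, "Cat.jpg"), (2, "Dog.jpg")]),
    "fawiki": make_source([(7, "Cat.jpg"), (8, "Tree.jpg")]),
}
lake = load_subset(sources)
shared = files_used_on_all(lake, ["enwiki", "fawiki"])
print(shared)
```

The point of the sketch is only the shape of the trade-off: the
cross-wiki question becomes one query, at the cost of exposing a
chosen subset and accepting data that is as old as the last load.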
I propose opening a ticket on Phabricator to discuss architecture and
technical solutions, if you think that would be productive, where more
people (and not just me) can express interest in moving it forward.
PS: The federated approach of the old tools db didn't work well back
then, and won't work well now, especially with such large tables, and
it has big security implications.