So I think there is something here. Different people have different needs; so far the number one need for wikireplicas has come from those who need underlying access to an "almost real time" copy of the internal database structure, as is. This is based on the fact that latency is the most common complaint regarding wikireplicas.
The issue is that there are 4 properties we can play with: #1 having a complete dataset; #2 having data updated as soon as production; #3 continuing to use the same API and SQL syntax for backwards compatibility; #4 being able to query everything at the same time (data lake).
With the current technology used by wikireplicas, and the growth experienced in the last years, one has to sacrifice one of the above. Since 2013, the popularity of Wikidata and Commons has exploded, in addition to them getting more features and more data per edit. The natural decision is to keep #1, #2 and #3 and sacrifice #4, especially because it will also reduce latency as an unintended consequence.
That doesn't mean that #4 is impossible, but it would need at least one (probably more than one) of the following: a) being precise about what subset of the data needs to be consolidated (e.g. only some tables exposed); b) loading static dumps that are not updated in real time (e.g. only once a month); c) moving away from MySQL/InnoDB to an OLAP engine, like column-based storage or something more analytic-y.
Keeping the current technology is the easiest path to achieve #1, #2, and #3 in the short term, but the data size and load make #4 impossible: it no longer "fits" on a single db with MySQL/MariaDB. But I think if someone had a concrete proposal and provided feedback to achieve #4 on a separate service, people would listen. For example, I have thought about proposing setting up an analytics engine loaded every week or every month from backups with a subset of the data, but it would need people providing feedback on what data would be useful to expose (e.g. the previous email about fawiki and enwiki image usage).
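To make the "subset of the data, loaded periodically" idea a bit more concrete, here is a rough sketch of what such a loader could look like. Everything in it is an assumption for illustration only: in-memory SQLite stands in for both the per-wiki sources and the analytics engine (a real service would load from backups into an OLAP store), the EXPOSED mapping and the helper names are hypothetical, and imagelinks with il_from/il_to is just an example of a table people might ask for.

```python
import sqlite3

# Hypothetical subset to expose: table name -> columns to copy.
EXPOSED = {"imagelinks": ("il_from", "il_to")}


def make_wiki(rows):
    """Stand-in for one wiki's database (in-memory SQLite, for the demo only)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE imagelinks (il_from INTEGER, il_to TEXT)")
    db.executemany("INSERT INTO imagelinks VALUES (?, ?)", rows)
    return db


def consolidate(sources, target):
    """Rebuild each exposed table in `target`, tagging every row with its wiki."""
    for table, cols in EXPOSED.items():
        col_list = ", ".join(cols)
        # Full reload each run: this models a weekly/monthly batch, not replication.
        target.execute(f"DROP TABLE IF EXISTS {table}")
        target.execute(f"CREATE TABLE {table} (wiki TEXT, {col_list})")
        placeholders = ", ".join("?" for _ in range(len(cols) + 1))
        for wiki, db in sources.items():
            rows = db.execute(f"SELECT {col_list} FROM {table}")
            target.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})",
                ((wiki, *row) for row in rows),
            )
    target.commit()


def usage_per_wiki(target, image):
    """Cross-wiki query that needs all wikis in one place (property #4)."""
    return target.execute(
        "SELECT wiki, COUNT(*) FROM imagelinks "
        "WHERE il_to = ? GROUP BY wiki ORDER BY wiki",
        (image,),
    ).fetchall()


def build_demo():
    """Build two toy wikis and consolidate them into one target database."""
    sources = {
        "enwiki": make_wiki([(1, "Cat.jpg"), (2, "Dog.jpg")]),
        "fawiki": make_wiki([(7, "Cat.jpg")]),
    }
    target = sqlite3.connect(":memory:")
    consolidate(sources, target)
    return target
```

With the toy data above, usage_per_wiki(build_demo(), "Cat.jpg") returns one row each for enwiki and fawiki. The point of the sketch is the trade-off, not the engine: the consolidated copy is only as fresh as the last load, and it only contains what was explicitly chosen for exposure.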
I propose opening a ticket on Phabricator to discuss architecture and technical solutions, if you think that would be productive, so that more people (and not just me) can express interest in moving it forward.
PS: The federated approach of the old tools db didn't work well back then, and won't work well now, especially with such large tables, and it has big security implications.