In the enterprise where I work as a Data Architect, we approach scaling in many ways, but there's no question that the age-old technique of SORTING is what lines data up so that systems and CPUs can ingest and pipeline it massively across IO boundaries.  Sometimes this involves more indices, lots of duplicated data, and scatter/gather techniques.  WHAT to sort and HOW to sort varies widely with the queries a system is expected to serve quickly.  So sorted indices of many kinds (with the data duplicated across them) are necessary to achieve extremely fast IO for a database as broad as Wikidata.
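
To make the sorting point concrete, here is a minimal sketch in plain Python (the runs and their values are made up, nothing Wikidata-specific): once the inputs are already sorted, combining them is a single sequential pass over each one, which is exactly the kind of predictable forward scan that disks and CPUs chew through quickly.

```python
import heapq

# Two already-sorted runs, standing in for sorted index segments on disk.
# The keys and payloads here are purely illustrative.
run_a = [(1, "red"), (5, "green"), (9, "blue")]
run_b = [(2, "left"), (5, "right"), (7, "up")]

# heapq.merge lazily interleaves any number of sorted inputs into one sorted
# stream -- the same pattern an external merge sort or an LSM compaction uses,
# reading each input strictly front to back.
merged = list(heapq.merge(run_a, run_b, key=lambda kv: kv[0]))
print(merged)
# [(1, 'red'), (2, 'left'), (5, 'green'), (5, 'right'), (7, 'up'), (9, 'blue')]
```

Because the merge never looks backwards, it pipelines: each input can be streamed straight off storage and out the other side without random seeks.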

Scaling problems can be categorized in a few buckets:
1. Increase in data queries. (READ)
2. Increase in data writes.  (WRITE)
3. There is no 3, because all scale problems boil down to IO (READ/WRITE) and how you approach making it fast.

Google is known to replicate data at different levels of abstraction (metaschema, indices, meta-relations) across entire regions of the world in order to achieve fast IO.  With a nearly unlimited budget they can certainly MOVE THINGS FAST, and they can afford to be both extremely wasteful and extremely smart with their data replication techniques.

IBM approaches the scale problem via polymorphic stores that support multiple indices and DB structures, both in-memory and graph-like.  Essentially, they duplicate the hell out of the data in many, many ways, trading space and memory for extremely high query performance.
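
My reading of that "duplicate it many ways" idea, sketched in plain Python (the triples are hypothetical, and this is not IBM's or Wikidata's actual engine): keep several sorted copies of the same data, one permutation per query shape, in the same spirit as the SPO/POS/OSP indices of an RDF triple store, so every query finds an index whose leading field matches it.

```python
from bisect import bisect_left, bisect_right

# Hypothetical (subject, property, object) statements.
triples = [
    ("Q42", "P31", "Q5"),
    ("Q42", "P106", "Q36180"),
    ("Q64", "P31", "Q515"),
]

# Three copies of the same data, each sorted by a different leading field.
spo = sorted(triples)                                   # subject-first
pos = sorted((p, o, s) for s, p, o in triples)          # property-first
osp = sorted((o, s, p) for s, p, o in triples)          # object-first

def scan(index, leading):
    """Return the contiguous block of rows whose first field equals `leading`.
    Binary search on a sorted list (Python 3.10+ for bisect's key= argument)."""
    lo = bisect_left(index, leading, key=lambda row: row[0])
    hi = bisect_right(index, leading, key=lambda row: row[0])
    return index[lo:hi]

print(scan(spo, "Q42"))   # everything said about Q42
print(scan(pos, "P31"))   # every use of property P31
print(scan(osp, "Q5"))    # everything pointing at Q5
```

Triple the storage, but every common query shape becomes a binary search plus a short sequential scan instead of a pass over the whole dataset.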

Juan Sequeda (now at data.world) and his team at Capsenta also seem to use polymorphic storage to bridge SPARQL and relational DBs.  I'm unsure of the actual architecture, though, and would love to hear more about it; I've followed Juan for some time.

It is unfortunate that Wikidata doesn't have the hardware resources to duplicate and sort its data in myriad ways to achieve better scale.  On the software side, we all know what the various stacks are capable of, but we often don't have the "time" or the "hardware" to truly flex the "software" muscles needed for fast IO.