In the enterprise where I work as a Data Architect, we approach scaling in
many ways, but there's no question that the age-old technique of SORTING
lines everything up so that systems and CPUs can ingest it in bulk and
pipeline it across IO boundaries. Sometimes this involves more indices, lots
of duplicated data, and scatter/gather techniques. Knowing WHAT to sort and
HOW to sort depends heavily on which queries the system is expected to
serve quickly. So sorted indices of many kinds (where data is duplicated)
are necessary to achieve extremely fast IO for a broad database such as
Wikidata.
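To make the "sorted indices with duplicated data" idea concrete, here is a
minimal sketch in Python. It assumes data shaped like (subject, predicate,
object) triples and stores the same triples twice, in two different sort
orders, so that two different query shapes each become a contiguous scan.
The sample triples and the names build_index, prefix_scan, SPO and POS are
made up for illustration; this is not how Wikidata or any particular store
is actually implemented.

```python
from bisect import bisect_left

# Hypothetical sample triples (subject, predicate, object), for illustration only.
TRIPLES = [
    ("Q42", "P31", "Q5"),
    ("Q42", "P106", "Q36180"),
    ("Q1", "P31", "Q36906466"),
    ("Q937", "P31", "Q5"),
]


def build_index(triples, order):
    """Return a sorted copy of the triples with fields permuted by `order`."""
    return sorted(tuple(t[i] for i in order) for t in triples)


# The same data is duplicated once per sort order.
SPO = build_index(TRIPLES, (0, 1, 2))  # good for "everything about subject X"
POS = build_index(TRIPLES, (1, 2, 0))  # good for "who has predicate P = object O"


def prefix_scan(index, prefix):
    """Return all entries of a sorted index whose leading fields equal `prefix`."""
    start = bisect_left(index, prefix)  # binary search to the first candidate
    matches = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break  # sorted order guarantees no later entry can match
        matches.append(entry)
    return matches


# Both queries touch one contiguous run of their index.
about_q42 = prefix_scan(SPO, ("Q42",))
instances_of_q5 = prefix_scan(POS, ("P31", "Q5"))
```

The cost is visible right in the sketch: every extra query shape you want to
be fast means another full sorted copy of the data, and every write has to
land in all of them.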
Scaling problems fall into a few buckets:
1. An increase in data queries. (READ)
2. An increase in data writes. (WRITE)
3. There is no 3, because all scale problems boil down to IO (READ/WRITE)
and how you approach fast IO.
Google is known to replicate data at different levels of abstraction
(metaschema, indices, meta-relations) across entire regions of the world in
order to achieve fast IO. With a nearly unlimited budget they can certainly
MOVE THINGS FAST, and can afford to be both extremely wasteful and smart
with their data replication techniques.
IBM approaches the scale problem via polymorphic stores that support
multiple indices and DB structures, both in-memory and graph-like.
Essentially, they duplicate the hell out of the data in many, many ways,
spending space and memory to get extremely high query performance.
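As a rough illustration of what "polymorphic" can mean here, the toy sketch
below fans a single write out to three structures: a hash map for lookups by
id, a sorted list for ordered scans by name, and an adjacency map for
graph-like traversal. The class PolyStore and its fields are hypothetical
names of my own, and this is an assumption about the general idea, not a
description of IBM's actual architecture. It also ignores updates and
deletes entirely.

```python
from bisect import insort
from collections import defaultdict


class PolyStore:
    """Toy store that keeps the same records in three different shapes."""

    def __init__(self):
        self.by_id = {}                # hash lookup: id -> record
        self.by_name = []              # sorted (name, id) pairs for ordered scans
        self.edges = defaultdict(set)  # graph-like adjacency: id -> linked ids

    def put(self, rec_id, name, links=()):
        """One logical write is applied to every structure (write amplification)."""
        self.by_id[rec_id] = {"name": name, "links": tuple(links)}
        insort(self.by_name, (name, rec_id))  # keep the name index sorted
        for target in links:
            # Store the link in both directions so traversal is cheap either way.
            self.edges[rec_id].add(target)
            self.edges[target].add(rec_id)


store = PolyStore()
store.put("a", "Zed", links=["b"])
store.put("b", "Amy")
```

Each read path then picks whichever structure suits it, which is where the
speed comes from, and each write pays for all three, which is where the
space and memory go.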
https://queue.acm.org/detail.cfm?id=3332266
Juan Sequeda (now at data.world) and his team at Capsenta also seem to use
polymorphic storage to bridge SPARQL and relational DBs. I'm unsure of the
actual architecture, though, and would love to hear more about it. I've
followed Juan for some time.
https://www.zdnet.com/article/data-world-joins-forces-with-capsenta-to-brin…
It is unfortunate that Wikidata doesn't have the hardware resources to
duplicate and sort data in myriad ways to achieve better scale. On the
software side, we all know what the various stacks are capable of, but we
often don't have the "time" or "hardware" to truly flex the "software"
stack muscles to allow fast IO.
Thad
https://www.linkedin.com/in/thadguidry/