Hi!
Yes, sharding is what you need, I think, rather than replication: instead of copying the whole dataset to every server, the data is partitioned into more manageable chunks that are spread across servers.
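To make that concrete, here is a rough sketch (in Python, with an invented shard count and routing policy, not a proposal for any particular store) of the usual hash-partitioning idea: route each triple by its subject, so all statements about one entity land on the same shard.

    import hashlib

    NUM_SHARDS = 16  # invented number, just for the example

    def shard_for(subject_iri):
        # Hash the subject IRI and take it modulo the shard count, so every
        # statement about a given entity lands on the same shard.
        digest = hashlib.md5(subject_iri.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    # All triples about Q42 end up on one shard:
    print(shard_for("http://www.wikidata.org/entity/Q42"))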
Agreed, if we are to get any solution that is not constrained by the hardware limits of a single server, we cannot avoid looking at sharding.
Here is a good explanation of it:
http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF
Thanks, very interesting article. I'd certainly like to know how this works with a database on the order of 10 bln. triples and queries both reading and updating random subsets of them. Updates are not covered very thoroughly there - I suspect this is because many databases of that size do not have as active a (non-append) update workload as we do. Maybe they still manage to solve it; if so, I'd very much like to know how.
Just a note here: Virtuoso is also a full RDBMS, so you could probably keep the wikibase db in the same cluster and fix the asynchronicity. That is
Given how the original data is stored (a JSON blob inside a MySQL table), it would not be very useful. In general, the graph data model and the Wikitext data model on top of which Wikidata is built are very, very different, and expecting the same storage to serve both - at least without very major and deep refactoring of the code on both sides - is not currently realistic. And of course moving any of the wiki production databases to Virtuoso would be a non-starter. Given that the original Wikidata database stays on MySQL - which I think is a reasonable assumption - there would need to be a data migration pipeline for data to come from MySQL to whatever the WDQS NG storage ends up being.
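Purely as a sketch of what such a pipeline could look like (the table name, column names and entity_to_triples below are all invented for illustration; the real MediaWiki storage layer and the Wikibase JSON-to-RDF mapping are considerably more involved):

    import json

    def export_entities(conn, load_triples):
        # Stream entity JSON blobs out of MySQL, convert each one to triples,
        # and hand them to whatever bulk loader the NG storage ends up having.
        with conn.cursor() as cur:
            # 'wb_entity_json' is an invented staging table, not the real
            # MediaWiki schema (where the blob sits behind the text/content
            # storage layer).
            cur.execute("SELECT entity_id, json_blob FROM wb_entity_json")
            for entity_id, blob in cur:
                entity = json.loads(blob)
                load_triples(entity_id, entity_to_triples(entity))

    def entity_to_triples(entity):
        # Stand-in for the real (much richer) Wikibase JSON -> RDF mapping.
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                yield (entity["id"], prop, statement)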
also true for any mappers like Sparqlify: http://aksw.org/Projects/Sparqlify.html However, these only shift the problem: you then need a sharded/repartitioned relational database...
Yes, relational-RDF bridges are known, but in my experience they are usually not very performant (the difference between "you can do it" and "you can do it fast" is sometimes very significant), and in our case it would be useless anyway, as the Wikidata data is not really stored in a relational database per se - it's stored as a JSON blob opaquely saved in a relational database structure that knows nothing about Wikidata. Yes, this is not the ideal structure for optimal performance of Wikidata itself, but I do not foresee it changing, at least in the short term. Again, we could of course have a data export pipeline to whatever storage format we want - essentially we already have one - but the concept of a single data store is probably not realistic within any foreseeable timeframe. We use a separate data store for search (ElasticSearch) and will probably have to have a separate one for queries, whatever the mechanism turns out to be.
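Just to illustrate the "opaque blob" point with a toy example (schema and JSON shape heavily simplified, nothing like the real MediaWiki or Wikibase layout):

    import json

    # What the relational layer (and hence an RDB-to-RDF mapper) actually sees:
    row = {
        "page_id": 12345,
        "content": '{"id": "Q42", "claims": {"P31": ["Q5"]}}',  # opaque text blob
    }

    # The graph structure only becomes visible after parsing the blob in
    # application code, which is exactly what a generic mapper cannot do:
    entity = json.loads(row["content"])
    print(entity["claims"]["P31"])  # ['Q5']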
Thanks,