Hi!
Yes, sharding is what you need, I think, rather than replication: instead of copying the whole dataset to every server, the data is partitioned into more manageable chunks that are spread across servers.
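To make that concrete, here is a rough sketch (in Python, with an invented shard count and routing policy, not a proposal for any particular store) of the usual hash-partitioning idea: route each triple by its subject, so all statements about one entity land on the same shard.

    import hashlib

    NUM_SHARDS = 16  # invented number, just for the example

    def shard_for(subject_iri):
        # Hash the subject IRI and take it modulo the shard count, so every
        # statement about a given entity lands on the same shard.
        digest = hashlib.md5(subject_iri.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    # All triples about Q42 end up on one shard:
    print(shard_for("http://www.wikidata.org/entity/Q42"))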
Agreed, if we are to get any solution that is not constrained by the hardware limits of a single server, we cannot avoid looking at sharding.
Here is a good explanation of it:
http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF
Thanks, very interesting article. I'd certainly like to know how this works with a database on the order of 10 bln. triples and queries both reading and updating random subsets of them. Updates are not covered very thoroughly there - I suspect this is because many databases of that size do not have as active a (non-append) update workload as we do. Maybe they still manage to solve it; if so, I'd very much like to know how.
Just a note here: Virtuoso is also a full RDBMS, so you could probably keep the wikibase db in the same cluster and fix the asynchronicity. That is
Given how the original data is stored (a JSON blob inside a MySQL table), it would not be very useful. In general, the graph data model and the Wikitext data model on top of which Wikidata is built are very, very different, and expecting the same storage to serve both - at least without very major and deep refactoring of the code on both sides - is not currently realistic. And of course moving any of the wiki production databases to Virtuoso would be a non-starter. Given that the original Wikidata database stays on MySQL - which I think is a reasonable assumption - there would need to be a data migration pipeline for data to come from MySQL to whatever the WDQS NG storage ends up being.
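Purely as a sketch of what such a pipeline could look like (the table name, column names and entity_to_triples below are all invented for illustration; the real MediaWiki storage layer and the Wikibase JSON-to-RDF mapping are considerably more involved):

    import json

    def export_entities(conn, load_triples):
        # Stream entity JSON blobs out of MySQL, convert each one to triples,
        # and hand them to whatever bulk loader the NG storage ends up having.
        with conn.cursor() as cur:
            # 'wb_entity_json' is an invented staging table, not the real
            # MediaWiki schema (where the blob sits behind the text/content
            # storage layer).
            cur.execute("SELECT entity_id, json_blob FROM wb_entity_json")
            for entity_id, blob in cur:
                entity = json.loads(blob)
                load_triples(entity_id, entity_to_triples(entity))

    def entity_to_triples(entity):
        # Stand-in for the real (much richer) Wikibase JSON -> RDF mapping.
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                yield (entity["id"], prop, statement)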
also true for any mappers like Sparqlify: http://aksw.org/Projects/Sparqlify.html However, these only shift the problem: you then need a sharded/repartitioned relational database...
Yes, relational-RDF bridges are known, but in my experience they are usually not very performant (the difference between "you can do it" and "you can do it fast" is sometimes very significant), and in our case it would be useless anyway, as the Wikidata data is not really stored in a relational database per se - it's stored as a JSON blob opaquely saved in a relational database structure that knows nothing about Wikidata. Yes, this is not the ideal structure for optimal performance of Wikidata itself, but I do not foresee it changing, at least in the short term. Again, we could of course have a data export pipeline to whatever storage format we want - essentially we already have one - but the concept of a single data store is probably not realistic within any foreseeable timeframe. We use a separate data store for search (ElasticSearch) and will probably have to have a separate one for queries, whatever the mechanism turns out to be.
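Just to illustrate the "opaque blob" point with a toy example (schema and JSON shape heavily simplified, nothing like the real MediaWiki or Wikibase layout):

    import json

    # What the relational layer (and hence an RDB-to-RDF mapper) actually sees:
    row = {
        "page_id": 12345,
        "content": '{"id": "Q42", "claims": {"P31": ["Q5"]}}',  # opaque text blob
    }

    # The graph structure only becomes visible after parsing the blob in
    # application code, which is exactly what a generic mapper cannot do:
    entity = json.loads(row["content"])
    print(entity["claims"]["P31"])  # ['Q5']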
Thanks,