Re: [Wikidata] Scaling Wikidata Query Service

10 Jun 2019


Hi Guillaume,
On 06.06.19 21:32, Guillaume Lederrey wrote:
...
Hello all!
There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:
In an ideal world, WDQS should:

scale in terms of data size
scale in terms of number of edits
have low update latency
expose a SPARQL endpoint for queries
allow anyone to run any queries on the public WDQS endpoint
provide great query performance
provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.
I am not sure how to evaluate this correctly. Scaling databases in 
general is a "known hard problem", and graph databases are a sub-field 
of it: they are optimized for graph-like queries, as opposed to column 
stores or relational databases. If you say that "throwing hardware at 
the problem" does not help, you are admitting that Blazegraph does not 
scale for what Wikidata needs.
From [1]:
...
At the moment, each WDQS cluster is a group of independent servers, 
sharing nothing, with each server independently updated and each 
server holding a full data set.
Then it is not a "cluster" in the database sense. It is more of a 
redundancy architecture, like RAID 1. Is this really how Blazegraph does 
it? Don't they have a proper cluster solution that repartitions data 
across servers? Or are these independent servers a Wikimedia staff 
home-built setup?
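To make the distinction concrete, here is a minimal sketch of the two layouts. It is not how Blazegraph or WDQS implement anything: the function names, the hash-by-subject scheme, and the sample triples are all assumptions chosen for illustration.

```python
import hashlib

def replicate(triples, n_servers):
    """Shared-nothing replication: every server holds the full data set."""
    return [list(triples) for _ in range(n_servers)]

def partition_by_subject(triples, n_servers):
    """Hash partitioning: each triple lives on exactly one server,
    chosen from a hash of its subject (hypothetical scheme)."""
    shards = [[] for _ in range(n_servers)]
    for s, p, o in triples:
        digest = hashlib.sha1(s.encode("utf-8")).digest()
        shard_index = int.from_bytes(digest[:4], "big") % n_servers
        shards[shard_index].append((s, p, o))
    return shards

# Illustrative triples only, written as (subject, predicate, object).
triples = [
    ("Q42", "P31", "Q5"),
    ("Q42", "P106", "Q36180"),
    ("Q64", "P31", "Q515"),
    ("Q64", "P17", "Q183"),
]

replicas = replicate(triples, 3)          # 3 servers, 3 full copies
shards = partition_by_subject(triples, 3)  # 3 servers, 1 copy in total
```

With replication, storage per server grows with the whole data set and adding servers only adds query capacity. With partitioning, storage per server shrinks as servers are added, at the cost of queries that may have to touch several shards.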
Some info here:
- We evaluated some stores according to their performance: 
http://www.semantic-web-journal.net/content/evaluation-metadata-representati... 
"Evaluation of Metadata Representations in RDF stores"
- Virtuoso has proven quite useful. I don't want to advertise here, but 
the setup they run for DBpedia uses ridiculously modest hardware, i.e. 
64GB RAM, and it is the open-source version, not the professional one 
with clustering and repartitioning capability. We have been playing this 
game for ten years now: everybody tries other databases, but then most 
people come back to Virtuoso. I have to admit that OpenLink maintains 
the hosting for DBpedia themselves, so they know how to optimise. Their 
usual customers are large banks with millions of write transactions 
per hour. In LOD2 they also implemented column store features with 
MonetDB and repartitioning in clusters.
- I recently heard a presentation from ArangoDB and they had a good 
cluster concept as well, although I don't know anybody who has tried it. 
The slides seemed to make sense.
All the best,
Sebastian
...
Realistically, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non-negotiable: if we can't
keep up with Wikidata in terms of data size or number of edits, it does
not make sense to address query performance. On some constraints, we
will probably need to compromise.
For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.
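As a rough illustration of why an asynchronous updater lags by nature, here is a minimal sketch. It is not the actual WDQS updater: the `AsyncUpdater` class, its method names, and the definition of lag as the age of the oldest pending edit are assumptions made for this example.

```python
from collections import deque

class AsyncUpdater:
    """Toy model: edits are queued as they happen and applied later in order."""

    def __init__(self):
        self.pending = deque()  # (edit_time, edit) pairs, oldest first

    def enqueue(self, edit_time, edit):
        """Record an edit made at edit_time; nothing is applied yet."""
        self.pending.append((edit_time, edit))

    def apply_batch(self, max_edits):
        """Apply up to max_edits pending edits in order; no edit is dropped."""
        applied = []
        while self.pending and len(applied) < max_edits:
            _, edit = self.pending.popleft()
            applied.append(edit)
        return applied

    def lag(self, now):
        """Age of the oldest edit that has not been applied yet (0 if none)."""
        if not self.pending:
            return 0.0
        return now - self.pending[0][0]

updater = AsyncUpdater()
for t in (0.0, 10.0, 20.0):
    updater.enqueue(t, f"edit@{t}")

first = updater.apply_batch(max_edits=1)
lag_after_first = updater.lag(now=60.0)   # oldest pending edit is from t=10.0
rest = updater.apply_batch(max_edits=10)
lag_after_all = updater.lag(now=60.0)     # queue is empty
```

Every edit is eventually ingested in order, but whenever edits arrive faster than batches are applied the lag grows, which is the trade-off between correctness and update latency described above.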
We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We are planning to hire an
additional engineer to help us support the service in the long term.
You can follow our current work in Phabricator [2].
If anyone has experience with scaling large graph databases, please
reach out to us, we're always happy to share ideas!
Thanks all for your patience!
Guillaume


[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2] https://phabricator.wikimedia.org/project/view/1239/
-- 
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
