On 6/10/19 3:49 PM, Guillaume Lederrey wrote:
On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Hi Guillaume,
On 10.06.19 16:54, Guillaume Lederrey wrote:
Hello!
On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Hi Guillaume,
On 06.06.19 21:32, Guillaume Lederrey wrote:
Hello all!
There have been a number of concerns raised about the performance and scaling of Wikidata Query Service. We share those concerns and we are doing our best to address them. Here is some info about what is going on:
In an ideal world, WDQS should:
- scale in terms of data size
- scale in terms of number of edits
- have low update latency
- expose a SPARQL endpoint for queries
- allow anyone to run any queries on the public WDQS endpoint
- provide great query performance
- provide a high level of availability
Scaling graph databases is a "known hard problem", and we are reaching a scale where there are no obvious easy solutions to address all the above constraints. At this point, just "throwing hardware at the problem" is not an option anymore. We need to go deeper into the details and potentially make major changes to the current architecture. Some scaling considerations are discussed in [1]. This is going to take time.
I am not sure how to evaluate this correctly. Scaling databases in general is a "known hard problem", and graph databases are a sub-field of it, optimized for graph-like queries as opposed to column stores or relational databases. If you say that "throwing hardware at the problem" does not help, you are admitting that Blazegraph does not scale to what Wikidata needs.
Yes, I am admitting that Blazegraph (at least in the way we are using it at the moment) does not scale to our future needs. Blazegraph does have support for sharding (what they call "Scale Out"). And yes, we need to have a closer look at how that works. I'm not the expert here, so I won't even try to assert whether that's a viable solution or not.
Yes, sharding is what you need, I think, instead of replication. This is the technique where data is repartitioned into more manageable chunks across servers.
Well, we need sharding for scalability and replication for availability, so we do need both. The hard problem is sharding.
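To make the distinction concrete, here is a minimal sketch (in Python; the shard counts and the subject-hash placement are illustrative assumptions, not any particular engine's scheme):

    import hashlib

    NUM_SHARDS = 4    # hypothetical cluster size
    NUM_REPLICAS = 2  # hypothetical copies of each shard, for availability

    def shard_for(subject: str) -> int:
        # Sharding: each triple lives on exactly one shard, chosen here
        # by hashing its subject, so the graph need not fit on one machine.
        digest = hashlib.sha1(subject.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    def replicas_for(shard: int) -> list[int]:
        # Replication: each shard is copied onto additional nodes, so a
        # single node failure costs neither data nor availability.
        return [(shard + i) % NUM_SHARDS for i in range(NUM_REPLICAS)]

    triple = ("wd:Q42", "wdt:P31", "wd:Q5")
    shard = shard_for(triple[0])
    print(f"stored on shard {shard}, replicas on nodes {replicas_for(shard)}")

Replication alone multiplies read capacity and fault tolerance, but every node still holds the whole graph; only sharding reduces the per-node data size.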
Here is a good explanation of it:
http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF
Interesting read. I don't see how Virtuoso addresses data locality; it looks like the sharding of their RDF store is just hash-based (I'm assuming some kind of uniform hash).
It handles data locality across a shared-nothing cluster just fine, i.e., you can interact with any node in a Virtuoso cluster and experience identical behavior (in the eyes of the operator, every node looks like a single node).
I'm not enough of an expert on graph databases, but I doubt that a highly connected graph like Wikidata will be able to scale reads without some way to address data locality. Obviously, this needs testing.
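As a back-of-the-envelope illustration of that worry (made-up data, uniform subject hashing assumed, no particular engine implied):

    import hashlib
    import random

    NUM_SHARDS = 8  # hypothetical cluster size

    def shard_for(entity: str) -> int:
        digest = hashlib.sha1(entity.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    # Made-up entity IDs and edges standing in for a connected graph.
    random.seed(0)
    entities = [f"Q{i}" for i in range(1000)]
    edges = [(random.choice(entities), random.choice(entities))
             for _ in range(5000)]

    # How many single hops land on a different shard than they start on?
    cross = sum(1 for s, o in edges if shard_for(s) != shard_for(o))
    print(f"{cross / len(edges):.0%} of hops cross a shard boundary")

With uniform hashing and 8 shards, roughly 7 out of 8 hops are remote, and each remote hop in a multi-hop SPARQL query is another network round trip.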
There are live instances of Virtuoso that demonstrate its capabilities. If you want to explore shared-nothing cluster capabilities, our live LOD Cloud cache is the place to start [1][2][3]. If you want to see the single-server open source edition, you have DBpedia, DBpedia-Live, UniProt, and many other nodes in the LOD Cloud to choose from. All of these instances are highly connected.
If you want to get into the depths of Linked Data regarding query processing pipelines that include URI (or Super Key) de-reference, you can take a look at our URIBurner Service [4][5].
Virtuoso handles both shared-nothing clusters and replication, i.e., you can have a cluster configuration used in conjunction with a replication topology if your solution requires that.
Virtuoso is a full-blown SQL RDBMS that leverages SPARQL and a SQL extension for handling challenges associated with Entity Relationship Graphs represented as RDF statement collections. You can even use SPARQL inside SQL from any ODBC- or JDBC-compliant app or service.
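For example, a minimal sketch of what that can look like from Python over ODBC (the DSN name and credentials below are placeholders; adjust to your setup):

    import pyodbc

    # Assumed: a reachable Virtuoso instance with an ODBC DSN named
    # "Virtuoso" and demo credentials.
    conn = pyodbc.connect("DSN=Virtuoso;UID=dba;PWD=dba")
    cursor = conn.cursor()

    # Virtuoso's SQL processor treats a statement prefixed with the
    # SPARQL keyword as a SPARQL query, so it travels over ODBC like
    # ordinary SQL.
    cursor.execute("SPARQL SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
    for row in cursor.fetchall():
        print(row)

    conn.close()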
Links:
[2] https://twitter.com/search?f=tweets&vertical=default&q=%23PermID%20%... -- sample queries via links included in tweets
[3] https://tinyurl.com/y47prg9h -- SPARQL transitive option applied to a skos taxonomy tree
[4] https://linkeddata.uriburner.com -- this service provides Linked Data transformation combined with the ability to de-reference URI variables and URI constants in the body of a query as part of the solution production pipeline; it also includes a service that adds image processing to that pipeline via the PivotViewer module for data visualization
[5] https://medium.com/virtuoso-blog/what-is-small-data-and-why-is-it-important-... -- About Small Data (use of URI-dereference to tackle thorny data access challenges by leveraging the power of HTTP URIs as Super Keys)