On 6/10/19 3:49 PM, Guillaume Lederrey wrote:
On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
<hellmann@informatik.uni-leipzig.de> wrote:
Hi Guillaume,

On 10.06.19 16:54, Guillaume Lederrey wrote:

Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
<hellmann@informatik.uni-leipzig.de> wrote:

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.

I am not sure how to evaluate this correctly. Scaling databases in general is a "known hard problem", and graph databases, which are optimized for graph-like queries as opposed to column stores or relational databases, are a sub-field of it. If you say that "throwing hardware at the problem" does not help, you are admitting that Blazegraph does not scale to what Wikidata needs.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.

Yes, sharding is what you need, I think, instead of replication. This is the technique whereby data is partitioned into more manageable chunks across servers.
Well, we need sharding for scalability and replication for
availability, so we do need both. The hard problem is sharding.
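
To make the split concrete, here is a minimal Python sketch of hash-based sharding combined with replication; the server count, shard count, replica count, and the subject-based partitioning key are all arbitrary assumptions for illustration, not how Blazegraph or Virtuoso actually place data:

    from hashlib import sha1

    NUM_SERVERS = 6   # assumed cluster size
    NUM_SHARDS = 4    # sharding: split the data into manageable chunks
    NUM_REPLICAS = 3  # replication: keep several copies of each chunk

    def shard_of(subject: str) -> int:
        # A uniform hash of the triple's subject decides which shard owns it.
        return int(sha1(subject.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

    def servers_for(shard: int) -> list[int]:
        # Each shard lives on NUM_REPLICAS distinct servers: losing one
        # server loses no data (availability), while storage and query
        # load still spread across the cluster (scalability).
        return [(shard + i) % NUM_SERVERS for i in range(NUM_REPLICAS)]

    # e.g. the triple (wd:Q42, wdt:P31, wd:Q5) is owned by one shard,
    # which is stored on three of the six servers:
    print(servers_for(shard_of("wd:Q42")))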

Here is a good explanation of it:

http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF
Interesting read. I don't see how Virtuoso addresses data locality; it
looks like the sharding of their RDF store is just hash-based (I'm
assuming some kind of uniform hash).


It handles data locality across a shared-nothing cluster just fine, i.e., you can interact with any node in a Virtuoso cluster and experience identical behavior (every node looks like a single node in the eyes of the operator).


I'm not enough of an expert on
graph databases, but I doubt that a highly connected graph like
Wikidata will be able to scale reads without some way to address data
locality. Obviously, this needs testing.
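
As a toy illustration of that locality concern (same arbitrary subject-hash partitioning as the sketch above; the in-memory "shards" stand in for separate servers): every hop of a traversal can land on a different shard, so on a real cluster a k-hop query pays up to k rounds of network calls.

    from hashlib import sha1

    NUM_SHARDS = 4
    shards = [set() for _ in range(NUM_SHARDS)]  # stand-ins for separate servers

    def shard_of(node: str) -> int:
        return int(sha1(node.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

    def add(s: str, p: str, o: str) -> None:
        # A triple is stored on the shard that owns its subject.
        shards[shard_of(s)].add((s, p, o))

    def objects(s: str, p: str) -> list:
        # On a real cluster this lookup is a network call to one shard.
        return [o for (s2, p2, o) in shards[shard_of(s)] if (s2, p2) == (s, p)]

    def two_hop(s: str, p1: str, p2: str) -> list:
        # Hop 1 touches a single shard; hop 2 fans out across the cluster,
        # because the intermediate nodes hash to arbitrary shards.
        return [o for m in objects(s, p1) for o in objects(m, p2)]

    add("wd:Q42", "wdt:P69", "wd:Q691283")      # illustrative Wikidata-style IDs
    add("wd:Q691283", "wdt:P131", "wd:Q21")
    print(two_hop("wd:Q42", "wdt:P69", "wdt:P131"))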

http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/


There are live instances of Virtuoso that demonstrate its capabilities. If you want to explore shared-nothing cluster capabilities, our live LOD Cloud cache is the place to start [1][2][3]. If you want to see the single-server open-source edition, you have DBpedia, DBpedia-Live, UniProt, and many other nodes in the LOD Cloud to choose from. All of these instances are highly connected.

If you want to get into the depths of Linked Data regarding query processing pipelines that include URI (or Super Key) de-reference, you can take a look at our URIBurner Service [4][5].
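
For a rough sense of what URI de-reference in a query pipeline looks like from the client side, here is a Python sketch using the generic SPARQLWrapper client and Virtuoso's get:soft pragma, which asks the server to fetch the named graph URI on demand before evaluating the query; the /sparql endpoint path and the example graph are assumptions:

    # pip install SPARQLWrapper
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://linkeddata.uriburner.com/sparql")  # assumed path
    endpoint.setQuery("""
        # Virtuoso pragma: de-reference the FROM graph URI on demand,
        # pulling the remote document into the store before querying.
        define get:soft "soft"
        SELECT ?p ?o
        FROM <http://dbpedia.org/resource/Berlin>
        WHERE { ?s ?p ?o }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["p"]["value"], row["o"]["value"])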

Virtuoso handles both shared-nothing clusters and replication, i.e., you can have a cluster configuration used in conjunction with a replication topology if your solution requires that.

Virtuoso is a full-blown SQL RDBMS that leverages SPARQL and a SQL extension for handling challenges associated with Entity Relationship Graphs represented as RDF statement collections. You can even use SPARQL inside SQL from any ODBC- or JDBC-compliant app or service.
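
As a sketch of that last point: with Virtuoso's SPASQL, a SPARQL query travels over an ordinary ODBC connection simply by prefixing it with the SPARQL keyword. The DSN name and dba credentials below are placeholders:

    # pip install pyodbc; assumes an ODBC DSN configured for a Virtuoso server.
    import pyodbc

    conn = pyodbc.connect("DSN=Virtuoso;UID=dba;PWD=dba")  # placeholder DSN/credentials
    cursor = conn.cursor()
    # SPASQL: the SPARQL keyword embeds a SPARQL query in the SQL channel.
    cursor.execute("SPARQL SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")
    for row in cursor.fetchall():
        print(row)
    conn.close()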


Links:

[1] http://lod.openlinksw.com

[2] https://twitter.com/search?f=tweets&vertical=default&q=%23PermID%20%40kidehen&src=typd -- query samplings via links included in tweets

[3] https://tinyurl.com/y47prg9h -- SPARQL transitive option applied to a skos taxonomy tree

[4] https://linkeddata.uriburner.com -- this service provides Linked Data transformation combined with an ability to de-reference URI variables and URI constants in the body of a query as part of the solution production pipeline; it also includes a service that adds image processing to the aforementioned pipeline via the PivotViewer module for data visualization

[5] https://medium.com/virtuoso-blog/what-is-small-data-and-why-is-it-important-fbf5f267884 -- About Small Data (use of URI-dereference to tackle thorny data access challenges by leveraging the power of HTTP URIs as Super Keys)


-- 
Regards,

Kingsley Idehen	      
Founder & CEO 
OpenLink Software   
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com