On 6/10/19 3:49 PM, Guillaume Lederrey wrote:
On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Hi Guillaume,
On 10.06.19 16:54, Guillaume Lederrey wrote:
Hello!
On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Hi Guillaume,
On 06.06.19 21:32, Guillaume Lederrey wrote:
Hello all!
There have been a number of concerns raised about the performance and scaling of Wikidata Query Service. We share those concerns and we are doing our best to address them. Here is some info about what is going on:
In an ideal world, WDQS should:
- scale in terms of data size
- scale in terms of number of edits
- have low update latency
- expose a SPARQL endpoint for queries
- allow anyone to run any queries on the public WDQS endpoint
- provide great query performance
- provide a high level of availability
Scaling graph databases is a "known hard problem", and we are reaching a scale where there are no obvious easy solutions to address all the above constraints. At this point, just "throwing hardware at the problem" is not an option anymore. We need to go deeper into the details and potentially make major changes to the current architecture. Some scaling considerations are discussed in [1]. This is going to take time.
I am not sure how to evaluate this correctly. Scaling databases in general is a "known hard problem", and graph databases are a sub-field of it, optimized for graph-like queries as opposed to column stores or relational databases. If you say that "throwing hardware at the problem" does not help, you are admitting that Blazegraph does not scale to what Wikidata needs.
Yes, I am admitting that Blazegraph (at least in the way we are using it at the moment) does not scale to our future needs. Blazegraph does have support for sharding (what they call "Scale Out"). And yes, we need to have a closer look at how that works. I'm not the expert here, so I won't even try to assert whether that's a viable solution or not.
Yes, sharding is what you need, I think, instead of replication. This is the technique where data is repartitioned into more manageable chunks across servers.
Well, we need sharding for scalability and replication for availability, so we do need both. The hard problem is sharding.
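To make the distinction concrete, here is a minimal sketch (in Python; the shard counts and the subject-hash placement are illustrative assumptions, not any particular engine's scheme):

    import hashlib

    NUM_SHARDS = 4    # hypothetical cluster size
    NUM_REPLICAS = 2  # hypothetical copies of each shard, for availability

    def shard_for(subject: str) -> int:
        # Sharding: each triple lives on exactly one shard, chosen here
        # by hashing its subject, so the graph need not fit on one machine.
        digest = hashlib.sha1(subject.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    def replicas_for(shard: int) -> list[int]:
        # Replication: each shard is copied onto additional nodes, so a
        # single node failure costs neither data nor availability.
        return [(shard + i) % NUM_SHARDS for i in range(NUM_REPLICAS)]

    triple = ("wd:Q42", "wdt:P31", "wd:Q5")
    shard = shard_for(triple[0])
    print(f"stored on shard {shard}, replicas on nodes {replicas_for(shard)}")

Replication alone multiplies read capacity and fault tolerance, but every node still holds the whole graph; only sharding reduces the per-node data size.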
Here is a good explanation of it:
http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF
Interesting read. I don't see how Virtuoso addresses data locality; it looks like the sharding of their RDF store is just hash-based (I'm assuming some kind of uniform hash).
It handles data locality across a shared-nothing cluster just fine, i.e., you can interact with any node in a Virtuoso cluster and experience identical behavior (in the eyes of the operator, every node looks like a single node).
I'm not enough of an expert on graph databases, but I doubt that a highly connected graph like Wikidata will be able to scale reads without some way to address data locality. Obviously, this needs testing.
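As a back-of-the-envelope illustration of that worry (made-up data, uniform subject hashing assumed, no particular engine implied):

    import hashlib
    import random

    NUM_SHARDS = 8  # hypothetical cluster size

    def shard_for(entity: str) -> int:
        digest = hashlib.sha1(entity.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    # Made-up entity IDs and edges standing in for a connected graph.
    random.seed(0)
    entities = [f"Q{i}" for i in range(1000)]
    edges = [(random.choice(entities), random.choice(entities))
             for _ in range(5000)]

    # How many single hops land on a different shard than they start on?
    cross = sum(1 for s, o in edges if shard_for(s) != shard_for(o))
    print(f"{cross / len(edges):.0%} of hops cross a shard boundary")

With uniform hashing and 8 shards, roughly 7 out of 8 hops are remote, and each remote hop in a multi-hop SPARQL query is another network round trip.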
There are live instances of Virtuoso that demonstrate its capabilities. If you want to explore shared-nothing cluster capabilities, our live LOD Cloud cache is the place to start [1][2][3]. If you want to see the single-server open source edition, you have DBpedia, DBpedia-Live, UniProt, and many other nodes in the LOD Cloud to choose from. All of these instances are highly connected.
If you want to get into the depths of Linked Data regarding query processing pipelines that include URI (or Super Key) de-reference, you can take a look at our URIBurner Service [4][5].
Virtuoso handles both shared-nothing clusters and replication, i.e., you can have a cluster configuration used in conjunction with a replication topology if your solution requires that.
Virtuoso is a full-blown SQL RDBMS that leverages SPARQL and a SQL extension for handling challenges associated with Entity Relationship Graphs represented as RDF statement collections. You can even use SPARQL inside SQL from any ODBC- or JDBC-compliant app or service.
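For example, a minimal sketch of what that can look like from Python over ODBC (the DSN name and credentials below are placeholders; adjust to your setup):

    import pyodbc

    # Assumed: a reachable Virtuoso instance with an ODBC DSN named
    # "Virtuoso" and demo credentials.
    conn = pyodbc.connect("DSN=Virtuoso;UID=dba;PWD=dba")
    cursor = conn.cursor()

    # Virtuoso's SQL processor treats a statement prefixed with the
    # SPARQL keyword as a SPARQL query, so it travels over ODBC like
    # ordinary SQL.
    cursor.execute("SPARQL SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
    for row in cursor.fetchall():
        print(row)

    conn.close()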
Links:
[2] https://twitter.com/search?f=tweets&vertical=default&q=%23PermID%20%... -- sample queries via links included in tweets
[3] https://tinyurl.com/y47prg9h -- SPARQL transitive option applied to a skos taxonomy tree
[4] https://linkeddata.uriburner.com -- this service provides Linked Data transformation combined with the ability to de-reference URI variables and URI constants in the body of a query as part of the solution production pipeline; it also includes a service that adds image processing to that pipeline via the PivotViewer module for data visualization
[5] https://medium.com/virtuoso-blog/what-is-small-data-and-why-is-it-important-... -- About Small Data (use of URI-dereference to tackle thorny data access challenges by leveraging the power of HTTP URIs as Super Keys)