one of the few SPARQL
endpoints with much more data (7x) than Wikidata and significant
external users. I can chime in with our experiences of hosting data with
Virtuoso. All in all, I am very happy with it, and it has made our
endpoint possible and useful on a shoestring budget.
Like Wikidata, we have an async loading process, and we allow anyone to run
analytics queries on our SPARQL endpoint with generous timeouts.
We have two servers, each with 256 GB of RAM and 8 TB of raw (consumer)
SSD space. These are whitebox AMD machines from 2014, and the main cost
at the time was the RAM. The setup was relatively cheap (cheaper than
what is documented at
)
Even in 2014 we already had more data than you do now.
There is a third multi-use server which does the loading of data
offline. This is now a larger, newer EPYC server with more RAM and more SSD,
but it is used for much more than just the RDF loading.
Unlike most sites, we have our own custom frontend in front of
Virtuoso. We did this to allow more styling, as well as to stay flexible
and change implementations at our whim. E.g. we double-parse the SPARQL
queries and even rewrite some to be friendlier. I suggest you do the
same no matter which DB you use in the end, and we would be willing to
open-source ours (it is in Java, and uses RDF4J and some ugly JSPX, but
it works; if not to use, then at least as an inspiration). We did this to
avoid being locked into endpoint-specific features.
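To give a flavour of the query-rewriting idea: our real frontend is Java with RDF4J, but the following minimal stdlib-Python sketch shows one such "friendlier" rewrite, capping the result size of a query. The cap value and the regex-based approach are illustrative assumptions, not our actual implementation.

```python
import re

# Assumed cap for illustration only, not a value from our real frontend.
MAX_LIMIT = 10_000

def rewrite_query(sparql: str) -> str:
    """Ensure a query carries a LIMIT no larger than MAX_LIMIT."""
    match = re.search(r'\bLIMIT\s+(\d+)\s*$', sparql, re.IGNORECASE)
    if match is None:
        # No LIMIT at all: append one so a runaway query cannot
        # stream an unbounded result set.
        return f"{sparql.rstrip()}\nLIMIT {MAX_LIMIT}"
    if int(match.group(1)) > MAX_LIMIT:
        # Clamp an excessive LIMIT down to the cap.
        return sparql[:match.start()] + f"LIMIT {MAX_LIMIT}"
    return sparql

print(rewrite_query("SELECT ?s WHERE { ?s ?p ?o }"))
```

A production rewriter would parse the query properly (as we do, twice) rather than pattern-match, which also enables rewrites that need the full query structure.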
We use the open-source edition of Virtuoso and do not need the sharding
etc. features. We use the CAIS (Cheap Array of Independent Servers ;)
approach to resilience. OpenLink Software, the company behind Virtuoso,
can deliver support for the open-source edition, and if you are
interested I suggest you talk to them.
Virtuoso 7 has become very resilient over the years and does not need
much hand-holding anymore (in 2015 this was different). Of course we
have aggressive auto-restart code, but this is rarely triggered these
days, even while the inbound queries are getting more complex.
Some of the tricks you have built into WDQS are going to be a pain to
redo in Virtuoso. But I don't see anything impossible there.
Pragmatically, while WDQS is a graph database, the queries are actually
very relational, and none of the standard graph algorithms are used. To
be honest, RDF is actually a relational system, which means that
relational techniques are very good at answering such queries. The sole
issue is recursive queries (e.g. rdfs:subClassOf+), for which the
Virtuoso implementation is adequate but not great.
This is why recovering physical schemata from RDF data is such a
powerful optimization technique [1]: i.e. you tend to do joins, not
traversals. This is not always true, but I strongly suspect it will hold
for the vast majority of the Wikidata Query Service case.
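To make the "joins, not traversals" point concrete: an RDF property path like rdfs:subClassOf+ maps directly onto a recursive self-join over a plain relational triple table. The sketch below uses SQLite with made-up class-hierarchy data, purely for illustration.

```python
import sqlite3

# A plain relational triple table: the whole graph is rows of (s, p, o).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, 'rdfs:subClassOf', ?)",
    [("Dog", "Mammal"), ("Mammal", "Animal"), ("Animal", "Thing")],
)

# SPARQL: SELECT ?c WHERE { :Dog rdfs:subClassOf+ ?c }
# becomes a recursive CTE, i.e. an iterated relational join:
rows = conn.execute("""
    WITH RECURSIVE supers(c) AS (
        SELECT o FROM triples WHERE s = 'Dog' AND p = 'rdfs:subClassOf'
        UNION
        SELECT o FROM triples, supers
        WHERE s = supers.c AND p = 'rdfs:subClassOf'
    )
    SELECT c FROM supers
""").fetchall()
supers = sorted(r[0] for r in rows)
print(supers)  # ['Animal', 'Mammal', 'Thing']
```

This is exactly the recursive case called out above as the one place where relational engines (Virtuoso included) are adequate rather than great; every non-recursive pattern is an ordinary join.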
I hope this was helpful, and I am willing to answer further questions.
Regards,
Jerven
[1]
and associated work that was done by Orri Erling, which unfortunately
has not yet landed in the Virtuoso master branch.
On 6/10/19 9:49 PM, Guillaume Lederrey wrote:
On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
<hellmann(a)informatik.uni-leipzig.de> wrote:
Hi Guillaume,
On 10.06.19 16:54, Guillaume Lederrey wrote:
Hello!
On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
<hellmann(a)informatik.uni-leipzig.de> wrote:
Hi Guillaume,
On 06.06.19 21:32, Guillaume Lederrey wrote:
Hello all!
There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:
In an ideal world, WDQS should:
* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability
Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.
I am not sure how to evaluate this correctly. Scaling databases in general is a
"known hard problem", and graph databases are a sub-field of it, optimized
for graph-like queries as opposed to column stores or relational databases. If you say
that "throwing hardware at the problem" does not help, you are admitting that
Blazegraph does not scale for what is needed by Wikidata.
Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.
Yes, sharding is what you need, I think, instead of replication. This is the technique
where data is repartitioned into more manageable chunks across servers.
Well, we need sharding for scalability and replication for
availability, so we do need both. The hard problem is sharding.
Interesting read. I don't see how Virtuoso addresses data locality, it
looks like sharding of their RDF store is just hash based (I'm
assuming some kind of uniform hash). I'm not enough of an expert on
graph databases, but I doubt that a highly connected graph like
Wikidata will be able to scale reads without some way to address data
locality. Obviously, this needs testing.
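To make the locality concern concrete, here is a minimal sketch of uniform hash-based sharding of triples by subject (an assumed scheme for illustration, not necessarily what Virtuoso actually does):

```python
import hashlib

N_SHARDS = 4  # illustrative shard count

def shard_of(subject: str) -> int:
    """Deterministically map a subject IRI to a shard by uniform hash."""
    digest = hashlib.sha1(subject.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

# Made-up Wikidata-style triples for illustration.
triples = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P106", "wd:Q36180"),
    ("wd:Q5", "wdt:P279", "wd:Q795052"),
]

shards = {i: [] for i in range(N_SHARDS)}
for s, p, o in triples:
    shards[shard_of(s)].append((s, p, o))

# All triples sharing a subject co-locate on one shard, but a join that
# follows wd:Q42 -> wd:Q5 may still hop shards: uniform hashing gives
# balance, not graph locality.
```

This is why a highly connected graph is the hard case: a multi-hop query over uniformly hashed data tends to fan out across shards on every hop.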
http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/
Sharding, scale-out or repartitioning is a classic enterprise feature for open-source
databases. I am rather surprised that Blazegraph is full GPL without an enterprise
edition. But then they really sounded like their goal as a company was to be bought by a
bigger fish, in this case Amazon Web Services. What is their deal? Are they offering
support?
So if you go open-source, I think you will have a hard time finding good free databases
with sharding/repartitioning. FoundationDB, as proposed in the grant [1], is from Apple.
[1]
https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB
I mean, try the sharding feature. At some point, though, it might be worth considering
going enterprise. Corporate open source often has a twist.
Closed source is not an option. We have strong open source
requirements to deploy anything in our production environment.
Just a note here: Virtuoso is also a full RDBMS,
so you could probably keep the wikibase DB in the same cluster and fix the asynchronicity.
That is also true for any mappers like Sparqlify:
http://aksw.org/Projects/Sparqlify.html
However, these shift the problem: then you need a sharded/repartitioned relational
database....
There is no plan to move the Wikibase storage out of MySQL at the
moment. In any case, having a low coupling between the primary storage
for wikidata and a secondary storage for complex querying is a sound
architectural principle. This asynchronous update process is most
probably going to stay in place, just because it makes a lot of sense.
Thanks for the discussion so far! It is always interesting to have outside ideas!
Have fun!
Guillaume
All the best,
Sebastian
From [1]:
At the moment, each WDQS cluster is a group of independent servers, sharing nothing, with
each server independently updated and each server holding a full data set.
Then it is not a "cluster" in the sense of databases. It is more a redundancy
architecture like RAID 1. Is this really how Blazegraph does it? Don't they have a
proper cluster solution, where they repartition data across servers? Or are these
independent servers a Wikimedia staff homebuild?
It all depends on your definition of a cluster. We have groups of
machines collectively serving some coherent traffic, but each machine
is completely independent from the others. So yes, the comparison to RAID 1
is adequate.
Some info here:
- We evaluated some stores according to their performance:
http://www.semantic-web-journal.net/content/evaluation-metadata-representat…
"Evaluation of Metadata Representations in RDF stores"
Thanks for the link! That looks quite interesting!
- Virtuoso has proven quite useful. I don't want to advertise here, but the instance
they have going for DBpedia uses ridiculous hardware, i.e. 64 GB RAM, and it is also the
open-source version, not the professional one with clustering and repartitioning
capability. We have been playing this game for ten years now: everybody tries other
databases, but then most people come back to Virtuoso. I have to admit that OpenLink is
maintaining the hosting for DBpedia themselves, so they know how to optimise. Their
normal customers are large banks with millions of write transactions per hour. In LOD2
they also implemented column-store features with MonetDB and repartitioning in clusters.
I'm not entirely sure how to read the above (and a quick look at
virtuoso website does not give me the answer either), but it looks
like the sharding / partitioning options are only available in the
enterprise version. That probably makes it a non starter for us.
- I recently heard a presentation from ArangoDB, and they had a good cluster concept as
well, although I don't know anybody who has tried it. The slides seemed to make sense.
Nice, another one to add to our list of options to test.
All the best,
Sebastian
Realistically, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non-negotiable: if we can't
keep up with Wikidata in terms of data size or number of edits, it does
not make sense to address query performance. On some constraints, we
will probably need to compromise.
For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.
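As a toy model (not the actual WDQS updater), the tradeoff described above can be sketched like this: edits enter a queue with a timestamp, a consumer applies them in batches, and "lag" is simply the age of the oldest unapplied edit. Correctness means every queued edit is eventually applied, whatever the lag.

```python
import time
from collections import deque

edit_queue = deque()  # pending (edit_id, timestamp) pairs
applied = []          # edits that have landed in the query store

def record_edit(edit_id: str) -> None:
    """Producer side: enqueue an edit with its arrival time."""
    edit_queue.append((edit_id, time.monotonic()))

def apply_batch(batch_size: int = 2) -> None:
    """Consumer side: apply up to batch_size pending edits, in order."""
    for _ in range(min(batch_size, len(edit_queue))):
        edit_id, _ts = edit_queue.popleft()
        applied.append(edit_id)  # every edit eventually lands:
                                 # correctness over latency

def current_lag() -> float:
    """Seconds since the oldest pending edit arrived (0.0 if none)."""
    if not edit_queue:
        return 0.0
    return time.monotonic() - edit_queue[0][1]

for i in range(5):
    record_edit(f"edit-{i}")
apply_batch()  # applies edit-0 and edit-1; edits 2..4 still lag
print(applied, len(edit_queue))
```

The real pipeline adds retries, batch-size tuning, and backpressure, but the shape is the same: lag is a measurable, expected property of the design, not a failure.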
We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We
are planning to hire an additional engineer to help us support the
service in the long term. You can follow our current work in phabricator [2].
If anyone has experience with scaling large graph databases, please
reach out to us, we're always happy to share ideas!
Thanks all for your patience!
Guillaume
[1]
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2]
https://phabricator.wikimedia.org/project/view/1239/
--
All the best,
Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects:
http://dbpedia.org,
http://nlp2rdf.org,
http://linguistics.okfn.org,
https://www.w3.org/community/ld4lt
Homepage:
http://aksw.org/SebastianHellmann
Research Group:
http://aksw.org