Re: [Wikidata] Scaling Wikidata Query Service

17 Jun 2019


Hello, Stas --
On Jun 13, 2019, at 07:52 PM, Stas Malyshev <smalyshev@wikimedia.org> wrote:
...
Hi!
...
It handles data locality across a shared-nothing cluster just fine, i.e., you can interact with any node in a Virtuoso cluster and experience identical behavior (every node looks like a single node in the eyes of the operator).
Does this mean no sharding, i.e. each server stores the full DB?
No.
The full DB is automatically sharded across all Virtuoso instances in an Elastic Cluster, and each instance *appears* to store the full DB -- i.e., you can issue a query to any instance in an Elastic Cluster, if you have the relevant communication details (typically IP address and port number), and you will get the same results from it as from any other instance in that Elastic Cluster.
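To make the "sharded, but every instance appears to hold everything" idea concrete, here is a toy sketch in Python. It is not how Virtuoso implements Elastic Clusters; the class name, the hash-by-subject partitioning rule, and the scatter-gather lookup are all assumptions made purely for illustration.

```python
import zlib


class ToyElasticCluster:
    """Toy model only: triples are partitioned across nodes by a hash of
    the subject (an assumed rule), yet a client may talk to any node."""

    def __init__(self, n_nodes):
        self.shards = [set() for _ in range(n_nodes)]

    def _check(self, via_node):
        if not 0 <= via_node < len(self.shards):
            raise ValueError("no such node")

    def _owner(self, subject):
        # Deterministic hash so the same subject always lands on one shard.
        return zlib.crc32(subject.encode("utf-8")) % len(self.shards)

    def insert(self, via_node, triple):
        # The contacted node forwards the write to the owning shard.
        self._check(via_node)
        self.shards[self._owner(triple[0])].add(triple)

    def match(self, via_node, predicate):
        # The contacted node gathers matches from every shard, so the
        # answer does not depend on which node the client connected to.
        self._check(via_node)
        hits = set()
        for shard in self.shards:
            hits |= {t for t in shard if t[1] == predicate}
        return sorted(hits)
```

In this toy, no single shard holds the whole dataset, but `match(0, p)` and `match(2, p)` return the same list, which is the operator-visible behavior described above.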
(I am generally specific about Elastic Cluster vs Replication Cluster, because these are different though complementary technologies, implemented via different Modules in Virtuoso.)
...
This is the model we're using currently, but given the growth of the data it may not be sustainable on current hardware. I see in your tables that Uniprot has about 30B triples, but I wonder what the update loads there look like. Our main issue is that the hardware we have now is showing its limits when there are a lot of updates in parallel with significant query load. So I wonder if the "single server holds everything" model is sustainable in the long term.
Your questions are unsurprising, and are one of the reasons for the benchmark efforts of the LDBC --
http://ldbcouncil.org/benchmarks/
Uniprot does not get a lot of updates, and it is running on a single instance -- i.e., there's no cluster involved at all, neither Elastic (Shared-Nothing) Cluster nor Replication Cluster -- so it's probably not the best example for your workflows.
I think the LDBC's Social Networking Benchmark (SNB) is likely to be the closest to the Wikidata update and query patterns, so you may find these articles interesting --
1. SNB Interactive, Part 1: What is SNB Interactive Really About?
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1835
2. SNB Interactive, Part 2: Modeling Choices
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1837
3. SNB Interactive, Part 3: Choke Points and Initial Run on Virtuoso
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1842
...
...
There are live instances of Virtuoso that demonstrate its capabilities. If you want to explore shared-nothing cluster capabilities, then our live LOD Cloud cache is the place to start [1][2][3]. If you want to see the single-server open source edition, then you have DBpedia, DBpedia-Live, Uniprot, and many other nodes in the LOD Cloud to choose from. All of these instances are highly connected.
Again, here the question is not so much "can you load 7bn triples into Virtuoso" - we know we can. What we want to figure out is whether, given the specific query/update patterns we have now, it is going to give us significantly better performance, allowing us to support our projected growth. And also, possibly, whether Virtuoso has ways to make our update workflow more efficient - e.g. right now, if one triple changes in a Wikidata item, we're essentially downloading and updating the whole item (not exactly, since triples that stay the same are preserved, but it requires a lot of data transfer to express that in SPARQL). Would there be ways to update things more efficiently?
The first thing that will improve your performance is to break out of the "stored as JSON blobs" pattern you've been using.
Updates should not require a full download of the named graph (which I think is what your JSON Blobs amount to) followed by an upload of the entire revised named graph.
Even if you *query* the full content of an existing named graph, determine the necessary changes locally, and then submit an update query which includes a full set of DELETE + INSERT statements (this "full set" only including the *changed* triples), you should find a significant reduction in data throughput.
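As a rough illustration of that last approach, the following Python sketch computes the changed triples locally and builds a SPARQL 1.1 Update containing only those. The function names, the example graph IRI, and the representation of triples as tuples of already-serialized terms are assumptions for the sake of the example, not part of any Virtuoso or Wikidata API. Note that DELETE DATA does not allow blank nodes, so this sketch assumes there are none.

```python
def diff_triples(old, new):
    """Return (to_delete, to_insert): triples only in old, only in new."""
    old, new = set(old), set(new)
    return sorted(old - new), sorted(new - old)


def build_update(graph_iri, to_delete, to_insert):
    """Build a SPARQL 1.1 Update string that touches only changed triples.

    Each triple is a tuple of three already-serialized terms, e.g.
    ('<http://example.org/s>', '<http://example.org/p>', '"value"').
    """
    def block(keyword, triples):
        body = "\n".join(f"    {s} {p} {o} ." for s, p, o in triples)
        return f"{keyword} DATA {{ GRAPH <{graph_iri}> {{\n{body}\n}} }}"

    parts = []
    if to_delete:
        parts.append(block("DELETE", to_delete))
    if to_insert:
        parts.append(block("INSERT", to_insert))
    # Multiple operations in one update request are separated by ';'.
    return " ;\n".join(parts)


# Hypothetical item with two triples, one of which changes.
S = "<http://example.org/item/1>"
old = [(S, "<http://example.org/p/label>", '"old label"'),
       (S, "<http://example.org/p/kind>", '"thing"')]
new = [(S, "<http://example.org/p/label>", '"new label"'),
       (S, "<http://example.org/p/kind>", '"thing"')]
to_delete, to_insert = diff_triples(old, new)
update = build_update("http://example.org/graph/1", to_delete, to_insert)
```

The unchanged triple never appears in `update`, so the amount of data sent scales with the size of the change rather than the size of the named graph.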
The live parallel to such regular updates is DBpedia-Live, which started from a static load of dump files, and has been (and is still) continuously updated by an RDF feed based on the Wikipedia update firehose.  The same RDF feed is made available to users of our AMI-based DBpedia-Live mirror (currently being refreshed, and soon to be made available for new users) --
https://aws.amazon.com/marketplace/pp/B012DSCFEK
...
...
Virtuoso handles both shared-nothing clusters and replication i.e., you can have a cluster configuration used in conjunction with a replication topology if your solution requires that.
Replication could certainly be useful, I think, if it's faster to update a single server and then replicate than to simultaneously update all servers (that's what is happening now).
There are multiple Replication strategies which might be used, as well as multiple Replication Cluster topologies which might be considered, and none of them is inherently the fastest.
That said, periodic monolithic replication of an entire dataset or DB would certainly not be faster than propagation of DIFFs from the master to the replica(s).  Replication via periodic cumulative DIFFs *may* be faster than incremental DIFFs that are dispatched after every change, but this depends on many variables.
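One of those variables is how often changes cancel each other out within a batch. The toy Python sketch below only counts operations shipped to a replica; it ignores latency, transaction cost, and everything else that matters in practice, and its function names and the ("+"/"-", triple) change-log format are invented for illustration.

```python
def apply_ops(state, ops):
    """Apply a list of ('+' | '-', triple) operations to a set of triples."""
    result = set(state)
    for sign, triple in ops:
        if sign == "+":
            result.add(triple)
        else:
            result.discard(triple)
    return result


def cumulative_diff(master_before, ops):
    """Collapse a batch of operations into the net change it produces.

    Assumes the replica was in sync with master_before when the batch began.
    """
    after = apply_ops(master_before, ops)
    deletes = [("-", t) for t in sorted(master_before - after)]
    inserts = [("+", t) for t in sorted(after - master_before)]
    return deletes + inserts


# Hypothetical batch: triple b is added and then removed again.
a, b, c = ("s", "p", "a"), ("s", "p", "b"), ("s", "p", "c")
before = {a}
ops = [("+", b), ("-", b), ("+", c), ("-", a)]
net = cumulative_diff(before, ops)
```

Here the incremental approach ships all four operations while the cumulative diff ships two, and the replica ends in the same state either way. With a batch in which nothing cancels, the two would ship the same number of operations, and the cumulative approach would only have added delay.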
This page of cluster topology diagrams starts with Replication-only and progresses to Elastic-only.  (There are no illustrations of a combined Replicating-Elastic-Cluster on this page.)
http://vos.openlinksw.com/owiki/wiki/VOS/VirtClusteringDiagrams
Any Replication Cluster topology and methodology -- including zero Replication -- may be combined with an Elastic (Shared-Nothing) Cluster setup.  Generally speaking, when these are combined, an entire Elastic Cluster would take the place of each Single-Server Instance in a given Replication topology.
I hope this helps your understanding of the available options.
Ted
--
A: Yes.                          http://www.idallen.com/topposting.html
| Q: Are you sure?           
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?
Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
Senior Support & Evangelism  //        mailto:tthibodeau@openlinksw.com
                             //              http://twitter.com/TallTed
OpenLink Software, Inc.      //              http://www.openlinksw.com/
         20 Burlington Mall Road, Suite 322, Burlington MA 01803
     Weblog    -- http://www.openlinksw.com/blogs/
     Community -- https://community.openlinksw.com/
     LinkedIn  -- http://www.linkedin.com/company/openlink-software/
     Twitter   -- http://twitter.com/OpenLink
     Facebook  -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers
