Re: [Wikidata] Scaling Wikidata Query Service

17 Jun 2019

Hello, Stas --

On Jun 13, 2019, at 07:52 PM, Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
...

 Hi!

  It handles data locality across a shared nothing
cluster just fine i.e., you can interact with any node in a Virtuoso cluster and
experience identical behavior (every node looks like a single node in the eyes of the
operator).  
 Does this mean no sharding, i.e. each server stores the full DB? 
No.

The full DB is automatically sharded across all Virtuoso instances in an Elastic Cluster,
and each instance *appears* to store the full DB -- i.e., you can issue a query to any
instance in an Elastic Cluster, if you have the relevant communication details (typically
IP address and port number), and you will get the same results from it as from any other
instance in that Elastic Cluster.
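
One way to check this behavior for yourself is to send the same query to two or more instances and compare the answers. The following is only a minimal sketch: the host names are hypothetical, and it assumes each instance exposes a standard SPARQL Protocol endpoint over HTTP that returns SPARQL JSON results. The helper names are made up for illustration and are not part of any Virtuoso API.

```python
import json
import urllib.parse
import urllib.request


def sparql_select(endpoint, query, opener=urllib.request.urlopen):
    """Run a SELECT query over the SPARQL Protocol and return its bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with opener(request) as response:
        return json.load(response)["results"]["bindings"]


def nodes_agree(endpoints, query, opener=urllib.request.urlopen):
    """Return True if every endpoint returns the same set of bindings."""
    results = []
    for endpoint in endpoints:
        bindings = sparql_select(endpoint, query, opener)
        # Serialize each row so that row order does not affect the comparison.
        results.append(sorted(json.dumps(b, sort_keys=True) for b in bindings))
    return all(r == results[0] for r in results)


if __name__ == "__main__":
    # Hypothetical addresses; substitute the IP and port of your own instances.
    cluster = [
        "http://node1.example:8890/sparql",
        "http://node2.example:8890/sparql",
    ]
    print(nodes_agree(cluster, "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"))
```

Use a query with a deterministic answer (an aggregate, or a SELECT with ORDER BY) for this kind of comparison, and run it while no updates are in flight.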

(I am generally specific about Elastic Cluster vs Replication Cluster, because these are
different though complementary technologies, implemented via different Modules in
Virtuoso.)

...
  This is the model we're using currently, but given
the growth of the data it may not be sustainable on current hardware. I see in your tables
that Uniprot has about 30B triples, but I wonder what update loads there look like. Our
main issue is that the hardware we have now is showing its limits when there's a lot
of updates in parallel to significant query load. So I wonder if the "single server
holds everything" model is sustainable in the long term. 
Your questions are unsurprising, and are one of the reasons for the benchmark efforts of
the LDBC --

   http://ldbcouncil.org/benchmarks/

Uniprot does not get a lot of updates, and it is running on a single instance -- i.e.,
there's no cluster involved at all, neither Elastic (Shared-Nothing) Cluster nor
Replication Cluster -- so it's probably not the best example for your workflows.

I think the LDBC's Social Networking Benchmark (SNB) is likely to be the closest to
the Wikidata update and query patterns, so you may find these articles interesting --

1. SNB Interactive, Part 1: What is SNB Interactive Really About?
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1835

2. SNB Interactive, Part 2: Modeling Choices
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1837

3. SNB Interactive, Part 3: Choke Points and Initial Run on Virtuoso
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1842

...
   There are live
instances of Virtuoso that demonstrate its capabilities. If you want to explore
shared-nothing cluster capabilities then our live LOD Cloud cache is the place to start
[1][2][3]. If you want to see the single-server open source edition, then you have DBpedia,
DBpedia-Live, Uniprot and many other nodes in the LOD Cloud to choose from. All of these
instances are highly connected.  
 Again, here the question is not too much in "can you load 7bn triples into
Virtuoso" - we know we can. What we want to figure out is whether, given the specific
query/update patterns we have now, it is going to give us significantly better
performance, allowing us to support our projected growth. And also possibly whether Virtuoso
has ways to make our update workflow more efficient - e.g. right now if one triple
changes in Wikidata item, we're essentially downloading and updating the whole item
(not exactly since triples that stay the same are preserved but it requires a lot of data
transfer to express that in SPARQL). Would there be ways to update the things more
efficiently? 
The first thing that will improve your performance is to break out of the "stored as
JSON blobs" pattern you've been using.

Updates should not require a full download of the named graph (which I think is what your
JSON Blobs amount to) followed by an upload of the entire revised named graph.

Even if you *query* the full content of an existing named graph, determine the necessary
changes locally, and then submit an update query which includes a full set of DELETE +
INSERT statements (this "full set" only including the *changed* triples), you
should find a significant reduction in data throughput.
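
As a rough illustration of the "changed triples only" approach, here is a minimal sketch that diffs the old and new triple sets for one named graph and emits a SPARQL 1.1 Update request containing just the differences. The function name and the example IRIs are hypothetical; the sketch assumes triples are already serialized as SPARQL terms and contain no blank nodes (which DELETE DATA does not allow).

```python
from typing import Iterable, Tuple

# (subject, predicate, object), each already written in SPARQL term syntax,
# e.g. '<http://example.org/Q1>' or '"some literal"'.
Triple = Tuple[str, str, str]


def build_delta_update(graph_iri: str,
                       old: Iterable[Triple],
                       new: Iterable[Triple]) -> str:
    """Build a SPARQL Update that turns `old` into `new` for one named graph.

    Only triples that differ are mentioned; unchanged triples are never sent.
    Returns an empty string when there is nothing to change.
    """
    old_set, new_set = set(old), set(new)
    removed = sorted(old_set - new_set)  # present before, gone now
    added = sorted(new_set - old_set)    # absent before, present now

    def data_block(keyword: str, triples) -> str:
        body = "\n".join(f"    {s} {p} {o} ." for s, p, o in triples)
        return f"{keyword} DATA {{\n  GRAPH <{graph_iri}> {{\n{body}\n  }}\n}}"

    operations = []
    if removed:
        operations.append(data_block("DELETE", removed))
    if added:
        operations.append(data_block("INSERT", added))
    # SPARQL Update separates operations in one request with ';'.
    return " ;\n".join(operations)


if __name__ == "__main__":
    before = [
        ("<http://example.org/Q1>", "<http://example.org/label>", '"old label"'),
        ("<http://example.org/Q1>", "<http://example.org/type>", "<http://example.org/Item>"),
    ]
    after = [
        ("<http://example.org/Q1>", "<http://example.org/label>", '"new label"'),
        ("<http://example.org/Q1>", "<http://example.org/type>", "<http://example.org/Item>"),
    ]
    print(build_delta_update("http://example.org/graph/Q1", before, after))
```

For an item with hundreds of triples where one value changes, the resulting request carries two triples rather than the whole item.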

The live parallel to such regular updates is DBpedia-Live, which started from a static
load of dump files, and has been (and is still) continuously updated by an RDF feed based
on the Wikipedia update firehose.  The same RDF feed is made available to users of our
AMI-based DBpedia-Live mirror (currently being refreshed, and soon to be made
available for new users) --

   https://aws.amazon.com/marketplace/pp/B012DSCFEK

...
   Virtuoso
handles both shared-nothing clusters and replication i.e., you can have a cluster
configuration used in conjunction with a replication topology if your solution requires
that.  
 Replication could certainly be useful, I think, if it's faster to update a single server
and then replicate than to simultaneously update all servers (that's what is happening
now). 
There are multiple Replication strategies which might be used, as well as multiple
Replication Cluster topologies which might be considered, and none of them is inherently
the fastest.

That said, periodic monolithic replication of an entire dataset or DB would certainly not
be faster than propagation of DIFFs from the master to the replica(s).  Replication via
periodic cumulative DIFFs *may* be faster than incremental DIFFs that are dispatched after
every change, but this depends on many variables.
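
To make the cumulative-versus-incremental distinction concrete, here is a small sketch of how a series of incremental DIFFs might be collapsed into one cumulative DIFF before dispatch. This is an illustration of the general idea only, not a description of how Virtuoso replication is implemented; the function name and the (removed, added) representation are assumptions made for the example.

```python
def coalesce_diffs(diffs):
    """Collapse ordered incremental diffs into one cumulative diff.

    `diffs` is a sequence of (removed, added) pairs of sets of triples, in the
    order they were applied on the master. Each diff is assumed to be valid
    against the state it was applied to. Returns the net (removed, added).
    """
    net_removed, net_added = set(), set()
    for removed, added in diffs:
        for triple in removed:
            if triple in net_added:
                # Added earlier in this window, removed now: nothing to send.
                net_added.discard(triple)
            else:
                net_removed.add(triple)
        for triple in added:
            if triple in net_removed:
                # Removed earlier in this window, restored now: nothing to send.
                net_removed.discard(triple)
            else:
                net_added.add(triple)
    return net_removed, net_added


if __name__ == "__main__":
    # A value is changed twice within the replication window.
    incremental = [
        ({"Q1 label v1"}, {"Q1 label v2"}),
        ({"Q1 label v2"}, {"Q1 label v3"}),
    ]
    print(coalesce_diffs(incremental))
```

When the same triples change repeatedly between replication runs, the cumulative DIFF carries only the net change; when every change touches different triples, nothing is saved and the replica simply lags longer, which is one of the variables mentioned above.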

This page of cluster topology diagrams starts with Replication-only and progresses to
Elastic-only.  (There are no illustrations of a combined Replicating-Elastic-Cluster on
this page.)

   http://vos.openlinksw.com/owiki/wiki/VOS/VirtClusteringDiagrams

Any Replication Cluster topology and methodology -- including zero Replication -- may be
combined with an Elastic (Shared-Nothing) Cluster setup.  Generally speaking, when these
are combined, an entire Elastic Cluster would take the place of each Single-Server
Instance in a given Replication topology.

I hope this helps your understanding of the available options.

Ted

--
A: Yes.                          http://www.idallen.com/topposting.html
| Q: Are you sure?           
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?

Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
Senior Support & Evangelism  //        mailto:tthibodeau@openlinksw.com
                             //              http://twitter.com/TallTed
OpenLink Software, Inc.      //              http://www.openlinksw.com/
         20 Burlington Mall Road, Suite 322, Burlington MA 01803
     Weblog    -- http://www.openlinksw.com/blogs/
     Community -- https://community.openlinksw.com/
     LinkedIn  -- http://www.linkedin.com/company/openlink-software/
     Twitter   -- http://twitter.com/OpenLink
     Facebook  -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers
