[Wikidata] Re: Wikidata Query Service scaling update Aug 2021

23 Aug 2021

Hi Samuel, All,

I am the software engineer responsible for sparql.uniprot.org.
I already offered to help in https://phabricator.wikimedia.org/T206561.
So no need to ask Andra or Egon ;)

We are happy users of Virtuoso and strongly suggest that it be 
evaluated, as it is in general a good product that does scale.[1]

One of the things we did differently from WDQS is to introduce a 
controlled layer between the "public" and the "database". This allows 
things like query rewriting/redirection upon data model changes, as 
well as rewriting some schema rediscovery queries to a known faster 
query. We also parse the queries with RDF4J before handing them to 
Virtuoso. This makes sure that we accept only queries that are valid 
SPARQL 1.1, which avoids users getting used to almost-SPARQL dialects 
(i.e. we retain the flexibility to move to a different endpoint). We 
are in the process of updating this code and contributing it to RDF4J, 
with the first contribution in the develop/4.0.0 branch.
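
A minimal sketch of such a controlled layer, in Python rather than the 
Java/RDF4J code described above. Every name in it (QueryGateway, 
toy_validator, the two example queries) is hypothetical and not taken 
from the UniProt code base. The validator is only a placeholder for a 
real SPARQL 1.1 parser, and the slow/fast query pair is illustrative: 
whether the two are equivalent depends on the dataset.

```python
import re
from typing import Callable, Dict


class QueryRejected(Exception):
    """Raised when the gateway refuses to forward a query."""


def normalize(query: str) -> str:
    """Collapse runs of whitespace so layout differences do not matter."""
    return " ".join(query.split())


# Placeholder check only: accepts text that begins with a SPARQL 1.1
# prologue or query-form keyword. A real gateway would call a full
# SPARQL 1.1 parser here (RDF4J in the setup described above).
_STARTS_LIKE_SPARQL = re.compile(
    r"\s*(PREFIX|BASE|SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE
)


def toy_validator(query: str) -> bool:
    return _STARTS_LIKE_SPARQL.match(query) is not None


# Hypothetical example of a schema rediscovery query and a cheaper
# replacement for it.
SLOW_SCHEMA_QUERY = "SELECT DISTINCT ?p WHERE { ?s ?p ?o }"
FAST_SCHEMA_QUERY = (
    "SELECT ?p WHERE "
    "{ ?p a <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> }"
)


class QueryGateway:
    """Validate, optionally rewrite, then forward a query to the backend."""

    def __init__(
        self,
        validate: Callable[[str], bool],
        backend: Callable[[str], str],
        rewrites: Dict[str, str],
    ) -> None:
        self._validate = validate
        self._backend = backend
        # Keys are normalized once so lookups ignore whitespace layout.
        self._rewrites = {normalize(k): v for k, v in rewrites.items()}

    def run(self, query: str) -> str:
        if not self._validate(query):
            raise QueryRejected("query is not accepted as SPARQL 1.1")
        # Swap in the known faster query when one is registered.
        query = self._rewrites.get(normalize(query), query)
        return self._backend(query)


if __name__ == "__main__":
    gateway = QueryGateway(
        toy_validator,
        backend=lambda q: "forwarded: " + q,
        rewrites={SLOW_SCHEMA_QUERY: FAST_SCHEMA_QUERY},
    )
    print(gateway.run("SELECT  DISTINCT ?p\nWHERE { ?s ?p ?o }"))
```

Because the public only ever talks to the gateway, the rewrite table 
and the validator can change without users noticing, and anything the 
validator rejects never reaches the database.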

I think a number of current customizations in WDQS can be moved to a 
front RDF4J layer. The RDF4J sail/repository layer can then be used to 
preserve flexibility, so that WDQS can more easily switch between 
backend databases in the future.
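
The idea of keeping the backend swappable behind a fixed interface can 
be sketched as follows. This is an analogy in Python, not the RDF4J 
sail/repository API: QueryBackend, InMemoryBackend and FrontEnd are 
hypothetical names, and the backends return canned rows instead of 
evaluating the query.

```python
from typing import Dict, List, Protocol


class QueryBackend(Protocol):
    """The only contract the front layer relies on."""

    def evaluate(self, query: str) -> List[Dict[str, str]]:
        ...


class InMemoryBackend:
    """Stand-in for a real store; ignores the query and returns fixed rows."""

    def __init__(self, name: str, rows: List[Dict[str, str]]) -> None:
        self.name = name
        self.rows = rows

    def evaluate(self, query: str) -> List[Dict[str, str]]:
        # Tag each row with the store name so the caller can see
        # which backend answered.
        return [dict(row, source=self.name) for row in self.rows]


class FrontEnd:
    """Public entry point; callers never see the concrete backend."""

    def __init__(self, backend: QueryBackend) -> None:
        self._backend = backend

    def switch_backend(self, backend: QueryBackend) -> None:
        self._backend = backend

    def query(self, query: str) -> List[Dict[str, str]]:
        return self._backend.evaluate(query)
```

Callers keep using FrontEnd.query unchanged after switch_backend is 
called, which is the property the front layer is meant to preserve.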

One large difference between UniProt and WDQS is that Wikidata is 
continually updated while UniProt is batch released a few times a year.
WDQS is somewhat easier in some areas and more difficult in others 
because of that.

Regards,
Jerven

[1] No database is perfect, but Virtuoso does scale a lot better than 
Blazegraph, which we also evaluated in the past. There is still a lot 
of potential in Virtuoso to scale even better in the future.

On 23/08/2021 21:36, Samuel Klein wrote:
...
 Ah, that's lovely.  Thanks for the update, Kingsley!  Uniprot is a good 
 parallel to keep in mind.

 For Egon, Andra, others who work with them: Is there someone you'd 
 recommend chatting with at uniprot?
 "scaling alongside uniprot" or at least engaging them on how to solve 
 shared + comparable issues (they also offer authentication-free SPARQL 
 querying) sounds like a compelling option.

 S.

 On Thu, Aug 19, 2021 at 4:32 PM Kingsley Idehen via Wikidata 
 <wikidata(a)lists.wikimedia.org <mailto:wikidata@lists.wikimedia.org>> wrote:

     On 8/18/21 5:07 PM, Mike Pham wrote:

     Wikidata community members,

     Thank you for all of your work helping Wikidata grow and improve
     over the years. In the spirit of better communication, we would
     like to take this opportunity to share some of the current
     challenges Wikidata Query Service (WDQS) is facing, and some
     strategies we have for dealing with them.

     WDQS currently risks failing to provide acceptable service quality
     due to the following reasons:

     1. Blazegraph scaling

         1. Graph size. WDQS uses Blazegraph as our graph backend.
            While Blazegraph can theoretically support 50 billion
            edges <https://blazegraph.com/>, in reality Wikidata is
            the largest graph we know of running on Blazegraph (~13
            billion triples
            <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m>),
            and there is a risk that we will reach a size limit
            <https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29>
            of what it can realistically support
            <https://phabricator.wikimedia.org/T213210>. Once
            Blazegraph is maxed out, WDQS can no longer be updated.
            This will also break Wikidata tools that rely on WDQS.

         2. Software support. Blazegraph is end of life software,
            which is no longer actively maintained, making it an
            unsustainable backend to continue moving forward with long
            term.

     Blazegraph maxing out in size poses the greatest risk for
     catastrophic failure, as it would effectively prevent WDQS from
     being updated further, so WDQS would inevitably fall out of date. Our long
     term strategy to address this is to move to a new graph backend
     that best meets our WDQS needs and is actively maintained, and
     begin the migration off of Blazegraph as soon as a viable
     alternative is identified
     <https://phabricator.wikimedia.org/T206560>.

     Hi Mike,

     Do bear in mind that pre and post selection of Blazegraph for
     Wikidata, we've always offered an RDF-based DBMS that can handle
     current and future requirements for Wikidata, just as we do DBpedia.

     At the time of our first rendezvous, handling 50 billion triples
     would have typically required our Cluster Edition which is a
     Commercial Only offering -- basically, that was the deal breaker
     back then.

     Anyway, in recent times, our Open Source Edition has evolved to
     handle some 80 Billion+ triples (exemplified by the live Uniprot
     instance) where performance and scale is primarily a function of
     available memory.

     I hope this helps.

     Related:

     [1] https://wikidata.demo.openlinksw.com/sparql
     <https://wikidata.demo.openlinksw.com/sparql> -- Our Live Wikidata
     SPARQL Query Endpoint
     [2]

https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97f…

<https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0>
     -- Google Spreadsheet about various Virtuoso Configurations
     associated with some well-known public endpoints
     [3] https://t.co/EjAAO73wwE <https://t.co/EjAAO73wwE> -- this query
     doesn't complete with the current Blazegraph-based Wikidata endpoint
     [4] https://t.co/GTATPPJNBI <https://t.co/GTATPPJNBI> -- same query
     completing when applied to the Virtuoso-based endpoint
     [5] https://t.co/X7mLmcYC69 <https://t.co/X7mLmcYC69> -- about
     loading Wikidata's datasets into a Virtuoso instance
     [6]

https://twitter.com/search?q=%23Wikidata%20%23VirtuosoRDBMS%20%40kidehen&am…

<https://twitter.com/search?q=%2523Wikidata%20%2523VirtuosoRDBMS%20%2540kidehen&src=typed_query&f=live>
     -- various demos shared via Twitter over the years regarding Wikidata

     -- 
     Regards,

     Kingsley Idehen	
     Founder & CEO
     OpenLink Software
     Home Page:http://www.openlinksw.com  <http://www.openlinksw.com>
     Community Support:https://community.openlinksw.com 
<https://community.openlinksw.com>
     Weblogs (Blogs):
     Company Blog:https://medium.com/openlink-software-blog 
<https://medium.com/openlink-software-blog>
     Virtuoso Blog:https://medium.com/virtuoso-blog 
<https://medium.com/virtuoso-blog>
     Data Access Drivers
Blog:https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers 
<https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers>

     Personal Weblogs (Blogs):
     Medium Blog:https://medium.com/@kidehen  <https://medium.com/@kidehen>
     Legacy Blogs:http://www.openlinksw.com/blog/~kidehen/ 
<http://www.openlinksw.com/blog/~kidehen/>
                    http://kidehen.blogspot.com  <http://kidehen.blogspot.com>

     Profile Pages:
     Pinterest:https://www.pinterest.com/kidehen/ 
<https://www.pinterest.com/kidehen/>
     Quora:https://www.quora.com/profile/Kingsley-Uyi-Idehen 
<https://www.quora.com/profile/Kingsley-Uyi-Idehen>
     Twitter:https://twitter.com/kidehen  <https://twitter.com/kidehen>
     Google+:https://plus.google.com/+KingsleyIdehen/about 
<https://plus.google.com/+KingsleyIdehen/about>
     LinkedIn:http://www.linkedin.com/in/kidehen 
<http://www.linkedin.com/in/kidehen>

     Web Identities (WebID):
     Personal:http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i 
<http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i>

:http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this 
<http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this>

     _______________________________________________
     Wikidata mailing list -- wikidata(a)lists.wikimedia.org
     <mailto:wikidata@lists.wikimedia.org>
     To unsubscribe send an email to wikidata-leave(a)lists.wikimedia.org
     <mailto:wikidata-leave@lists.wikimedia.org>

 -- 
 Samuel Klein          @metasj           w:user:sj          +1 617 529 4266


-- 

	*Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman(a)sib.swiss - www.sib.swiss
