On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <wikidata(a)lists.wikimedia.org> wrote:

On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
Hello all!
TL;DR: We expect to successfully complete the recent data reload on Wikidata Query Service soon, but we have encountered multiple failures related to the size of the graph, and we anticipate that this issue may worsen in the future. Although we succeeded this time, we cannot guarantee that future reload attempts will be successful given the current trend of the data reload process. Thank you for your understanding and patience.
Longer version:
WDQS is updated from a stream of recent changes on Wikidata, with a maximum delay of ~2 minutes. This process was improved as part of the WDQS Streaming Updater project to ensure data coherence[1]. However, the update process is still imperfect and can lead to data inconsistencies in some cases[2][3]. To address this, we reload the data from dumps a few times per year to reinitialize the system from a known good state.
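As a side note, the current update lag of a WDQS server can be checked over SPARQL itself. Below is a minimal Python sketch; the endpoint and the wikibase:Dump / schema:dateModified convention are the public ones, but the plumbing around them is illustrative, not a supported tool.

# Minimal sketch: measure the current WDQS update lag via the public
# SPARQL endpoint. Assumes the `requests` library is installed.
from datetime import datetime, timezone

import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# wikibase:Dump schema:dateModified holds the timestamp of the most
# recently applied change on the server answering the query.
QUERY = """
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
SELECT ?updated WHERE { wikibase:Dump schema:dateModified ?updated }
"""

def current_lag_seconds() -> float:
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wdqs-lag-check/0.1 (example)"},
        timeout=30,
    )
    response.raise_for_status()
    value = response.json()["results"]["bindings"][0]["updated"]["value"]
    updated = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - updated).total_seconds()

if __name__ == "__main__":
    print(f"WDQS update lag: {current_lag_seconds():.0f}s")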
The recent reload of data from dumps started in mid-December and was initially met with some issues related to download and instabilities in Blazegraph, the database used by WDQS[4]. Loading the data into Blazegraph takes a couple of weeks due to the size of the graph, and we had multiple attempts where the reload failed after >90% of the data had been loaded. Our understanding of the issue is that a "race condition" in Blazegraph[5], where subtle timing changes lead to corruption of the journal in some rare cases, is to blame[6].
We want to reassure you that the last reload job was successful on one of our servers. The data still needs to be copied over to all of the WDQS servers, which will take a couple of weeks, but should not bring any additional issues. However, reloading the full data from dumps is becoming more complex as the data size grows, and we wanted to let you know why the process took longer than expected. We understand that data inconsistencies can be problematic, and we appreciate your patience and understanding while we work to ensure the quality and consistency of the data on WDQS.
Thank you for your continued support and understanding!
Guillaume
[1] https://phabricator.wikimedia.org/T244590
[2] https://phabricator.wikimedia.org/T323239
[3] https://phabricator.wikimedia.org/T322869
[4] https://phabricator.wikimedia.org/T323096
[5] https://en.wikipedia.org/wiki/Race_condition#In_software
[6] https://phabricator.wikimedia.org/T263110
Hi Guillaume,
Are there plans to decouple WDQS from the back-end database? Doing that provides a more resilient architecture for Wikidata as a whole, since you will be able to swap and interchange SPARQL-compliant backends.
It depends what you mean by decoupling. The coupling points as I see them are:
* update process
* UI
* exposed SPARQL endpoint
The update process is mostly decoupled from the backend. It produces a stream of RDF updates that is backend independent, with a very thin Blazegraph-specific adapter to load the data into Blazegraph.
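To make that concrete, the adapter conceptually boils down to something like the sketch below. The patch structure and function names here are hypothetical, not the actual Streaming Updater wire format; the point is only that each change reduces to standard SPARQL 1.1 Update operations.

# Hypothetical sketch of a backend-independent consumer of the RDF
# update stream. `RdfPatch` is a placeholder for whatever the stream
# carries; applying it only needs standard SPARQL 1.1 Update.
from dataclasses import dataclass, field

import requests

@dataclass
class RdfPatch:
    deleted: list[str] = field(default_factory=list)   # N-Triples lines
    inserted: list[str] = field(default_factory=list)  # N-Triples lines

def apply_patch(update_endpoint: str, patch: RdfPatch) -> None:
    """Translate one patch into a standard SPARQL 1.1 Update request."""
    operations = []
    if patch.deleted:
        operations.append("DELETE DATA { %s }" % " ".join(patch.deleted))
    if patch.inserted:
        operations.append("INSERT DATA { %s }" % " ".join(patch.inserted))
    if not operations:
        return
    # The SPARQL 1.1 Protocol sends updates as the `update` form field.
    response = requests.post(
        update_endpoint,
        data={"update": " ;\n".join(operations)},
        timeout=60,
    )
    response.raise_for_status()

Anything that speaks standard SPARQL 1.1 Update could sit behind apply_patch, which is what makes the stream itself backend independent.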
Does that mean that we could integrate the RDF stream into our setup to keep our Wikidata instance up to date, for instance?
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatching SPARQL queries input by a user, without alteration, en route to server-side processing.
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
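For example, Blazegraph accepts optimizer hints in its hint: namespace, a documented Blazegraph extension that no other engine understands. A small sketch of such a query (the query itself is illustrative):

# Sketch: a query tied to the current backend through Blazegraph's
# query hints, even though everything else in it is standard SPARQL.
BLAZEGRAPH_SPECIFIC_QUERY = """
PREFIX hint: <http://www.bigdata.com/queryHints#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item WHERE {
  hint:Query hint:optimizer "None" .  # Blazegraph-only: disable the join optimizer
  ?item wdt:P31 ?class .
}
LIMIT 10
"""

A generic engine would treat the hint: triple as an ordinary pattern to match against the data, and the query would likely return nothing.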
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
What we would like to do at some point (this is no more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
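A very rough sketch of the idea follows; everything in it is hypothetical (the blocklist, the names, the rejection policy), and a real implementation would parse queries rather than match substrings.

# Hypothetical sketch of a filtering SPARQL proxy: requests using
# backend-specific constructs are rejected before reaching Blazegraph.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

import requests

# Assumed local Blazegraph endpoint, as deployed for WDQS.
BACKEND = "http://localhost:9999/bigdata/namespace/wdq/sparql"

# Namespaces only the current backend understands (illustrative list).
BACKEND_SPECIFIC_MARKERS = (
    "http://www.bigdata.com/queryHints#",  # Blazegraph query hints
    "http://www.bigdata.com/rdf#",         # bd: service parameters
)

class FilteringProxy(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        query = parse_qs(urlparse(self.path).query).get("query", [""])[0]
        if any(marker in query for marker in BACKEND_SPECIFIC_MARKERS):
            self.send_response(400)
            self.end_headers()
            self.wfile.write(b"backend-specific SPARQL feature rejected\n")
            return
        upstream = requests.get(BACKEND, params={"query": query}, timeout=60)
        self.send_response(upstream.status_code)
        self.send_header(
            "Content-Type", upstream.headers.get("Content-Type", "text/plain"))
        self.end_headers()
        self.wfile.write(upstream.content)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), FilteringProxy).serve_forever()

Whatever reaches the backend is then restricted to a portable subset of SPARQL, at the cost of the features filtered out.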
I'm not sure I entirely understood the question; please let me know if my answer is missing the point.
Have fun!
Guillaume
http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this