[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

23 Feb 2023

On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen <kidehen(a)openlinksw.com>
wrote:

...

 On 2/22/23 3:28 AM, Guillaume Lederrey wrote:

 On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <
 wikidata(a)lists.wikimedia.org> wrote:

 On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
  Hello all!

 TL;DR: We expect to successfully complete the recent data reload on
 Wikidata Query Service soon, but we've encountered multiple failures
 related to the size of the graph, and anticipate that this issue may
 worsen in the future. Although we succeeded this time, we cannot
 guarantee that future reload attempts will be successful given the
 current trend of the data reload process. Thank you for your
 understanding and patience.

 Longer version:

 WDQS is updated from a stream of recent changes on Wikidata, with a
 maximum delay of ~2 minutes. This process was improved as part of the
 WDQS Streaming Updater project to ensure data coherence[1]. However,
 the update process is still imperfect and can lead to data
 inconsistencies in some cases[2][3]. To address this, we reload the
 data from dumps a few times per year to reinitialize the system from a
 known good state.

 The recent reload of data from dumps started in mid-December and was
 initially met with some issues related to download and instabilities
 in Blazegraph, the database used by WDQS[4]. Loading the data into
 Blazegraph takes a couple of weeks due to the size of the graph, and
 we had multiple attempts where the reload failed after >90% of the
 data had been loaded. Our understanding of the issue is that a "race
 condition" in Blazegraph[5], where subtle timing changes lead to
 corruption of the journal in some rare cases, is to blame.[6]

 We want to reassure you that the last reload job was successful on one
 of our servers. The data still needs to be copied over to all of the
 WDQS servers, which will take a couple of weeks, but should not bring
 any additional issues. However, reloading the full data from dumps is
 becoming more complex as the data size grows, and we wanted to let you
 know why the process took longer than expected. We understand that
 data inconsistencies can be problematic, and we appreciate your
 patience and understanding while we work to ensure the quality and
 consistency of the data on WDQS.

 Thank you for your continued support and understanding!

     Guillaume

 [1] https://phabricator.wikimedia.org/T244590
 [2] https://phabricator.wikimedia.org/T323239
 [3] https://phabricator.wikimedia.org/T322869
 [4] https://phabricator.wikimedia.org/T323096
 [5] https://en.wikipedia.org/wiki/Race_condition#In_software
 [6] https://phabricator.wikimedia.org/T263110
  Hi Guillaume,

 Are there plans to decouple WDQS from the back-end database? Doing that
 provides more resilient architecture for Wikidata as a whole since you
 will be able to swap and interchange SPARQL-compliant backends.

 It depends what you mean by decoupling. The coupling points as I see them
 are:

 * update process
 * UI
 * exposed SPARQL endpoint

 The update process is mostly decoupled from the backend. It produces a
 stream of RDF updates that is backend independent, with a very thin
 Blazegraph-specific adapter to load the data into Blazegraph.

 Does that mean that we could integrate the RDF stream into our setup re
 keeping our Wikidata instance up to date, for instance?
 That data stream isn't exposed publicly. There are a few tricky parts about
the stream needing to be synchronized with a specific Wikidata dump that
make it not entirely trivial to reuse outside of our internal use case.
But if there is enough interest, we could potentially work on making that
stream public.
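
To make the "backend-independent stream plus thin adapter" idea concrete, here is a minimal sketch. It is not the WDQS Streaming Updater's actual format or code: the `RdfUpdate` shape, the `BackendAdapter` interface, and the `RecordingAdapter` stand-in are all hypothetical names chosen for illustration. The only standard piece is the SPARQL 1.1 Update text (`DELETE DATA` / `INSERT DATA`) that the adapter renders.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple

# A triple whose terms are already serialized in N-Triples syntax (assumption).
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class RdfUpdate:
    """One backend-independent change (hypothetical shape)."""
    deleted: List[Triple]
    added: List[Triple]


class BackendAdapter(Protocol):
    """The thin, backend-specific part: it only knows how to apply one update."""
    def apply(self, update: RdfUpdate) -> None: ...


def to_sparql_update(update: RdfUpdate) -> str:
    """Render an update as standard SPARQL 1.1 Update text."""
    def block(triples: List[Triple]) -> str:
        return " ".join(f"{s} {p} {o} ." for s, p, o in triples)

    parts = []
    if update.deleted:
        parts.append(f"DELETE DATA {{ {block(update.deleted)} }}")
    if update.added:
        parts.append(f"INSERT DATA {{ {block(update.added)} }}")
    return " ;\n".join(parts)


class RecordingAdapter:
    """Stand-in backend that records the update text instead of sending it."""
    def __init__(self) -> None:
        self.sent: List[str] = []

    def apply(self, update: RdfUpdate) -> None:
        self.sent.append(to_sparql_update(update))


def replay(stream: Iterable[RdfUpdate], backend: BackendAdapter) -> int:
    """Feed every update in the stream to the backend; return how many were applied."""
    count = 0
    for update in stream:
        backend.apply(update)
        count += 1
    return count
```

In this sketch, swapping backends means writing another small class with an `apply` method; the stream and `replay` stay unchanged. The dump synchronization issue mentioned above is not modeled here.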

...

 The UI is mostly backend independent. It relies on Search for some
 features. And of course, the queries themselves might depend on
 Blazegraph-specific features.

 Can WDQS, based on what's stated above, work with a generic SPARQL
 back-end like Virtuoso, for instance? By that I mean dispatch SPARQL
 queries input by a user (without alteration) en route to server processing?
  The WDQS UI is managed by WMDE, so my knowledge is limited. Maybe someone
from WMDE could jump in and add more context. That being said, as far as I
know, pointing it to a different backend is just a configuration option.
Feel free to have a look at the code (
https://gerrit.wikimedia.org/g/wikidata/query/gui). It should be reasonably
easy to deploy another WDQS UI instance somewhere else, which points to
whatever backend you'd like.
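
As a rough illustration of why the backend can be a configuration value, the sketch below builds a query request as defined by the SPARQL 1.1 Protocol (query sent via GET in the `query` parameter). It does not show how the WDQS GUI is actually configured; the function name and the `sparql.example.org` endpoint are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request for an arbitrary endpoint URL.

    Nothing here is specific to one backend: only `endpoint` changes.
    """
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Hypothetical endpoint; the query text is passed through unaltered.
request = build_query_request(
    "https://sparql.example.org/sparql",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
)
```

Passing the resulting request to `urllib.request.urlopen` would send it. Whether a given query then runs depends on the features it uses, not on the client.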

As a policy, we don't send traffic to any third party, so we will not set up
such an instance.

...

 The exposed SPARQL endpoint is at the moment a direct exposure of the
 Blazegraph endpoint, so it does expose all the Blazegraph-specific features
 and quirks.

 Is there a Query Service that's separated from the Blazegraph endpoint?
 The crux of the matter here is that WDQS benefits more by being loosely-
 bound to endpoints rather than tightly-bound to the Blazegraph endpoint.
 It depends what you mean by Query Service. My definition of a Query Service
in this context is a SPARQL endpoint with a specific data set. That SPARQL
endpoint at the moment is Blazegraph. I'm not entirely clear what kind of
loose binding you'd like to see in this context. We might have different
definitions of the same words here.

...

 What we would like to do at some point (this is not more than a rough idea
 at this point) is to add a proxy in front of the SPARQL endpoint, that
 would filter specific SPARQL features, so that we limit what is available
 to a standard set of features available across most potential backends.
 This would help reduce the coupling of queries with the backend. Of course,
 this would have the drawback of limiting the feature set.
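
 The filtering step of such a proxy could look roughly like the sketch below. This is only an illustration of the idea, not a planned design: it assumes that backend-specific features can be spotted through the prefixes conventionally bound to Blazegraph namespaces on WDQS (`hint:`, `bd:`, `bds:`, `gas:`), and the list is illustrative rather than complete. A real proxy would parse the query instead of using a regular expression, which can also match text inside string literals or comments.

```python
import re
from typing import List

# Prefixes conventionally bound to Blazegraph namespaces on WDQS.
# Illustrative list (assumption), not an official inventory.
BACKEND_PREFIXES = ("hint", "bds", "bd", "gas")

# Match a listed prefix followed by ':' when it is not part of a longer name.
_PATTERN = re.compile(r"(?<![\w:])(" + "|".join(BACKEND_PREFIXES) + r"):")


def backend_specific_prefixes(query: str) -> List[str]:
    """Return the sorted, de-duplicated backend-specific prefixes used in the query."""
    return sorted({match.group(1) for match in _PATTERN.finditer(query)})


def is_portable(query: str) -> bool:
    """True when the query uses none of the listed backend-specific prefixes."""
    return not backend_specific_prefixes(query)
```

 With this check, a query using the label service through `bd:serviceParam` would be flagged, which shows the drawback mentioned above: filtering to a portable feature set also blocks features that many existing queries rely on.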

 I'm not sure I entirely understood the question, please let me know if my
 answer is missing the point.

   Have fun!

     Guillaume

 --
 Regards,

 Kingsley Idehen	
 Founder & CEO
 OpenLink Software
 Home Page: http://www.openlinksw.com
 Community Support: https://community.openlinksw.com
 Weblogs (Blogs):
 Company Blog: https://medium.com/openlink-software-blog
 Virtuoso Blog: https://medium.com/virtuoso-blog
 Data Access Drivers Blog:
https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

 Personal Weblogs (Blogs):
 Medium Blog: https://medium.com/@kidehen
 Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
               http://kidehen.blogspot.com

 Profile Pages:
 Pinterest: https://www.pinterest.com/kidehen/
 Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
 Twitter: https://twitter.com/kidehen
 Google+: https://plus.google.com/+KingsleyIdehen/about
 LinkedIn: http://www.linkedin.com/in/kidehen

 Web Identities (WebID):
 Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
         : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this

-- 
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
