[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

23 Feb 2023

On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
...
 On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen <kidehen(a)openlinksw.com> wrote:

     On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
     On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata
     <wikidata(a)lists.wikimedia.org> wrote:

         On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
 Hello all!

 TL;DR: We expect to successfully complete the recent data reload on
 Wikidata Query Service soon, but we've encountered multiple failures
 related to the size of the graph, and anticipate that this issue may
 worsen in the future. Although we succeeded this time, we cannot
 guarantee that future reload attempts will be successful given the
 current trend of the data reload process. Thank you for your
 understanding and patience.

 Longer version:

 WDQS is updated from a stream of recent changes on Wikidata, with a
 maximum delay of ~2 minutes. This process was improved as part of the
 WDQS Streaming Updater project to ensure data coherence[1]. However,
 the update process is still imperfect and can lead to data
 inconsistencies in some cases[2][3]. To address this, we reload the
 data from dumps a few times per year to reinitialize the system from a
 known good state.

 The recent reload of data from dumps started in mid-December and was
 initially met with some issues related to download and instabilities
 in Blazegraph, the database used by WDQS[4]. Loading the data into
 Blazegraph takes a couple of weeks due to the size of the graph, and
 we had multiple attempts where the reload failed after >90% of the
 data had been loaded. Our understanding of the issue is that a "race
 condition" in Blazegraph[5], where subtle timing changes lead to
 corruption of the journal in some rare cases, is to blame.[6]

 We want to reassure you that the last reload job was successful on one
 of our servers. The data still needs to be copied over to all of the
 WDQS servers, which will take a couple of weeks, but should not bring
 any additional issues. However, reloading the full data from dumps is
 becoming more complex as the data size grows, and we wanted to let you
 know why the process took longer than expected. We understand that
 data inconsistencies can be problematic, and we appreciate your
 patience and understanding while we work to ensure the quality and
 consistency of the data on WDQS.

 Thank you for your continued support and understanding!

     Guillaume

 [1] https://phabricator.wikimedia.org/T244590
 [2] https://phabricator.wikimedia.org/T323239
 [3] https://phabricator.wikimedia.org/T322869
 [4] https://phabricator.wikimedia.org/T323096
 [5] https://en.wikipedia.org/wiki/Race_condition#In_software
 [6] https://phabricator.wikimedia.org/T263110

          Hi Guillaume,

         Are there plans to decouple WDQS from the back-end database?
         Doing that provides a more resilient architecture for Wikidata
         as a whole, since you will be able to swap and interchange
         SPARQL-compliant backends.

     It depends what you mean by decoupling. The coupling points as I
     see them are:

     * update process
     * UI
     * exposed SPARQL endpoint

     The update process is mostly decoupled from the backend. It
     produces a stream of RDF updates that is backend-independent,
     with a very thin Blazegraph-specific adapter to load the data
     into Blazegraph.

     Does that mean that we could integrate the RDF stream into our
     setup re keeping our Wikidata instance up to date, for instance?

 That data stream isn't exposed publicly. There are a few tricky parts
 about the stream needing to be synchronized with a specific Wikidata
 dump that make it not entirely trivial to reuse outside of our
 internal use case. But if there is enough interest, we could
 potentially work on making that stream public.

I suspect there's broad interest in this matter since it contributes to 
the overarching issue of loose-coupling re Wikidata's underlying 
infrastructure.

For starters, offering a public stream would be very useful to 3rd party 
Wikidata hosts.
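
To make the idea concrete, here is a minimal sketch in Python of how a 3rd party host might consume a backend-independent stream of RDF updates through a thin adapter. The stream is not public and its format is not described in this thread, so the RdfPatch shape (triples removed, triples added), the Backend interface and the RecordingBackend stand-in are all hypothetical names chosen for illustration. The only standard piece is the output, which is plain SPARQL 1.1 Update (DELETE DATA / INSERT DATA).

```python
from dataclasses import dataclass
from typing import Iterable, Protocol

# A triple whose terms are already serialized in N-Triples syntax.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class RdfPatch:
    """Hypothetical shape of one stream event (an assumption, not the WDQS format)."""
    removed: tuple[Triple, ...]
    added: tuple[Triple, ...]


class Backend(Protocol):
    """All the updater needs from a store: accept SPARQL 1.1 Update text."""
    def update(self, sparql_update: str) -> None: ...


def _block(keyword: str, triples: Iterable[Triple]) -> str:
    # Render "KEYWORD { s p o . ... }", or an empty string when there are no triples.
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
    return f"{keyword} {{\n{body}\n}}" if body else ""


def patch_to_sparql(patch: RdfPatch) -> str:
    """Translate a patch into standard SPARQL 1.1 Update, with no vendor extensions."""
    parts = [_block("DELETE DATA", patch.removed), _block("INSERT DATA", patch.added)]
    return " ;\n".join(p for p in parts if p)


class RecordingBackend:
    """Stand-in adapter for the example; a real one would POST to the store."""
    def __init__(self) -> None:
        self.received: list[str] = []

    def update(self, sparql_update: str) -> None:
        self.received.append(sparql_update)


def apply_stream(patches: Iterable[RdfPatch], backend: Backend) -> int:
    """Apply patches in order and return how many non-empty updates were sent."""
    sent = 0
    for patch in patches:
        text = patch_to_sparql(patch)
        if text:
            backend.update(text)
            sent += 1
    return sent
```

Swapping the store would then mean writing another small class with an update() method, while the stream handling stays untouched. The sketch ignores the synchronization with a specific dump mentioned above, which is the hard part.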

...

     The UI is mostly backend independent. It relies on Search for
     some features. And of course, the queries themselves might depend
     on Blazegraph specific features.

     Can WDQS, based on what's stated above, work with a generic SPARQL
     back-end like Virtuoso, for instance? By that I mean dispatch
     SPARQL queries input by a user (without alteration) en route to
     server processing?

 The WDQS UI is managed by WMDE, so my knowledge is limited. Maybe
 someone from WMDE could jump in and add more context. That being said,
 as far as I know, pointing it to a different backend is just a
 configuration option. Feel free to have a look at the code
 (https://gerrit.wikimedia.org/g/wikidata/query/gui).

I'll take a look.

...
 It should be reasonably easy to deploy another WDQS UI instance
 somewhere else, which points to whatever backend you'd like.

Okay, I assume that in the current state it would be sending 
Blazegraph-specific SPARQL?

...

 As a policy, we don't send traffic to any third party, so we will not
 set up such an instance.

     The exposed SPARQL endpoint is at the moment a direct exposure
     of the Blazegraph endpoint, so it does expose all the
     Blazegraph-specific features and quirks.

     Is there a Query Service that's separated from the Blazegraph
     endpoint? The crux of the matter here is that WDQS benefits more
     by being loosely bound to endpoints rather than tightly bound to
     the Blazegraph endpoint.

 It depends what you mean by Query Service. My definition of a Query 
 Service in this context is a SPARQL endpoint with a specific data set.

Yes, but in the case of Wikidata that's a combination of both a SPARQL 
Query Service (query processor and endpoint) and WDQS query solution 
rendering services.

...
 That SPARQL endpoint at the moment is Blazegraph. I'm not entirely
 clear what kind of loose binding you'd like to see in this context. We
 might have different definitions of the same words here.

Loose-coupling, in the context I am describing, would comprise the 
following:

1. WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI 
<https://github.com/TriplyDB/Yasgui#this>

2. Near real-time data streams usable by 3rd Party Wikidata hosts

With the above in place, the cost and burden associated with Wikidata 
hosting will also be reduced -- courtesy of federation.
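
As a rough illustration of point 1, the sketch below sends the same unaltered query text to whichever endpoint is configured, using only the SPARQL 1.1 Protocol (query via GET, JSON results). https://query.wikidata.org/sparql is the public WDQS endpoint; the second URL, the function names and the User-Agent string are placeholders I made up for the example, not anything from the WDQS GUI code.

```python
import json
import urllib.parse
import urllib.request

# Endpoint choice is plain configuration; "other" is a hypothetical 3rd party host.
ENDPOINTS = {
    "wdqs": "https://query.wikidata.org/sparql",
    "other": "https://sparql.example.org/sparql",
}

# Standard SPARQL only, so any compliant backend should accept it.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 3"


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol 'query via GET' request for any endpoint."""
    separator = "&" if "?" in endpoint else "?"
    url = endpoint + separator + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "loose-coupling-sketch/0.1 (example)",
        },
    )


def run_select(endpoint: str, query: str, timeout: float = 30.0) -> list[dict]:
    """Send the query unaltered and return the result bindings."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

Calling run_select(ENDPOINTS["wdqs"], QUERY) needs network access; switching to another host is a one-word change in the call. Queries that lean on backend-specific extensions would of course still fail elsewhere, which is the coupling being discussed.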

...

     What we would like to do at some point (this is not more than a
     rough idea at this point) is to add a proxy in front of the
     SPARQL endpoint, that would filter specific SPARQL features, so
     that we limit what is available to a standard set of features
     available across most potential backends. This would help reduce
     the coupling of queries with the backend. Of course, this would
     have the drawback of limiting the feature set.

As you've stated, that narrows the service's focus rather than diffusing 
the service burden :)
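
For what it's worth, the filtering step of such a proxy could look roughly like the sketch below. It is a text heuristic, not a SPARQL parser, and the deny-list is my own assumption about which markers signal backend-specific features (Blazegraph query hints under the hint: prefix, the bd: namespace, and the wikibase:label service); it matches on conventional prefix names only, so a query that declares its own prefixes would slip through. A real proxy would need to parse the query.

```python
import re

# Assumed deny-list of vendor-specific markers (illustrative, not exhaustive).
VENDOR_PATTERNS = {
    "Blazegraph query hint": re.compile(r"\bhint:\w+", re.IGNORECASE),
    "Blazegraph bd: namespace": re.compile(r"\bbd:\w+", re.IGNORECASE),
    "Wikibase label service": re.compile(r"\bSERVICE\s+wikibase:label\b", re.IGNORECASE),
}

# String literals, IRIs and comments are blanked out in a single left-to-right
# pass so that text inside them cannot trigger a false match.
_SKIP = re.compile(
    r'"(?:\\.|[^"\\\n])*"'      # double-quoted string
    r"|'(?:\\.|[^'\\\n])*'"     # single-quoted string
    r"|<[^<>\s]*>"              # IRI reference
    r"|#[^\n]*"                 # comment to end of line
)


def vendor_features(query: str) -> list[str]:
    """Return the names of the deny-listed features the query appears to use."""
    stripped = _SKIP.sub(" ", query)
    return [name for name, pattern in VENDOR_PATTERNS.items() if pattern.search(stripped)]


def check(query: str) -> tuple[int, str]:
    """Proxy decision: an HTTP-like status code plus a short reason."""
    found = vendor_features(query)
    if found:
        return 400, "backend-specific features not allowed: " + ", ".join(found)
    return 200, "ok"
```

The proxy would forward the query untouched when check() returns 200 and reject it otherwise, which is exactly the trade-off noted above: portability is gained by giving up features.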

Kingsley

...

     I'm not sure I entirely understood the question, please let me
     know if my answer is missing the point.

       Have fun!

         Guillaume

     -- 
     Regards,

     Kingsley Idehen	
     Founder & CEO
     OpenLink Software
     Home Page:http://www.openlinksw.com
     Community Support:https://community.openlinksw.com
     Weblogs (Blogs):
     Company Blog:https://medium.com/openlink-software-blog
     Virtuoso Blog:https://medium.com/virtuoso-blog
     Data Access Drivers Blog:https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

     Personal Weblogs (Blogs):
     Medium Blog:https://medium.com/@kidehen
     Legacy Blogs:http://www.openlinksw.com/blog/~kidehen/
                    http://kidehen.blogspot.com

     Profile Pages:
     Pinterest:https://www.pinterest.com/kidehen/
     Quora:https://www.quora.com/profile/Kingsley-Uyi-Idehen
     Twitter:https://twitter.com/kidehen
     Google+:https://plus.google.com/+KingsleyIdehen/about
     LinkedIn:http://www.linkedin.com/in/kidehen

     Web Identities (WebID):
     Personal:http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
             :http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this

 -- 
 	*Guillaume Lederrey* (he/him)
 Engineering Manager
 Wikimedia Foundation <https://wikimediafoundation.org/>

