[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

27 Feb 2023

On 2/27/23 10:15 AM, Guillaume Lederrey wrote:
...
  On Fri, 24 Feb 2023 at 19:31, Kingsley Idehen via Wikidata
  <wikidata(a)lists.wikimedia.org> wrote:

     On 2/24/23 5:59 AM, Guillaume Lederrey wrote:
      On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen
     <kidehen(a)openlinksw.com> wrote:

         On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
          On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen
         <kidehen(a)openlinksw.com> wrote:

             On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
              On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via
             Wikidata <wikidata(a)lists.wikimedia.org> wrote:

                 On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
                 > Hello all!
                 >
                 > TL;DR: We expect to successfully complete the recent data
                 > reload on Wikidata Query Service soon, but we've
                 > encountered multiple failures related to the size of the
                 > graph, and anticipate that this issue may worsen in the
                 > future. Although we succeeded this time, we cannot
                 > guarantee that future reload attempts will be successful
                 > given the current trend of the data reload process. Thank
                 > you for your understanding and patience.
                 >
                 > Longer version:
                 >
                 > WDQS is updated from a stream of recent changes on
                 > Wikidata, with a maximum delay of ~2 minutes. This process
                 > was improved as part of the WDQS Streaming Updater project
                 > to ensure data coherence[1]. However, the update process
                 > is still imperfect and can lead to data inconsistencies in
                 > some cases[2][3]. To address this, we reload the data from
                 > dumps a few times per year to reinitialize the system from
                 > a known good state.
                 >
                 > The recent reload of data from dumps started in
                 > mid-December and was initially met with some issues
                 > related to download and instabilities in Blazegraph, the
                 > database used by WDQS[4]. Loading the data into Blazegraph
                 > takes a couple of weeks due to the size of the graph, and
                 > we had multiple attempts where the reload failed after
                 > >90% of the data had been loaded. Our understanding of the
                 > issue is that a "race condition" in Blazegraph[5], where
                 > subtle timing changes lead to corruption of the journal in
                 > some rare cases, is to blame.[6]
                 >
                 > We want to reassure you that the last reload job was
                 > successful on one of our servers. The data still needs to
                 > be copied over to all of the WDQS servers, which will take
                 > a couple of weeks, but should not bring any additional
                 > issues. However, reloading the full data from dumps is
                 > becoming more complex as the data size grows, and we
                 > wanted to let you know why the process took longer than
                 > expected. We understand that data inconsistencies can be
                 > problematic, and we appreciate your patience and
                 > understanding while we work to ensure the quality and
                 > consistency of the data on WDQS.
                 >
                 > Thank you for your continued support and understanding!
                 >
                 >
                 >     Guillaume
                 >
                 >
                 > [1] https://phabricator.wikimedia.org/T244590
                 > [2] https://phabricator.wikimedia.org/T323239
                 > [3] https://phabricator.wikimedia.org/T322869
                 > [4] https://phabricator.wikimedia.org/T323096
                 > [5] https://en.wikipedia.org/wiki/Race_condition#In_software
                 > [6] https://phabricator.wikimedia.org/T263110
                 >
                 Hi Guillaume,

                 Are there plans to decouple WDQS from the back-end
                 database? Doing that
                 provides more resilient architecture for Wikidata
                 as a whole since you
                 will be able to swap and interchange
                 SPARQL-compliant backends.
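
A minimal sketch of what "swappable SPARQL-compliant backends" could look like from a client's point of view, assuming each backend accepts a plain SPARQL 1.1 Protocol GET request. The backend names and the helper build_sparql_request are hypothetical; the query.wikidata.org URL is an assumption not stated in this thread, while the OpenLink demo URL is the one linked later in this message. Nothing is sent over the network here; the code only builds the requests.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Illustrative configuration only. The "wdqs" URL is an assumption; the
# "virtuoso-demo" URL is the demo endpoint linked later in this message.
ENDPOINTS = {
    "wdqs": "https://query.wikidata.org/sparql",
    "virtuoso-demo": "https://wikidata.demo.openlinksw.com/sparql",
}


def build_sparql_request(backend: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request for the chosen backend.

    The query text is passed through unchanged; only the base URL differs.
    """
    url = ENDPOINTS[backend] + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

# Same query, different backend: only the configured endpoint changes.
for name in ENDPOINTS:
    print(name, build_sparql_request(name, QUERY).full_url)
```

This only works unchanged when the query sticks to standard SPARQL; queries that use vendor-specific features would still need per-backend handling.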

             It depends what you mean by decoupling. The coupling
             points as I see them are:

             * update process
             * UI
             * exposed SPARQL endpoint

             The update process is mostly decoupled from the
             backend. It produces a stream of RDF updates that is
             backend independent, with a very thin Blazegraph-specific
             adapter to load the data into Blazegraph.
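
To illustrate the idea of a thin adapter over a backend-independent stream, here is a rough sketch that turns one update event into a standard SPARQL 1.1 Update request. The event shape (RdfUpdate with lists of deleted and added triples in N-Triples syntax) is purely hypothetical; the actual format of the internal WDQS stream is not described in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class RdfUpdate:
    """Hypothetical shape of one stream event: triples as N-Triples lines."""
    deleted: list[str] = field(default_factory=list)
    added: list[str] = field(default_factory=list)


def to_sparql_update(event: RdfUpdate) -> str:
    """Translate one event into a SPARQL 1.1 Update request string.

    Uses only DELETE DATA / INSERT DATA, which are part of the standard,
    so any compliant store could in principle consume the result.
    """
    parts = []
    if event.deleted:
        parts.append("DELETE DATA {\n  " + "\n  ".join(event.deleted) + "\n}")
    if event.added:
        parts.append("INSERT DATA {\n  " + "\n  ".join(event.added) + "\n}")
    return " ;\n".join(parts)


LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
Q42 = "<http://www.wikidata.org/entity/Q42>"

event = RdfUpdate(
    deleted=[f'{Q42} {LABEL} "Douglas Adam"@en .'],
    added=[f'{Q42} {LABEL} "Douglas Adams"@en .'],
)
print(to_sparql_update(event))
```

A real adapter would also have to handle blank nodes (not allowed in DELETE DATA), batching, and ordering, which this sketch ignores.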

             Does that mean that we could integrate the RDF stream
             into our setup re keeping our Wikidata instance up to
             date, for instance?

         That data stream isn't exposed publicly. There are a few
         tricky parts about the stream needing to be synchronized with
         a specific Wikidata dump that make it not entirely trivial
         to reuse outside of our internal use case. But if there is
         enough interest, we could potentially work on making that
         stream public.

         I suspect there's broad interest in this matter since it
         contributes to the overarching issue of loose-coupling re
         Wikidata's underlying infrastructure.

         For starters, offering a public stream would be very useful
         to 3rd party Wikidata hosts.

             The UI is mostly backend independent. It relies on
             Search for some features. And of course, the queries
             themselves might depend on Blazegraph-specific features.

             Can WDQS, based on what's stated above, work with a
             generic SPARQL back-end like Virtuoso, for instance? By
             that I mean dispatch SPARQL queries input by a user
             (without alteration) en route to server processing?

         The WDQS UI is managed by WMDE, so my knowledge is limited.
         Maybe someone from WMDE could jump in and add more context.
         That being said, as far as I know, pointing it to a
         different backend is just a configuration option. Feel free
         to have a look at the code
         (https://gerrit.wikimedia.org/g/wikidata/query/gui).

         I'll take a look.

         It should be reasonably easy to deploy another WDQS UI
         instance somewhere else, which points to whatever backend
         you'd like.

         Okay, I assume that in the current state it would be sending
         Blazegraph-specific SPARQL?

     Again, not my area of expertise, but I assume that the UI itself
     is issuing fairly standard SPARQL. Of course, user queries will
     use whatever they want. It does have dependencies on our Search
     interface as well, so that would have to be replicated.

     You mean WDQS has a Text Search interface component that's
     intertwined with the Query Service provided by the Wikidata SPARQL
     Endpoint?

         As a policy, we don't send traffic to any third party, so we
         will not set up such an instance.

             The exposed SPARQL endpoint is at the moment a direct
             exposition of the Blazegraph endpoint, so it does
             expose all the Blazegraph specific features and quirks.

             Is there a Query Service that's separated from the
             Blazegraph endpoint? The crux of the matter here is that
             WDQS benefits more from being loosely bound to endpoints
             than from being tightly bound to the Blazegraph endpoint.

         It depends what you mean by Query Service. My definition of
         a Query Service in this context is a SPARQL endpoint with a
         specific data set.

         Yes, but in the case of Wikidata that's a combination of both
         a SPARQL Query Service (query processor and endpoint) and
         WDQS query solution rendering services.

         That SPARQL endpoint at the moment is Blazegraph. I'm not
         entirely clear what kind of loose binding you'd like to see
         in this context. We might have different definitions of the
         same words here.

         Loose-coupling, in the context I am describing, would
         comprise the following:

         1. WDQS that can be bolted on to any SPARQL endpoint, just
         like YASGUI <https://github.com/TriplyDB/Yasgui#this>

     In this context, I would say "WDQS UI can be bolted to any SPARQL
     endpoint". In terms of SPARQL itself, that should already be
     mostly the case. I think there is a dependency on Search as well.

     As per my earlier comment, I don't quite understand what you are
     referring to regarding the Search (Free Text Querying)
     intermingling. Does this relate to SPARQL Query Patterns
     comprising literal objects? If so, WDQS should be able to
     constrain such behavior to Blazegraph instances -- by way of
     configuration that informs introspection.

 WDQS UI relies on a Search endpoint (backed by Elasticsearch) for
 autocompletion. The requirements of low latency and reasonable ranking
 are something that Elasticsearch (or another Search-oriented backend)
 does really well. But I would not expect an RDF backend to offer good
 ranking heuristics.

Virtuoso has always included text ranking as part of its native free text
indexing functionality. That said, these are back-end details that WDQS
should be loosely-bound to via configuration.

Example:

[1] Query Solution on text pattern "China" 
<https://wikidata.demo.openlinksw.com/sparql?default-graph-uri=&query=+++++select+%3Fs1+as+%3Fc1%2C+%28bif%3Asearch_excerpt+%28bif%3Avector+%28%27China%27%29%2C+%3Fo1%29%29+as+%3Fc2%2C+%3Fsc%2C+%3Frank%2C+%3Fg+where+%7B%7B%7B+select+%3Fs1%2C+%28%3Fsc+*+3e-1%29+as+%3Fsc%2C+%3Fo1%2C+%28sql%3Arnk_scale+%28%3CLONG%3A%3AIRI_RANK%3E+%28%3Fs1%29%29%29+as+%3Frank%2C+%3Fg+where++%0D%0A++%7B+%0D%0A++++quad+map+virtrdf%3ADefaultQuadMap+%0D%0A++++%7B+%0D%0A++++++graph+%3Fg+%0D%0A++++++%7B+%0D%0A+++++++++%3Fs1+%3Fs1textp+%3Fo1+.%0D%0A++++++++%3Fo1+bif%3Acontains++%27%22China%22%27++option+%28score+%3Fsc%29++.%0D%0A++++++++%0D%0A++++++%7D%0D%0A+++++%7D%0D%0A++++%0D%0A++%7D%0D%0A+order+by+desc+%28%3Fsc+*+3e-1+%2B+sql%3Arnk_scale+%28%3CLONG%3A%3AIRI_RANK%3E+%28%3Fs1%29%29%29++limit+50++offset+0+%7D%7D%7D+&format=text%2Fx-html%2Btr&timeout=30000&signal_void=on>

[2] SPARQL Query Definition that includes Text Ranking 
<https://wikidata.demo.openlinksw.com/sparql?default-graph-uri=&qtxt=+++++select+%3Fs1+as+%3Fc1%2C+%28bif%3Asearch_excerpt+%28bif%3Avector+%28%27China%27%29%2C+%3Fo1%29%29+as+%3Fc2%2C+%3Fsc%2C+%3Frank%2C+%3Fg+where+%7B%7B%7B+select+%3Fs1%2C+%28%3Fsc+*+3e-1%29+as+%3Fsc%2C+%3Fo1%2C+%28sql%3Arnk_scale+%28%3CLONG%3A%3AIRI_RANK%3E+%28%3Fs1%29%29%29+as+%3Frank%2C+%3Fg+where++%0D%0A++%7B+%0D%0A++++quad+map+virtrdf%3ADefaultQuadMap+%0D%0A++++%7B+%0D%0A++++++graph+%3Fg+%0D%0A++++++%7B+%0D%0A+++++++++%3Fs1+%3Fs1textp+%3Fo1+.%0D%0A++++++++%3Fo1+bif%3Acontains++%27%22China%22%27++option+%28score+%3Fsc%29++.%0D%0A++++++++%0D%0A++++++%7D%0D%0A+++++%7D%0D%0A++++%0D%0A++%7D%0D%0A+order+by+desc+%28%3Fsc+*+3e-1+%2B+sql%3Arnk_scale+%28%3CLONG%3A%3AIRI_RANK%3E+%28%3Fs1%29%29%29++limit+50++offset+0+%7D%7D%7D+&format=text%2Fx-html%2Btr&timeout=30000&signal_void=on>
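
As a rough illustration of binding to such back-end details "via configuration", the sketch below keeps the text-match clause in a per-backend table and leaves the rest of the query untouched. The bif:contains pattern is the one that appears (URL-encoded) in the two links above; the "generic" entry, the table layout, and the text_query helper are assumptions made for this example, not anything WDQS currently implements.

```python
# Hypothetical per-backend configuration: only the text-match clause differs.
TEXT_MATCH = {
    # Virtuoso's free-text predicate, as used in the query links above.
    "virtuoso": "?o bif:contains '\"{term}\"' .",
    # Portable fallback using only standard SPARQL 1.1 string functions
    # (no ranking, and likely much slower on a large graph).
    "generic": 'FILTER(CONTAINS(LCASE(STR(?o)), LCASE("{term}")))',
}


def text_query(backend: str, term: str, limit: int = 50) -> str:
    """Build a simple text-search query for the configured backend."""
    # Keep the sketch safe: refuse terms that could break out of the literal.
    if '"' in term or "'" in term or "\\" in term:
        raise ValueError("term must not contain quotes or backslashes")
    clause = TEXT_MATCH[backend].format(term=term)
    return f"SELECT ?s ?o WHERE {{ ?s ?p ?o . {clause} }} LIMIT {limit}"


for backend in TEXT_MATCH:
    print(backend, "->", text_query(backend, "China"))
```

The point of the design is that the UI would only ever ask for "text search on this term"; which clause is emitted is decided by configuration rather than hard-coded for one store.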

The key thing here is to decouple WDQS such that it can work with other
back-ends en route to a much more resilient federation of Wikidata
Knowledge Graph instances.

There's too much Blazegraph specificity in place right now.

-- 
Regards,

Kingsley Idehen	
Founder & CEO
OpenLink Software
Home Page:http://www.openlinksw.com
Community Support:https://community.openlinksw.com
Weblogs (Blogs):
Company Blog:https://medium.com/openlink-software-blog
Virtuoso Blog:https://medium.com/virtuoso-blog
Data Access Drivers Blog:https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog:https://medium.com/@kidehen
Legacy Blogs:http://www.openlinksw.com/blog/~kidehen/
               http://kidehen.blogspot.com

Profile Pages:
Pinterest:https://www.pinterest.com/kidehen/
Quora:https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter:https://twitter.com/kidehen
Google+:https://plus.google.com/+KingsleyIdehen/about
LinkedIn:http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal:http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
         :http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
