Hello all!
TL;DR: We expect to successfully complete the recent data reload on Wikidata Query Service soon, but we've encountered multiple failures related to the size of the graph, and we anticipate that this issue may worsen in the future. Although we succeeded this time, we cannot guarantee that future reload attempts will succeed, given the current trend of the data reload process. Thank you for your understanding and patience.
Longer version:
WDQS is updated from a stream of recent changes on Wikidata, with a maximum delay of ~2 minutes. This process was improved as part of the WDQS Streaming Updater project to ensure data coherence[1]. However, the update process is still imperfect and can lead to data inconsistencies in some cases[2][3]. To address this, we reload the data from dumps a few times per year to reinitialize the system from a known good state.
The recent reload of data from dumps started in mid-December and was initially met with issues related to downloading the dumps and with instabilities in Blazegraph, the database used by WDQS[4]. Loading the data into Blazegraph takes a couple of weeks due to the size of the graph, and we had multiple attempts where the reload failed after >90% of the data had been loaded. Our understanding is that a "race condition" in Blazegraph[5], where subtle timing changes lead to corruption of the journal in some rare cases, is to blame[6].
We want to reassure you that the last reload job was successful on one of our servers. The data still needs to be copied over to all of the WDQS servers, which will take a couple of weeks, but should not bring any additional issues. However, reloading the full data from dumps is becoming more complex as the data size grows, and we wanted to let you know why the process took longer than expected. We understand that data inconsistencies can be problematic, and we appreciate your patience and understanding while we work to ensure the quality and consistency of the data on WDQS.
Thank you for your continued support and understanding!
Guillaume
[1] https://phabricator.wikimedia.org/T244590 [2] https://phabricator.wikimedia.org/T323239 [3] https://phabricator.wikimedia.org/T322869 [4] https://phabricator.wikimedia.org/T323096 [5] https://en.wikipedia.org/wiki/Race_condition#In_software [6] https://phabricator.wikimedia.org/T263110
On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
[...]
Hi Guillaume,
Are there plans to decouple WDQS from the back-end database? Doing that provides a more resilient architecture for Wikidata as a whole, since you would be able to swap and interchange SPARQL-compliant backends.
BTW -- we are going to offer AWS and even Azure hosted instances (on a PAGO basis) of our Virtuoso-hosted edition of Wikidata (which we recently reloaded).
On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata < wikidata@lists.wikimedia.org> wrote:
[...]
Hi Guillaume,
Are there plans to decouple WDQS from the back-end database? Doing that provides a more resilient architecture for Wikidata as a whole, since you would be able to swap and interchange SPARQL-compliant backends.
It depends on what you mean by decoupling. The coupling points, as I see them, are:
- update process
- UI
- exposed SPARQL endpoint
The update process is mostly decoupled from the backend. It produces a stream of RDF updates that is backend independent, with a very thin Blazegraph-specific adapter to load the data into Blazegraph.
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint that would filter specific SPARQL features, so that we limit what is available to a standard set of features supported across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I'm not sure I entirely understood the question, please let me know if my answer is missing the point.
Have fun!
Guillaume
BTW -- we are going to make AWS and even Azure hosted instances (offered on a PAGO basis) of our Virtuoso-hosted edition of Wikidata (which we recently reloaded).
-- Regards,
Kingsley Idehen Founder & CEO OpenLink Software Home Page: http://www.openlinksw.com Community Support: https://community.openlinksw.com Weblogs (Blogs): Company Blog: https://medium.com/openlink-software-blog Virtuoso Blog: https://medium.com/virtuoso-blog Data Access Drivers Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
Personal Weblogs (Blogs): Medium Blog: https://medium.com/@kidehen Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/ http://kidehen.blogspot.com
Profile Pages: Pinterest: https://www.pinterest.com/kidehen/ Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen Twitter: https://twitter.com/kidehen Google+: https://plus.google.com/+KingsleyIdehen/about LinkedIn: http://www.linkedin.com/in/kidehen
Web Identities (WebID): Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
[...]
It depends on what you mean by decoupling. The coupling points, as I see them, are:
- update process
- UI
- exposed SPARQL endpoint
The update process is mostly decoupled from the backend. It produces a stream of RDF updates that is backend independent, with a very thin Blazegraph-specific adapter to load the data into Blazegraph.
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing?
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I'm not sure I entirely understood the question, please let me know if my answer is missing the point.
Have fun!
Guillaume
On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata wrote:
On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
In particular I would highlight **named subqueries** and **Blazegraph's bd:sample service** as two "features and quirks" which should not be suppressed lightly.
Use of named subqueries (ie queries that include an "INCLUDE %subquery" line) is consistently popular in the "query of the week" example queries featured in the weekly summary, and for good reasons:
* they can make complex long queries far more readable
* they can make optimisation of complex long queries a lot easier and a lot more transparent (or even possible at all)
* they can be essential to the performance of some queries, if there is a particular retrieved set that those queries then recall to reuse in more than one way.
The Blazegraph syntax for this is elegant. Ideally the dev teams of candidate replacements should be encouraged to support it. Failing that, at the very least a preprocessor should be written to suitably adapt queries with an INCLUDE directive, so that existing queries can continue to run.
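To make that concrete, here is a minimal sketch of the named-subquery form (the query itself is hypothetical, counting humans by country of citizenship):

    # The named set %people is computed once, then INCLUDEd where needed
    SELECT ?country (COUNT(?person) AS ?count)
    WITH {
      SELECT ?person ?country WHERE {
        ?person wdt:P31 wd:Q5 ;      # instance of: human
                wdt:P27 ?country .   # country of citizenship
      }
    } AS %people
    WHERE {
      INCLUDE %people
    }
    GROUP BY ?country

The same %people set can be INCLUDEd at several points in a query, while being computed only once.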
In contrast, bd:sample is perhaps under-used and under-appreciated and not so well known, but can also be very valuable.
It allows a query writer to get a genuinely random sampling of the usage of a particular triple.
For example, here's a query https://w.wiki/6NHo that I was asked for recently, that finds the most common classes of items used as values for P180 'depicts' statements on Commons.
Sampling is essential here because there are now in excess of 19.8 million P180 statements on Commons -- and it becomes even more so because of the federated nature of the query, which means that at most a few tens of thousands of data items can be passed for analysis into any subquery to be run on WDQS against Wikidata.
A feature like bd:sample is the only way to be able to do this kind of analysis of structured data statements on Commons.
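As a simplified (and deliberately unfederated) sketch, a sampled version of that kind of analysis might look something like this -- the parameters follow Blazegraph's sampling service as I understand it, but treat the exact names and limits as illustrative:

    # Randomly sample ~10,000 'depicts' statements, then tally the
    # classes of the depicted items
    SELECT ?class (COUNT(?item) AS ?uses) WHERE {
      SERVICE bd:sample {
        ?file wdt:P180 ?item .
        bd:serviceParam bd:sample.limit 10000 .
        bd:serviceParam bd:sample.sampleType "RANDOM" .
      }
      ?item wdt:P31 ?class .
    }
    GROUP BY ?class
    ORDER BY DESC(?uses)
    LIMIT 50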
I regard named subqueries and bd:sample as particularly important. But beyond them, we need to make sure that any 'filter' does not remove Blazegraph optimiser directives: if those don't get through to Blazegraph, many queries that rely on them simply will not run (especially if named subqueries have also been made unavailable).
Ways also need to be found to make sure that the geographical services wikibase:around() and wikibase:box(), the distance function geof:distance(), and the mwapi and labelling services continue to be available.
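As a reminder of what those services look like in practice, here is a small sketch combining two of them (the search term and language are arbitrary):

    # Find items matching a full-text search, with English labels attached
    SELECT ?item ?itemLabel WHERE {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org" .
        bd:serviceParam wikibase:api "EntitySearch" .
        bd:serviceParam mwapi:search "university" .
        bd:serviceParam mwapi:language "en" .
        ?item wikibase:apiOutputItem mwapi:item .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10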
Best regards,
James.
On 2/23/23 12:19, James Heald wrote:
[...]
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
I agree that some of Blazegraph's extensions to SPARQL are useful, particularly for me the ability to easily access Wikidata labels in my language.
But Blazegraph appears to be unmaintained. The team that developed Blazegraph does not appear to be in a position to help fix problems in it, and no one else appears to be interested in fixing them. Errors and other issues with Blazegraph are negatively affecting WDQS. That's not a good state of affairs.
In my opinion the WDQS should be trying to get off Blazegraph.
peter
On 2/23/23 12:19 PM, James Heald wrote:
[...]
Hi James,
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
That isn't justification for tightly-coupling a Query Tool to a Query Service Endpoint, especially when an open standard (in the form of SPARQL) exists.
In particular I would highlight **named subqueries** and **Blazegraph's bd:sample service** as two "features and quirks" which should not be suppressed lightly.
See my comment above.
Use of named subqueries (ie queries that include an "INCLUDE %subquery" line) is consistently popular in the "query of the week" example queries featured in the weekly summary, and for good reasons:
- they can make complex long queries far more readable
- they can make optimisation of complex long queries a lot easier and a lot more transparent (or even possible at all)
- they can be essential to the performance of some queries, if there is a particular retrieved set that those queries then recall to reuse in more than one way.
The Blazegraph syntax for this is elegant.
See my comments above, which are about architecture fundamentals and the virtues of loose-coupling.
[...]
Please digest my comments above, since they have nothing to do with how Blazegraph implements its Query Service Endpoint :)
On 23/02/2023 20:08, Kingsley Idehen via Wikidata wrote:
On 2/23/23 12:19 PM, James Heald wrote:
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
That isn't justification for tightly-coupling a Query Tool to a Query Service Endpoint, especially when an open standard (in the form of SPARQL) exists.
Of course it's a good thing to be able to swap out the back-end and to be able to run essentially the same queries against other realisations of the database.
It's also a good thing to be able to clone the user interface and use essentially the same UI with a different back-end. (As I understand it, this should be very possible).
But. There are features which have been listed in the desiderata for WDQS from the very start, that go beyond what the out-of-the-box SPARQL 1.1 standard offers.
Most notable among these is the ability to retrieve items with coordinates close to a particular point on the earth's surface. (Something which, as the Blazegraph developers discovered, can be implemented fairly easily if you add a "Z-order curve" index on coordinate values: https://en.wikipedia.org/wiki/Z-order_curve)
Not all users will have an interest in geographical objects. Those who don't will lose little if they hook up a back-end that doesn't provide this, because presumably they won't be running queries which require it. But those who do need this functionality need this indexing.
Given that this was something the Blazegraph developers (all 3 of them) found they could add relatively easily; and given that it seems to me that any database back-end would gain considerable cachet by being able to run wikidata queries, it seems to me not unreasonable to approach potential alternative back-ends and see how easily they too might be able to add a Z-order curve index for coordinate values, plus basic functionality to make use of it. (Where wikibase:box and wikibase:around are about as basic as it gets).
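For reference, the basic usage an alternative back-end would need to support is about as simple as it sounds (the centre point and radius here are arbitrary):

    # Items with coordinates within 10 km of a given point
    SELECT ?place ?location WHERE {
      SERVICE wikibase:around {
        ?place wdt:P625 ?location .
        bd:serviceParam wikibase:center "Point(2.35 48.85)"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "10" .
      }
    }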
Andrea suggested a more GeoSPARQL-orientated solution ( https://wikitech.wikimedia.org/wiki/User:AndreaWest/Blazegraph_Features_and_... ), but that seems to me a much much bigger ask; I do suspect that (for almost all contending projects) the simple wikibase:box and wikibase:around services would be a lot more easily implemented, to free us from our tight-coupling to Blazegraph, yet still provide this functionality, which I do believe is a needed requirement.
As for named subqueries, as well as making queries much more readable, IMO they may be particularly valuable as a way to specify particular optimisations (ie the sequencing of query execution, which may be absolutely *crucial* if a query is to run) in a particularly readable and **portable** way -- certainly when compared to optimiser "hint" syntaxes, which may be tied *very* specifically to a particular back-end.
Why do I think named subqueries are so portable, if they are not part of the SPARQL 1.1 standard, and most providers don't support them ?
The answer is because if necessary it would require only a fairly simple pre-processor script to turn them into inline sub-queries, which *are* supported by the standard.
Named sub-queries have the advantage, though, of making the query a lot more readable; and they can be useful to indicate to the back-end that the sub-query need only be retrieved once, rather than repeatedly each time it is referenced (which may be helpful for some back-ends).
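To illustrate how mechanical that rewrite is: the hypothetical named subquery sketched earlier in the thread, declared as "WITH { SELECT ... } AS %people" and then "INCLUDE %people", would simply become a standard inline subquery:

    # Standard SPARQL 1.1 form: the subquery body is pasted inline at
    # each point where the named set was INCLUDEd
    SELECT ?country (COUNT(?person) AS ?count)
    WHERE {
      {
        SELECT ?person ?country WHERE {
          ?person wdt:P31 wd:Q5 ;
                  wdt:P27 ?country .
        }
      }
    }
    GROUP BY ?country

If the named set is INCLUDEd more than once, the pre-processor would paste the subquery at each point of use -- losing the compute-once hint, but preserving the results.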
So: I don't disagree that it would be useful if WDQS was less tightly dependent on Blazegraph.
But: rather than going straight to removing good features, I think there is a lot of scope for seeing whether the dev teams for other back-ends could be persuaded to match the features on those back-ends without too much difficulty; and that this would be a better path to at least investigate, in preference to breaking swathes of queries that are in active use.
-- James.
On 2/23/23 4:17 PM, James Heald wrote:
[...]
Of course it's a good thing to be able to swap out the back-end and to be able to run essentially the same queries against other realisations of the database.
It's also a good thing to be able to clone the user interface and use essentially the same UI with a different back-end. (As I understand it, this should be very possible).
Good to hear, since that's my fundamental point re loosely-coupled architecture enabled by open standards.
But. There are features which have been listed in the desiderata for WDQS from the very start, that go beyond what the out-of-the-box SPARQL 1.1 standard offers.
Therein lies the problem. A standards-based client can include extensions for a specific back-end in configurable form, based on loose-coupling principles. Doing it otherwise is what's generally known as a leaky abstraction, which ultimately racks up technical debt.
An example of technical debt that's manifesting right now is an inability to diffuse the costs of the Wikidata Knowledge Graph across a federation of SPARQL query service providers. This doesn't have to be the case at all, bearing in mind the nature of SPARQL and structured data represented using RDF.
Most notable among these is the ability to retrieve items with coordinates close to a particular point on the earth's surface. (Something which, as the Blazegraph developers discovered, can be implemented fairly easily if you add a "Z-order curve" index on coordinate values https://en.wikipedia.org/wiki/Z-order_curve ).
None of that would be lost in a WDQS instance configured to discover the SPARQL query endpoint and associated capabilities.
Not all users will have an interest in geographical objects. Those who don't will lose little if they hook up a back-end that doesn't provide this, because presumably they won't be running queries which require it. But those who do need this functionality need this indexing.
See my comment above.
[...]
These implementation details aren't really relevant to the fundamental point I am trying to make about the virtues of loosely-coupled architecture facilitated by existing open standards (e.g., SPARQL).
So: I don't disagree that it would be useful if WDQS was less tightly dependent on Blazegraph.
But: rather than going straight to removing good features, I think there is a lot of scope for seeing whether the dev teams for other back-ends could be persuaded to match the features on those back-ends without too much difficulty; and that this would be a better path to at least investigate, in preference to breaking swathes of queries that are in active use.
Nothing I've said amounts to feature removal. Everything I've said is simply about loosely-coupled architecture as a guiding principle for making WDQS usable against other SPARQL endpoints :)
Kingsley
On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen kidehen@openlinksw.com wrote:
[...]
The update process is mostly decoupled from the backend. It produces a stream of RDF updates that is backend independent, with a very thin Blazegraph-specific adapter to load the data into Blazegraph.
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
That data stream isn't exposed publicly. There are a few tricky parts about the stream needing to be synchronized with a specific Wikidata dump, which make it not entirely trivial to reuse outside of our internal use case. But if there is enough interest, we could potentially work on making that stream public.
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing?
The WDQS UI is managed by WMDE, so my knowledge of it is limited. Maybe someone from WMDE could jump in and add more context. That being said, as far as I know, pointing it to a different backend is just a configuration option. Feel free to have a look at the code (https://gerrit.wikimedia.org/g/wikidata/query/gui). It should be reasonably easy to deploy another WDQS UI instance somewhere else, pointing to whatever backend you'd like.
As a policy, we don't send traffic to any third party, so we will not set up such an instance.
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
It depends what you mean by Query Service. My definition of a Query Service in this context is a SPARQL endpoint with a specific data set. That SPARQL endpoint at the moment is Blazegraph. I'm not entirely clear what kind of loose binding you'd like to see in this context. We might have different definitions of the same words here.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I'm not sure I entirely understood the question, please let me know if my answer is missing the point.
Have fun!
Guillaume
On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
[...]
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
That data stream isn't exposed publicly. There are a few tricky parts about the stream needing to be synchronized with a specific Wikidata dump, which make it not entirely trivial to reuse outside of our internal use case. But if there is enough interest, we could potentially work on making that stream public.
I suspect there's broad interest in this matter since it contributes to the overarching issue of loose-coupling re Wikidata's underlying infrastructure.
For starters, offering a public stream would be very useful to 3rd party Wikidata hosts.
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing?
The WDQS UI is managed by WMDE, so my knowledge of it is limited. Maybe someone from WMDE could jump in and add more context. That being said, as far as I know, pointing it to a different backend is just a configuration option. Feel free to have a look at the code (https://gerrit.wikimedia.org/g/wikidata/query/gui).
I'll take a look.
It should be reasonably easy to deploy another WDQS UI instance somewhere else, which points to whatever backend you'd like.
Okay, I assume that in the current state it would be sending Blazegraph-specific SPARQL?
As a policy, we don't send traffic to any third party, so we will not set up such an instance.
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
It depends what you mean by Query Service. My definition of a Query Service in this context is a SPARQL endpoint with a specific data set.
Yes, but in the case of Wikidata that's a combination of a SPARQL Query Service (query processor and endpoint) and the WDQS query-solution rendering services.
That SPARQL endpoint at the moment is Blazegraph. I'm not entirely clear what kind of loose binding you'd like to see in this context. We might have different definitions of the same words here.
Loose-coupling, in the context I am describing, would comprise the following:
1. WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI https://github.com/TriplyDB/Yasgui#this
2. Near real-time data streams usable by 3rd Party Wikidata hosts
With the above in place, the cost and burden associated with Wikidata hosting will also be reduced -- courtesy of federation.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
As you've stated, that's narrowing service focus rather than diffusing service burden :)
Kingsley
On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen kidehen@openlinksw.com wrote:
[...]
It should be reasonably easy to deploy another WDQS UI instance somewhere else, which points to whatever backend you'd like.
Okay, I assume that in the current state it would be sending Blazegraph-specific SPARQL?
Again, not my area of expertise, but I assume that the UI itself issues fairly standard SPARQL. Of course, user queries will use whatever they want. It does have dependencies on our Search interface as well, so that would have to be replicated.
[...]
Loose-coupling, in the context I am describing, would comprise the following:
- WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI
In this context, I would say "the WDQS UI can be bolted onto any SPARQL endpoint". In terms of SPARQL itself, that should already be mostly the case. I think there is a dependency on Search as well.
- Near real-time data streams usable by 3rd Party Wikidata hosts
With the above in place, the cost and burden associated with Wikidata hosting will also be reduced -- courtesy of federation.
Could you please open a Phabricator task to document what you would like to see exposed and why it would be useful?
-- *Guillaume Lederrey* (he/him) Engineering Manager Wikimedia Foundation https://wikimediafoundation.org/
On 2/24/23 5:59 AM, Guillaume Lederrey wrote:
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
That data stream isn't exposed publicly. There are a few tricky parts about the stream needing to be synchronized with a specific Wikidata dump that make it not entirely trivial to reuse outside of our internal use case. But if there is enough interest, we could potentially work on making that stream public.
I suspect there's broad interest in this matter since it contributes to the overarching issue of loose-coupling re Wikidata's underlying infrastructure. For starters, offering a public stream would be very useful to 3rd party Wikidata hosts.
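For a feel of what consuming such a stream would look like: the internal RDF mutation stream is not public, but the public Wikimedia EventStreams recentchange feed has roughly the same shape, so the sketch below uses it as a stand-in. SSE handling is simplified; a real consumer would translate each change into RDF triples to add/remove, as the internal streaming updater does.

    # Stand-in consumer using the public EventStreams recentchange feed.
    import json
    from urllib.request import Request, urlopen

    URL = "https://stream.wikimedia.org/v2/stream/recentchange"

    req = Request(URL, headers={"Accept": "text/event-stream"})
    with urlopen(req) as stream:
        for raw in stream:
            line = raw.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue  # skip SSE comments, event names, keep-alives
            event = json.loads(line[len("data:"):])
            if event.get("wiki") == "wikidatawiki":
                # A real consumer would turn this into RDF adds/removes here.
                print(event["title"], event["type"])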
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing? The WDQS UI is managed by WMDE, so my knowledge of it is limited. Maybe someone from WMDE could jump in and add more context. That being said, as far as I know, pointing it to a different backend is just a configuration option. Feel free to have a look at the code (https://gerrit.wikimedia.org/g/wikidata/query/gui).
I'll take a look.
It should be reasonably easy to deploy another WDQS UI instance somewhere else, which points to whatever backend you'd like.
Okay, I assume that in the current state it would be sending Blazegraph-specific SPARQL?
Again, not my area of expertise, but I assume that the UI itself is issuing fairly standard SPARQL. Of course, user queries will use whatever they want. It does have dependencies on our Search interface as well, so that would have to be replicated.
You mean WDQS has a Text Search interface component that's intertwined with the Query Service provided by the Wikidata SPARQL Endpoint?
As a policy, we don't send traffic to any third party, so we will not set up such an instance.
The exposed SPARQL endpoint is at the moment a direct exposure of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
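For illustration, here is the kind of query that runs on WDQS today but would not port unmodified to a generic SPARQL endpoint, because it leans on Blazegraph-only query hints and the WDQS-specific label service (a sketch; the User-Agent string is made up, and the wd:/wdt:/wikibase:/bd: prefixes are predeclared by WDQS):

    # A WDQS query that is coupled to Blazegraph extensions.
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    QUERY = """
    PREFIX hint: <http://www.bigdata.com/queryHints#>
    SELECT ?item ?itemLabel WHERE {
      hint:Query hint:optimizer "None" .   # Blazegraph-only optimizer hint
      ?item wdt:P31 wd:Q146 .              # instance of: house cat
      SERVICE wikibase:label {             # WDQS-specific label service
        bd:serviceParam wikibase:language "en" .
      }
    } LIMIT 5
    """

    url = "https://query.wikidata.org/sparql?" + urlencode({"query": QUERY})
    req = Request(url, headers={"Accept": "application/sparql-results+json",
                                "User-Agent": "coupling-demo/0.1"})
    print(urlopen(req).read()[:500])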
As per my earlier comment, I don't quite understand what you are referring to regarding the Search (Free Text Querying) intermingling. Does this relate to SPARQL Query Patterns comprising literal objects? If so, WDQS should be able to constrain such behavior to Blazegraph instances -- by way of configuration that informs introspection.
2. Near real-time data streams usable by 3rd Party Wikidata hosts
With the above in place, the cost and burden associated with Wikidata hosting will also be reduced -- courtesy of federation.
Could you please open a Phabricator task to document what you would like to see exposed and why it would be useful?
Okay, when I (or someone else) get a moment.
On Fri, 24 Feb 2023 at 19:31, Kingsley Idehen via Wikidata < wikidata@lists.wikimedia.org> wrote:
As per my earlier comment, I don't quite understand what you are referring to regarding the Search (Free Text Querying) intermingling. Does this relate to SPARQL Query Patterns comprising literal objects? If so, WDQS should be able to constrain such behavior to Blazegraph instances -- by way of configuration that informs introspection.
WDQS UI relies on a Search endpoint (backed by Elasticsearch) for autocompletion. The requirements of low latency and reasonable ranking are something that Elasticsearch (or another search-oriented backend) does really well. But I would not expect an RDF backend to offer good ranking heuristics.
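As far as I know, that autocompletion goes through the Wikibase search API (the wbsearchentities module) rather than through SPARQL, which is the Search dependency mentioned above. A minimal sketch of that kind of call (the search term and User-Agent are just examples):

    # Entity autocompletion via the Wikibase search API.
    import json
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    params = urlencode({
        "action": "wbsearchentities",  # Wikibase entity search module
        "search": "dougl",             # the user's partial input
        "language": "en",
        "format": "json",
    })
    req = Request("https://www.wikidata.org/w/api.php?" + params,
                  headers={"User-Agent": "autocomplete-demo/0.1"})
    for hit in json.load(urlopen(req))["search"]:
        print(hit["id"], hit.get("label"), "-", hit.get("description", ""))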
On 2/27/23 10:15 AM, Guillaume Lederrey wrote:
WDQS UI relies on a Search endpoint (backed by Elasticsearch) for autocompletion. The requirements of low latency and reasonable ranking are something that Elasticsearch (or another search-oriented backend) does really well. But I would not expect an RDF backend to offer good ranking heuristics.
Virtuoso has always included text ranking as part of its native free-text indexing functionality. That said, these are back-end details that WDQS should be loosely bound to via configuration.
Example:
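Something along these lines -- a rough sketch of Virtuoso's free-text scoring via its bif:contains predicate with an OPTION (score ...) clause. The endpoint URL below is a placeholder for any Virtuoso-hosted Wikidata instance:

    # Virtuoso free-text search with relevance ranking (sketch).
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    QUERY = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?o ?sc WHERE {
      ?s rdfs:label ?o .
      ?o bif:contains '"query service"' OPTION (score ?sc) .
    }
    ORDER BY DESC(?sc)
    LIMIT 10
    """

    url = "http://example.org/sparql?" + urlencode({"query": QUERY})  # placeholder host
    req = Request(url, headers={"Accept": "application/sparql-results+json"})
    print(urlopen(req).read()[:500])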
The key thing here is to decouple WDQS such that it can work with other back-ends, en route to a much more resilient federation of Wikidata Knowledge Graph instances.
There's too much Blazegraph specificity in place right now.
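The decoupling argument in practice: the same standard SPARQL 1.1 query sent over the standard protocol to two different backends. Only query.wikidata.org is real below; the second URL stands in for a hypothetical third-party Wikidata host, and the query declares its own prefixes so it assumes nothing backend-specific:

    # Backend-agnostic query over the standard SPARQL 1.1 protocol.
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    QUERY = """PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT ?p ?o WHERE { wd:Q42 ?p ?o } LIMIT 5"""

    ENDPOINTS = [
        "https://query.wikidata.org/sparql",   # Blazegraph-backed WDQS
        "http://example.org/wikidata/sparql",  # hypothetical Virtuoso host
    ]

    for endpoint in ENDPOINTS:
        req = Request(endpoint + "?" + urlencode({"query": QUERY}),
                      headers={"Accept": "application/sparql-results+json",
                               "User-Agent": "portability-demo/0.1"})
        try:
            print(endpoint, "->", urlopen(req, timeout=30).read()[:200])
        except OSError as err:  # the hypothetical host will fail
            print(endpoint, "->", err)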
On 2/24/23 2:25 PM, Samuel Klein wrote:
This is an important topic. Let's migrate off of Blazegraph.
No, really: what's the status of WDQS backend updates (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update), like risk projections and timelines for migration? [1]
[1] https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/thread/B4GTI6TDEKS7Q2OMJR26XWLFYMUXSR6F/#YORJB4AYYSUFSYM7H3VTZSOZBC4GTEOZ
Guillaume Lederrey glederrey@wikimedia.org wrote:
2. Near real-time data streams usable by 3rd Party Wikidata hosts
Could you please open a Phabricator task to document what you would like to see exposed and why it would be useful?
I started a ticket: https://phabricator.wikimedia.org/T330521
Anyone interested, please edit as needed.
Hi Samuel,
Thanks for opening up that ticket!
Kingsley
Hi Guillaume,
Which file system is used with Blazegraph -- NFS, Ext4, etc.? Specifically, which file system is used where the journal files are written and read? [1] Looking at the code, it seems there could be cases where unreported errors happen around file locking.
[1] https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/...
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Wed, 22 Feb 2023 at 04:45, Thad Guidry thadguidry@gmail.com wrote:
Hi Guillaume,
Which file system is used with Blazegraph -- NFS, Ext4, etc.? Specifically, which file system is used where the journal files are written and read? [1] Looking at the code, it seems there could be cases where unreported errors happen around file locking.
We are using Ext4. I don't understand enough about the Blazegraph internals to know if that might be an issue or not. But given your question, I assume that the locking issues are probably more related to running on NFS.
[1] https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/...
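To illustrate the class of problem being pointed at here -- not Blazegraph's actual code, which is Java and presumably uses java.nio file locks -- a POSIX-only sketch: advisory flock() locks are reliably enforced between local processes on a file system like Ext4, while over NFS the same call has historically come with weaker guarantees, which is exactly why the file system question matters.

    # Advisory file locking, as enforced on a local filesystem (POSIX-only).
    import fcntl

    with open("/tmp/journal.lock", "w") as lockfile:
        # Take an exclusive, non-blocking advisory lock on the lock file.
        # A second process doing the same gets BlockingIOError on Ext4;
        # over NFS this call is emulated and has weaker guarantees.
        fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("lock acquired; safe to write the journal")
        # ... write journal records here ...
        fcntl.flock(lockfile, fcntl.LOCK_UN)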
-- *Guillaume Lederrey* (he/him) Engineering Manager Wikimedia Foundation https://wikimediafoundation.org/