Hello all!
TL;DR: We expect to successfully complete the recent data reload on Wikidata Query Service soon, but we've encountered multiple failures related to the size of the graph, and we anticipate that this issue may worsen in the future. Although we succeeded this time, we cannot guarantee that future reload attempts will succeed, given the current trend of the data reload process. Thank you for your understanding and patience.
Longer version:
WDQS is updated from a stream of recent changes on Wikidata, with a maximum delay of ~2 minutes. This process was improved as part of the WDQS Streaming Updater project to ensure data coherence[1]. However, the update process is still imperfect and can lead to data inconsistencies in some cases[2][3]. To address this, we reload the data from dumps a few times per year to reinitialize the system from a known good state.
The recent reload of data from dumps started in mid-December and was initially met with issues related to downloading the dumps and with instabilities in Blazegraph, the database used by WDQS[4]. Loading the data into Blazegraph takes a couple of weeks due to the size of the graph, and we had multiple attempts where the reload failed after >90% of the data had been loaded. Our understanding is that a "race condition" in Blazegraph[5], where subtle timing changes lead to corruption of the journal in some rare cases, is to blame[6].
We want to reassure you that the last reload job was successful on one of our servers. The data still needs to be copied over to all of the WDQS servers, which will take a couple of weeks, but should not bring any additional issues. However, reloading the full data from dumps is becoming more complex as the data size grows, and we wanted to let you know why the process took longer than expected. We understand that data inconsistencies can be problematic, and we appreciate your patience and understanding while we work to ensure the quality and consistency of the data on WDQS.
Thank you for your continued support and understanding!
Guillaume
[1] https://phabricator.wikimedia.org/T244590 [2] https://phabricator.wikimedia.org/T323239 [3] https://phabricator.wikimedia.org/T322869 [4] https://phabricator.wikimedia.org/T323096 [5] https://en.wikipedia.org/wiki/Race_condition#In_software [6] https://phabricator.wikimedia.org/T263110
On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
[...]
Hi Guillaume,
Are there plans to decouple WDQS from the back-end database? Doing that provides a more resilient architecture for Wikidata as a whole, since you would be able to swap and interchange SPARQL-compliant backends.
BTW -- we are going to offer AWS and even Azure hosted instances (on a PAGO basis) of our Virtuoso-hosted edition of Wikidata (which we recently reloaded).
On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata < wikidata@lists.wikimedia.org> wrote:
[...]
Hi Guillaume,
Are there plans to decouple WDQS from the back-end database? Doing that provides a more resilient architecture for Wikidata as a whole, since you would be able to swap and interchange SPARQL-compliant backends.
It depends on what you mean by decoupling. The coupling points, as I see them, are:
- update process
- UI
- exposed SPARQL endpoint
The update process is mostly decoupled from the backend. It produces a stream of RDF updates that is backend independent, with a very thin Blazegraph-specific adapter to load the data into Blazegraph.
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint that would filter specific SPARQL features, so that we limit what is available to a standard set of features supported across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I'm not sure I entirely understood the question, please let me know if my answer is missing the point.
Have fun!
Guillaume
BTW -- we are going to make AWS and even Azure hosted instances (offered on a PAGO basis) of our Virtuoso-hosted edition of Wikidata (which we recently reloaded).
-- Regards,
Kingsley Idehen Founder & CEO OpenLink Software Home Page: http://www.openlinksw.com Community Support: https://community.openlinksw.com Weblogs (Blogs): Company Blog: https://medium.com/openlink-software-blog Virtuoso Blog: https://medium.com/virtuoso-blog Data Access Drivers Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
Personal Weblogs (Blogs): Medium Blog: https://medium.com/@kidehen Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/ http://kidehen.blogspot.com
Profile Pages: Pinterest: https://www.pinterest.com/kidehen/ Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen Twitter: https://twitter.com/kidehen Google+: https://plus.google.com/+KingsleyIdehen/about LinkedIn: http://www.linkedin.com/in/kidehen
Web Identities (WebID): Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this
On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
[...]
It depends on what you mean by decoupling. The coupling points, as I see them, are:
- update process
- UI
- exposed SPARQL endpoint
The update process is mostly decoupled from the backend. It produces a stream of RDF updates that is backend independent, with a very thin Blazegraph-specific adapter to load the data into Blazegraph.
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing?
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I'm not sure I entirely understood the question, please let me know if my answer is missing the point.
Have fun!
Guillaume
On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata wrote:
On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
In particular I would highlight **named subqueries** and **Blazegraph's bd:sample service** as two "features and quirks" which should not be suppressed lightly.
Use of named subqueries (ie queries that include an "INCLUDE %subquery" line) is consistently popular in the "query of the week" example queries featured in the weekly summary, and for good reasons:
* they can make complex long queries far more readable
* they can make optimisation of complex long queries a lot easier and a lot more transparent (or even possible at all)
* they can be essential to the performance of some queries, if there is a particular retrieved set that those queries then recall to reuse in more than one way.
The Blazegraph syntax for this is elegant. Ideally the dev teams of candidate replacements should be encouraged to support it. Failing that, at the very least a preprocessor should be written to suitably adapt queries with an INCLUDE directive, so that existing queries can continue to run.
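To make that concrete, here is a minimal sketch of the named-subquery form (the query itself is hypothetical, counting humans by country of citizenship):

    # The named set %people is computed once, then INCLUDEd where needed
    SELECT ?country (COUNT(?person) AS ?count)
    WITH {
      SELECT ?person ?country WHERE {
        ?person wdt:P31 wd:Q5 ;      # instance of: human
                wdt:P27 ?country .   # country of citizenship
      }
    } AS %people
    WHERE {
      INCLUDE %people
    }
    GROUP BY ?country

The same %people set can be INCLUDEd at several points in a query, while being computed only once.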
In contrast, bd:sample is perhaps under-used and under-appreciated and not so well known, but can also be very valuable.
It allows a query writer to get a genuinely random sampling of the usage of a particular triple.
For example, here's a query https://w.wiki/6NHo that I was asked for recently, that finds the most common classes of items used as values for P180 'depicts' statements on Commons.
Sampling is essential here because there are now in excess of 19.8 million P180 statements on Commons -- and it becomes even more so because of the federated nature of the query, which means that at most a few tens of thousands of data items can be passed for analysis into any subquery to be run on WDQS against Wikidata.
A feature like bd:sample is the only way to be able to do this kind of analysis of structured data statements on Commons.
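As a simplified (and deliberately unfederated) sketch, a sampled version of that kind of analysis might look something like this -- the parameters follow Blazegraph's sampling service as I understand it, but treat the exact names and limits as illustrative:

    # Randomly sample ~10,000 'depicts' statements, then tally the
    # classes of the depicted items
    SELECT ?class (COUNT(?item) AS ?uses) WHERE {
      SERVICE bd:sample {
        ?file wdt:P180 ?item .
        bd:serviceParam bd:sample.limit 10000 .
        bd:serviceParam bd:sample.sampleType "RANDOM" .
      }
      ?item wdt:P31 ?class .
    }
    GROUP BY ?class
    ORDER BY DESC(?uses)
    LIMIT 50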
I regard named subqueries and bd:sample as particularly important. But beyond them, we need to make sure that any 'filter' does not remove Blazegraph optimiser directives: if those don't get through to Blazegraph, many queries that rely on them simply will not run (especially if named subqueries have also been made unavailable).
Ways also need to be found to make sure that the geographical services wikibase:around() and wikibase:box(), the distance function geof:distance(), and the mwapi and labelling services continue to be available.
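As a reminder of what those services look like in practice, here is a small sketch combining two of them (the search term and language are arbitrary):

    # Find items matching a full-text search, with English labels attached
    SELECT ?item ?itemLabel WHERE {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org" .
        bd:serviceParam wikibase:api "EntitySearch" .
        bd:serviceParam mwapi:search "university" .
        bd:serviceParam mwapi:language "en" .
        ?item wikibase:apiOutputItem mwapi:item .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10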
Best regards,
James.
On 2/23/23 12:19, James Heald wrote:
[...]
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
I agree that some of Blazegraph's extensions to SPARQL are useful, particularly for me the ability to easily access Wikidata labels in my language.
But Blazegraph appears to be unmaintained. The team that developed Blazegraph does not appear to be in a position to help fix problems in it, and no one else appears to be interested in fixing them. Errors and other issues with Blazegraph are negatively affecting WDQS. That's not a good state of affairs.
In my opinion the WDQS should be trying to get off Blazegraph.
peter
On 2/23/23 12:19 PM, James Heald wrote:
[...]
Hi James,
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
That isn't justification for tightly-coupling a Query Tool to a Query Service Endpoint, especially when an open standard (in the form of SPARQL) exists.
In particular I would highlight **named subqueries** and **Blazegraph's bd:sample service** as two "features and quirks" which should not be suppressed lightly.
See my comment above.
Use of named subqueries (ie queries that include an "INCLUDE %subquery" line) is consistently popular in the "query of the week" example queries featured in the weekly summary, and for good reasons:
- they can make complex long queries far more readable
- they can make optimisation of complex long queries a lot easier and a lot more transparent (or even possible at all)
- they can be essential to the performance of some queries, if there is a particular retrieved set that those queries then recall to reuse in more than one way.
The Blazegraph syntax for this is elegant.
See my comments above, which are about architecture fundamentals and the virtues of loose-coupling.
[...]
Please digest my comments above, since they have nothing to do with how Blazegraph implements its Query Service Endpoint :)
On 23/02/2023 20:08, Kingsley Idehen via Wikidata wrote:
On 2/23/23 12:19 PM, James Heald wrote:
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
That isn't justification for tightly-coupling a Query Tool to a Query Service Endpoint, especially when an open standard (in the form of SPARQL) exists.
Of course it's a good thing to be able to swap out the back-end and to be able to run essentially the same queries against other realisations of the database.
It's also a good thing to be able to clone the user interface and use essentially the same UI with a different back-end. (As I understand it, this should be very possible).
But. There are features which have been listed in the desiderata for WDQS from the very start, that go beyond what the out-of-the-box SPARQL 1.1 standard offers.
Most notable among these is the ability to retrieve items with coordinates close to a particular point on the earth's surface. (Something which, as the Blazegraph developers discovered, can be implemented fairly easily if you add a "Z-order curve" index on coordinate values: https://en.wikipedia.org/wiki/Z-order_curve)
Not all users will have an interest in geographical objects. Those who don't will lose little if they hook up a back-end that doesn't provide this, because presumably they won't be running queries which require it. But those who do need this functionality need this indexing.
Given that this was something the Blazegraph developers (all 3 of them) found they could add relatively easily; and given that it seems to me that any database back-end would gain considerable cachet by being able to run wikidata queries, it seems to me not unreasonable to approach potential alternative back-ends and see how easily they too might be able to add a Z-order curve index for coordinate values, plus basic functionality to make use of it. (Where wikibase:box and wikibase:around are about as basic as it gets).
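For reference, the basic usage an alternative back-end would need to support is about as simple as it sounds (the centre point and radius here are arbitrary):

    # Items with coordinates within 10 km of a given point
    SELECT ?place ?location WHERE {
      SERVICE wikibase:around {
        ?place wdt:P625 ?location .
        bd:serviceParam wikibase:center "Point(2.35 48.85)"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "10" .
      }
    }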
Andrea suggested a more GeoSPARQL-orientated solution ( https://wikitech.wikimedia.org/wiki/User:AndreaWest/Blazegraph_Features_and_... ), but that seems to me a much much bigger ask; I do suspect that (for almost all contending projects) the simple wikibase:box and wikibase:around services would be a lot more easily implemented, to free us from our tight-coupling to Blazegraph, yet still provide this functionality, which I do believe is a needed requirement.
As for named subqueries, as well as making queries much more readable, IMO they may be particularly valuable as a way to specify particular optimisations (ie the sequencing of query execution, which may be absolutely *crucial* if a query is to run) in a particularly readable and **portable** way -- certainly when compared to optimiser "hint" syntaxes, which may be tied *very* specifically to a particular back-end.
Why do I think named subqueries are so portable, if they are not part of the SPARQL 1.1 standard, and most providers don't support them ?
The answer is because if necessary it would require only a fairly simple pre-processor script to turn them into inline sub-queries, which *are* supported by the standard.
Named sub-queries have the advantage, though, of making the query a lot more readable; and they can be useful to indicate to the back-end that the sub-query need only be retrieved once, rather than repeatedly each time it is referenced (which may be helpful for some back-ends).
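To illustrate how mechanical that rewrite is: the hypothetical named subquery sketched earlier in the thread, declared as "WITH { SELECT ... } AS %people" and then "INCLUDE %people", would simply become a standard inline subquery:

    # Standard SPARQL 1.1 form: the subquery body is pasted inline at
    # each point where the named set was INCLUDEd
    SELECT ?country (COUNT(?person) AS ?count)
    WHERE {
      {
        SELECT ?person ?country WHERE {
          ?person wdt:P31 wd:Q5 ;
                  wdt:P27 ?country .
        }
      }
    }
    GROUP BY ?country

If the named set is INCLUDEd more than once, the pre-processor would paste the subquery at each point of use -- losing the compute-once hint, but preserving the results.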
So: I don't disagree that it would be useful if WDQS was less tightly dependent on Blazegraph.
But: rather than going straight to removing good features, I think there is a lot of scope for seeing whether the dev teams for other back-ends could be persuaded to match the features on those back-ends without too much difficulty; and that this would be a better path to at least investigate, in preference to breaking swathes of queries that are in active use.
-- James.
On 2/23/23 4:17 PM, James Heald wrote:
[...]
Of course it's a good thing to be able to swap out the back-end and to be able to run essentially the same queries against other realisations of the database.
It's also a good thing to be able to clone the user interface and use essentially the same UI with a different back-end. (As I understand it, this should be very possible).
Good to hear, since that's my fundamental point re loosely-coupled architecture enabled by open standards.
But. There are features which have been listed in the desiderata for WDQS from the very start, that go beyond what the out-of-the-box SPARQL 1.1 standard offers.
Therein lies the problem. A standards-based client can include extensions for a specific back-end in configurable form, based on loose-coupling principles. Doing it otherwise is what's generally known as a leaky abstraction, which ultimately racks up technical debt.
An example of technical debt that's manifesting right now is an inability to diffuse the costs of the Wikidata Knowledge Graph across a federation of SPARQL query service providers. This doesn't have to be the case at all, bearing in mind the nature of SPARQL and structured data represented using RDF.
Most notable among these is the ability to retrieve items with coordinates close to a particular point on the earth's surface. (Something which, as the Blazegraph developers discovered, can be implemented fairly easily if you add a "Z-order curve" index on coordinate values https://en.wikipedia.org/wiki/Z-order_curve ).
None of that would be lost in a WDQS instance configured to discover the SPARQL query endpoint and associated capabilities.
Not all users will have an interest in geographical objects. Those who don't will lose little if they hook up a back-end that doesn't provide this, because presumably they won't be running queries which require it. But those who do need this functionality need this indexing.
See my comment above.
[...]
These implementation details aren't really relevant to the fundamental point I am trying to make about the virtues of loosely-coupled architecture facilitated by existing open standards (e.g., SPARQL).
So: I don't disagree that it would be useful if WDQS was less tightly dependent on Blazegraph.
But: rather than going straight to removing good features, I think there is a lot of scope for seeing whether the dev teams for other back-ends could be persuaded to match the features on those back-ends without too much difficulty; and that this would be a better path to at least investigate, in preference to breaking swathes of queries that are in active use.
Nothing I've said amounts to feature removal. Everything I've said is simply about loosely-coupled architecture as a guiding principle for making WDQS usable against other SPARQL endpoints :)
Kingsley
On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen kidehen@openlinksw.com wrote:
[...]
The update process is mostly decoupled from the backend. It produces a stream of RDF updates that is backend independent, with a very thin Blazegraph-specific adapter to load the data into Blazegraph.
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
That data stream isn't exposed publicly. There are a few tricky parts about the stream needing to be synchronized with a specific Wikidata dump, which make it not entirely trivial to reuse outside of our internal use case. But if there is enough interest, we could potentially work on making that stream public.
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing?
The WDQS UI is managed by WMDE, so my knowledge of it is limited. Maybe someone from WMDE could jump in and add more context. That being said, as far as I know, pointing it to a different backend is just a configuration option. Feel free to have a look at the code (https://gerrit.wikimedia.org/g/wikidata/query/gui). It should be reasonably easy to deploy another WDQS UI instance somewhere else, pointing to whatever backend you'd like.
As a policy, we don't send traffic to any third party, so we will not set up such an instance.
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
It depends what you mean by Query Service. My definition of a Query Service in this context is a SPARQL endpoint with a specific data set. That SPARQL endpoint at the moment is Blazegraph. I'm not entirely clear what kind of loose binding you'd like to see in this context. We might have different definitions of the same words here.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
I'm not sure I entirely understood the question, please let me know if my answer is missing the point.
Have fun!
Guillaume
On 2/23/23 3:09 PM, Guillaume Lederrey wrote:
[...]
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
That data stream isn't exposed publicly. There are a few tricky parts about the stream needing to be synchronized with a specific Wikidata dump, which make it not entirely trivial to reuse outside of our internal use case. But if there is enough interest, we could potentially work on making that stream public.
I suspect there's broad interest in this matter since it contributes to the overarching issue of loose-coupling re Wikidata's underlying infrastructure.
For starters, offering a public stream would be very useful to 3rd party Wikidata hosts.
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing?
The WDQS UI is managed by WMDE, so my knowledge of it is limited. Maybe someone from WMDE could jump in and add more context. That being said, as far as I know, pointing it to a different backend is just a configuration option. Feel free to have a look at the code (https://gerrit.wikimedia.org/g/wikidata/query/gui).
I'll take a look.
It should be reasonably easy to deploy another WDQS UI instance somewhere else, which points to whatever backend you'd like.
Okay, I assume that in the current state it would be sending Blazegraph-specific SPARQL?
As a policy, we don't send traffic to any third party, so we will not set up such an instance.
The exposed SPARQL endpoint is at the moment a direct exposition of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
Is there a Query Service that's separated from the Blazegraph endpoint? The crux of the matter here is that WDQS benefits more by being loosely bound to endpoints rather than tightly bound to the Blazegraph endpoint.
It depends what you mean by Query Service. My definition of a Query Service in this context is a SPARQL endpoint with a specific data set.
Yes, but in the case of Wikidata that's a combination of a SPARQL Query Service (query processor and endpoint) and the WDQS query-solution rendering services.
That SPARQL endpoint at the moment is Blazegraph. I'm not entirely clear what kind of loose binding you'd like to see in this context. We might have different definitions of the same words here.
Loose-coupling, in the context I am describing, would comprise the following:
1. WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI https://github.com/TriplyDB/Yasgui#this
2. Near real-time data streams usable by 3rd Party Wikidata hosts
With the above in place, the cost and burden associated with Wikidata hosting will also be reduced -- courtesy of federation.
What we would like to do at some point (this is not more than a rough idea at this point) is to add a proxy in front of the SPARQL endpoint, that would filter specific SPARQL features, so that we limit what is available to a standard set of features available across most potential backends. This would help reduce the coupling of queries with the backend. Of course, this would have the drawback of limiting the feature set.
As you've stated, that's narrowing service focus rather than diffusing service burden :)
Kingsley
On Thu, 23 Feb 2023 at 22:56, Kingsley Idehen kidehen@openlinksw.com wrote:
[...]
It should be reasonably easy to deploy another WDQS UI instance somewhere else, which points to whatever backend you'd like.
Okay, I assume that in the current state it would be sending Blazegraph-specific SPARQL?
Again, not my area of expertise, but I assume that the UI itself issues fairly standard SPARQL. Of course, user queries will use whatever they want. It does have dependencies on our Search interface as well, so that would have to be replicated.
[...]
Loose-coupling, in the context I am describing, would comprise the following:
- WDQS that can be bolted on to any SPARQL endpoint, just like YASGUI
In this context, I would say "the WDQS UI can be bolted onto any SPARQL endpoint". In terms of SPARQL itself, that should already be mostly the case. I think there is a dependency on Search as well.
- Near real-time data streams usable by 3rd Party Wikidata hosts
With the above in place, the cost and burden associated with Wikidata hosting will also be reduced -- courtesy of federation.
Could you please open a Phabricator task to document what you would like to see exposed and why it would be useful?
-- *Guillaume Lederrey* (he/him) Engineering Manager Wikimedia Foundation https://wikimediafoundation.org/
On 2/24/23 5:59 AM, Guillaume Lederrey wrote:
Does that mean that we could integrate the RDF stream into our setup re keeping our Wikidata instance up to date, for instance?
That data stream isn't exposed publicly. There are a few tricky parts about the stream needing to be synchronized with a specific Wikidata dump that make it not entirely trivial to reuse outside of our internal use case. But if there is enough interest, we could potentially work on making that stream public.
I suspect there's broad interest in this matter since it contributes to the overarching issue of loose-coupling re Wikidata's underlying infrastructure. For starters, offering a public stream would be very useful to 3rd party Wikidata hosts.
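For a feel of what consuming such a stream would look like: the internal RDF mutation stream is not public, but the public Wikimedia EventStreams recentchange feed has roughly the same shape, so the sketch below uses it as a stand-in. SSE handling is simplified; a real consumer would translate each change into RDF triples to add/remove, as the internal streaming updater does.

    # Stand-in consumer using the public EventStreams recentchange feed.
    import json
    from urllib.request import Request, urlopen

    URL = "https://stream.wikimedia.org/v2/stream/recentchange"

    req = Request(URL, headers={"Accept": "text/event-stream"})
    with urlopen(req) as stream:
        for raw in stream:
            line = raw.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue  # skip SSE comments, event names, keep-alives
            event = json.loads(line[len("data:"):])
            if event.get("wiki") == "wikidatawiki":
                # A real consumer would turn this into RDF adds/removes here.
                print(event["title"], event["type"])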
The UI is mostly backend independent. It relies on Search for some features. And of course, the queries themselves might depend on Blazegraph-specific features.
Can WDQS, based on what's stated above, work with a generic SPARQL back-end like Virtuoso, for instance? By that I mean dispatch SPARQL queries input by a user (without alteration) en route to server processing? The WDQS UI is managed by WMDE, so my knowledge of it is limited. Maybe someone from WMDE could jump in and add more context. That being said, as far as I know, pointing it to a different backend is just a configuration option. Feel free to have a look at the code (https://gerrit.wikimedia.org/g/wikidata/query/gui).
I'll take a look.
It should be reasonably easy to deploy another WDQS UI instance somewhere else, which points to whatever backend you'd like.
Okay, I assume that in the current state it would be sending Blazegraph-specific SPARQL?
Again, not my area of expertise, but I assume that the UI itself is issuing fairly standard SPARQL. Of course, user queries will use whatever they want. It does have dependencies on our Search interface as well, so that would have to be replicated.
You mean WDQS has a Text Search interface component that's intertwined with the Query Service provided by the Wikidata SPARQL Endpoint?
As a policy, we don't send traffic to any third party, so we will not set up such an instance.
The exposed SPARQL endpoint is at the moment a direct exposure of the Blazegraph endpoint, so it does expose all the Blazegraph-specific features and quirks.
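For illustration, here is the kind of query that runs on WDQS today but would not port unmodified to a generic SPARQL endpoint, because it leans on Blazegraph-only query hints and the WDQS-specific label service (a sketch; the User-Agent string is made up, and the wd:/wdt:/wikibase:/bd: prefixes are predeclared by WDQS):

    # A WDQS query that is coupled to Blazegraph extensions.
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    QUERY = """
    PREFIX hint: <http://www.bigdata.com/queryHints#>
    SELECT ?item ?itemLabel WHERE {
      hint:Query hint:optimizer "None" .   # Blazegraph-only optimizer hint
      ?item wdt:P31 wd:Q146 .              # instance of: house cat
      SERVICE wikibase:label {             # WDQS-specific label service
        bd:serviceParam wikibase:language "en" .
      }
    } LIMIT 5
    """

    url = "https://query.wikidata.org/sparql?" + urlencode({"query": QUERY})
    req = Request(url, headers={"Accept": "application/sparql-results+json",
                                "User-Agent": "coupling-demo/0.1"})
    print(urlopen(req).read()[:500])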
As per my earlier comment, I don't quite understand what you are referring to regarding the Search (Free Text Querying) intermingling. Does this relate to SPARQL Query Patterns comprising literal objects? If so, WDQS should be able to constrain such behavior to Blazegraph instances -- by way of configuration that informs introspection.
2. Near real-time data streams usable by 3rd Party Wikidata hosts
With the above in place, the cost and burden associated with Wikidata hosting will also be reduced -- courtesy of federation.
Could you please open a Phabricator task to document what you would like to see exposed and why it would be useful?
Okay, when I (or someone else) get a moment.
On Fri, 24 Feb 2023 at 19:31, Kingsley Idehen via Wikidata < wikidata@lists.wikimedia.org> wrote:
As per my earlier comment, I don't quite understand what you are referring to regarding the Search (Free Text Querying) intermingling. Does this relate to SPARQL Query Patterns comprising literal objects? If so, WDQS should be able to constrain such behavior to Blazegraph instances -- by way of configuration that informs introspection.
WDQS UI relies on a Search endpoint (backed by Elasticsearch) for autocompletion. The requirements of low latency and reasonable ranking are something that Elasticsearch (or another search-oriented backend) does really well. But I would not expect an RDF backend to offer good ranking heuristics.
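As far as I know, that autocompletion goes through the Wikibase search API (the wbsearchentities module) rather than through SPARQL, which is the Search dependency mentioned above. A minimal sketch of that kind of call (the search term and User-Agent are just examples):

    # Entity autocompletion via the Wikibase search API.
    import json
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    params = urlencode({
        "action": "wbsearchentities",  # Wikibase entity search module
        "search": "dougl",             # the user's partial input
        "language": "en",
        "format": "json",
    })
    req = Request("https://www.wikidata.org/w/api.php?" + params,
                  headers={"User-Agent": "autocomplete-demo/0.1"})
    for hit in json.load(urlopen(req))["search"]:
        print(hit["id"], hit.get("label"), "-", hit.get("description", ""))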
On 2/27/23 10:15 AM, Guillaume Lederrey wrote:
WDQS UI relies on a Search endpoint (backed by Elasticsearch) for autocompletion. The requirements of low latency and reasonable ranking are something that Elasticsearch (or another search-oriented backend) does really well. But I would not expect an RDF backend to offer good ranking heuristics.
Virtuoso has always included text ranking as part of its native free-text indexing functionality. That said, these are back-end details that WDQS should be loosely bound to via configuration.
Example:
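Something along these lines -- a rough sketch of Virtuoso's free-text scoring via its bif:contains predicate with an OPTION (score ...) clause. The endpoint URL below is a placeholder for any Virtuoso-hosted Wikidata instance:

    # Virtuoso free-text search with relevance ranking (sketch).
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    QUERY = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?o ?sc WHERE {
      ?s rdfs:label ?o .
      ?o bif:contains '"query service"' OPTION (score ?sc) .
    }
    ORDER BY DESC(?sc)
    LIMIT 10
    """

    url = "http://example.org/sparql?" + urlencode({"query": QUERY})  # placeholder host
    req = Request(url, headers={"Accept": "application/sparql-results+json"})
    print(urlopen(req).read()[:500])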
The key thing here is to decouple WDQS such that it can work with other back-ends, en route to a much more resilient federation of Wikidata Knowledge Graph instances.
There's too much Blazegraph specificity in place right now.
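The decoupling argument in practice: the same standard SPARQL 1.1 query sent over the standard protocol to two different backends. Only query.wikidata.org is real below; the second URL stands in for a hypothetical third-party Wikidata host, and the query declares its own prefixes so it assumes nothing backend-specific:

    # Backend-agnostic query over the standard SPARQL 1.1 protocol.
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    QUERY = """PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT ?p ?o WHERE { wd:Q42 ?p ?o } LIMIT 5"""

    ENDPOINTS = [
        "https://query.wikidata.org/sparql",   # Blazegraph-backed WDQS
        "http://example.org/wikidata/sparql",  # hypothetical Virtuoso host
    ]

    for endpoint in ENDPOINTS:
        req = Request(endpoint + "?" + urlencode({"query": QUERY}),
                      headers={"Accept": "application/sparql-results+json",
                               "User-Agent": "portability-demo/0.1"})
        try:
            print(endpoint, "->", urlopen(req, timeout=30).read()[:200])
        except OSError as err:  # the hypothetical host will fail
            print(endpoint, "->", err)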
On 2/24/23 2:25 PM, Samuel Klein wrote:
This is an important topic. Let's migrate off of Blazegraph.
No, really: what's the status of WDQS backend updates (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update), like risk projections and timelines for migration? [1]
[1] https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/thread/B4GTI6TDEKS7Q2OMJR26XWLFYMUXSR6F/#YORJB4AYYSUFSYM7H3VTZSOZBC4GTEOZ
Guillaume Lederrey glederrey@wikimedia.org wrote:
2. Near real-time data streams usable by 3rd Party Wikidata hosts
Could you please open a Phabricator task to document what you would like to see exposed and why it would be useful?
I started a ticket: https://phabricator.wikimedia.org/T330521
Anyone interested, please edit as needed.
Hi Samuel,
Thanks for opening up that ticket!
Kingsley
Hi Guillaume,
Which file system is used with Blazegraph -- NFS, Ext4, etc.? Specifically, which file system is used where the journal files are written and read? [1] Looking at the code, it seems there could be cases where unreported errors happen around file locking.
[1] https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/...
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Wed, 22 Feb 2023 at 04:45, Thad Guidry thadguidry@gmail.com wrote:
Hi Guillaume,
Which file system is used with Blazegraph -- NFS, Ext4, etc.? Specifically, which file system is used where the journal files are written and read? [1] Looking at the code, it seems there could be cases where unreported errors happen around file locking.
We are using Ext4. I don't understand enough about the Blazegraph internals to know if that might be an issue or not. But given your question, I assume that the locking issues are probably more related to running on NFS.
[1] https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/...
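To illustrate the class of problem being pointed at here -- not Blazegraph's actual code, which is Java and presumably uses java.nio file locks -- a POSIX-only sketch: advisory flock() locks are reliably enforced between local processes on a file system like Ext4, while over NFS the same call has historically come with weaker guarantees, which is exactly why the file system question matters.

    # Advisory file locking, as enforced on a local filesystem (POSIX-only).
    import fcntl

    with open("/tmp/journal.lock", "w") as lockfile:
        # Take an exclusive, non-blocking advisory lock on the lock file.
        # A second process doing the same gets BlockingIOError on Ext4;
        # over NFS this call is emulated and has weaker guarantees.
        fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("lock acquired; safe to write the journal")
        # ... write journal records here ...
        fcntl.flock(lockfile, fcntl.LOCK_UN)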
-- *Guillaume Lederrey* (he/him) Engineering Manager Wikimedia Foundation https://wikimediafoundation.org/