On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
Hello all!
TL;DR: We expect to successfully complete the recent data reload on
Wikidata Query Service soon, but we've encountered multiple failures
related to the size of the graph, and anticipate that this issue may
worsen in the future. Although we succeeded this time, we cannot
guarantee that future reload attempts will succeed, given how the
reload process has been trending. Thank you for your understanding
and patience.
Longer version:
WDQS is updated from a stream of recent changes on Wikidata, with a
maximum delay of ~2 minutes. This process was improved as part of the
WDQS Streaming Updater project to ensure data coherence[1]. However,
the update process is still imperfect and can lead to data
inconsistencies in some cases[2][3]. To address this, we reload the
data from dumps a few times per year to reinitialize the system from a
known good state.
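For the curious, the steady-state update path described above can be observed from the outside: the query service exposes its last-update timestamp via a schema:dateModified triple, so a client can estimate the current lag itself. A minimal sketch in Python, assuming the standard SPARQL JSON result format; parse_lag_seconds is an illustrative helper, not part of any official client:

```python
# Minimal sketch: estimate how far a WDQS server lags behind Wikidata,
# using the schema:dateModified triple documented in the WDQS user manual.
# parse_lag_seconds is an illustrative helper, not an official API.
from datetime import datetime, timezone

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The query service exposes its last-update timestamp as a triple:
LAG_QUERY = """
SELECT ?dateModified WHERE {
  <http://www.wikidata.org> schema:dateModified ?dateModified
}
"""

def parse_lag_seconds(sparql_json, now=None):
    """Given a SPARQL 1.1 JSON result for LAG_QUERY, return the lag in seconds."""
    value = sparql_json["results"]["bindings"][0]["dateModified"]["value"]
    last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - last_update).total_seconds()

# Example with a canned response (a live check would issue an HTTP GET to
# WDQS_ENDPOINT with the query and parse the JSON body the same way):
sample = {"results": {"bindings": [
    {"dateModified": {"value": "2023-02-21T16:03:00Z"}}]}}
ref_now = datetime(2023, 2, 21, 16, 5, tzinfo=timezone.utc)
print(parse_lag_seconds(sample, now=ref_now))  # 120.0 -> the ~2 minute delay
```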
The recent reload of data from dumps started in mid-December and
initially ran into issues with downloads and with instability in
Blazegraph, the database used by WDQS[4]. Loading the data into
Blazegraph takes a couple of weeks due to the size of the graph, and
several attempts failed after more than 90% of the data had been
loaded. Our current understanding is that the failures are caused by
a "race condition"[5] in Blazegraph, where subtle timing changes can,
in rare cases, corrupt the journal[6].
We want to reassure you that the last reload job was successful on one
of our servers. The data still needs to be copied over to all of the
WDQS servers, which will take a couple of weeks, but this should not
introduce any additional issues. However, reloading the full data from dumps is
becoming more complex as the data size grows, and we wanted to let you
know why the process took longer than expected. We understand that
data inconsistencies can be problematic, and we appreciate your
patience and understanding while we work to ensure the quality and
consistency of the data on WDQS.
Thank you for your continued support and understanding!
Guillaume
[1]
https://phabricator.wikimedia.org/T244590
[2]
https://phabricator.wikimedia.org/T323239
[3]
https://phabricator.wikimedia.org/T322869
[4]
https://phabricator.wikimedia.org/T323096
[5]
https://en.wikipedia.org/wiki/Race_condition#In_software
[6]
https://phabricator.wikimedia.org/T263110
Hi Guillaume,
Are there plans to decouple WDQS from the back-end database? Doing
that would provide a more resilient architecture for Wikidata as a
whole, since you would be able to swap and interchange
SPARQL-compliant backends.
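To illustrate the decoupling point: the SPARQL 1.1 Protocol is a plain HTTP interface, so a client written against it depends only on an endpoint URL, not on the backend behind it. A minimal Python sketch; build_request is an illustrative helper, and the Virtuoso endpoint URL below is an assumption for the example:

```python
# Sketch of backend-agnostic querying via the SPARQL 1.1 Protocol:
# swapping Blazegraph for another SPARQL-compliant store is just a
# different endpoint URL. build_request is an illustrative helper.
import urllib.parse
import urllib.request

ENDPOINTS = {
    "blazegraph": "https://query.wikidata.org/sparql",  # WDQS (Blazegraph)
    # Assumed URL for illustration, not a confirmed service address:
    "virtuoso": "https://wikidata.demo.openlinksw.com/sparql",
}

def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})

# The client code is identical regardless of which backend serves it:
req = build_request(ENDPOINTS["blazegraph"],
                    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
print(req.full_url.split("?")[0])  # https://query.wikidata.org/sparql
```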
BTW -- we are going to make AWS and even Azure hosted instances
(offered on a pay-as-you-go basis) of our Virtuoso-hosted edition of
Wikidata (which we recently reloaded) available.
--
Regards,
Kingsley Idehen
Founder & CEO
OpenLink Software
Home Page:
http://www.openlinksw.com
Community Support:
https://community.openlinksw.com
Weblogs (Blogs):
Company Blog:
https://medium.com/openlink-software-blog
Virtuoso Blog:
https://medium.com/virtuoso-blog
Data Access Drivers Blog:
https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers
Personal Weblogs (Blogs):
Medium Blog:
https://medium.com/@kidehen
Legacy Blogs:
http://www.openlinksw.com/blog/~kidehen/
http://kidehen.blogspot.com
Profile Pages:
Pinterest:
https://www.pinterest.com/kidehen/
Quora:
https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter:
https://twitter.com/kidehen
Google+:
https://plus.google.com/+KingsleyIdehen/about
LinkedIn:
http://www.linkedin.com/in/kidehen
Web Identities (WebID):
Personal:
http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this