Hi Guillaume,
Which file system is used with Blazegraph? Is it NFS, ext4, etc.?
Specifically, which file system are the Journal files written to and read
from? [1]
Looking at the code, it seems there are cases where unreported errors
can occur around file locking.
[1]
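To illustrate the kind of silent failure I mean, here is a minimal sketch (not Blazegraph's actual code) using java.nio file locking. The file name is hypothetical. FileChannel.tryLock() returns null, rather than throwing, when another process already holds the lock, so a caller that never checks the return value proceeds without the lock and no error is ever reported. On network file systems such as NFS, whether these locks actually exclude other processes is system-dependent, which is why the file system matters.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class JournalLockCheck {

    // Attempts an exclusive OS-level lock on the given file.
    // tryLock() returns null -- it does NOT throw -- when the lock is
    // held by another process; code that ignores the null return value
    // continues with an unlocked file and no error is surfaced.
    public static boolean tryExclusiveLock(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock();
            if (lock == null) {
                return false; // another process holds the lock
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // "journal.jnl" is a placeholder name, not the real journal path.
        System.out.println(tryExclusiveLock(Path.of("journal.jnl")));
    }
}
```

Note that java.nio file locks are held on behalf of the whole JVM, so two threads in the same process contending for the journal would not even be detected by this mechanism.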
On Wed, Feb 22, 2023 at 5:06 AM Guillaume Lederrey <glederrey(a)wikimedia.org>
wrote:
Hello all!
TL;DR: We expect to successfully complete the recent data reload on
Wikidata Query Service soon, but we've encountered multiple failures
related to the size of the graph, and anticipate that this issue may worsen
in the future. Although we succeeded this time, we cannot guarantee that
future reload attempts will be successful given the current trend of the
data reload process. Thank you for your understanding and patience.
Longer version:
WDQS is updated from a stream of recent changes on Wikidata, with a
maximum delay of ~2 minutes. This process was improved as part of the WDQS
Streaming Updater project to ensure data coherence[1]. However, the update
process is still imperfect and can lead to data inconsistencies in some
cases[2][3]. To address this, we reload the data from dumps a few times per
year to reinitialize the system from a known good state.
The recent reload of data from dumps started in mid-December and
initially ran into issues related to downloading the dumps and to
instabilities in Blazegraph, the database used by WDQS[4]. Loading the data
into Blazegraph takes a couple of weeks due to the size of the graph, and we
had multiple attempts in which the reload failed after more than 90% of the
data had been loaded.
Our understanding is that a "race condition" in Blazegraph[5], where subtle
timing changes lead to corruption of the journal in rare cases, is to
blame[6].
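For readers unfamiliar with the term, here is a minimal Java illustration of a race condition in general (it is not related to Blazegraph's internals): two threads perform an unsynchronized read-modify-write on a shared counter, and depending on how their operations interleave, some updates are silently lost.

```java
public class RaceDemo {

    // Shared mutable state with no synchronization.
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        // counter++ is three steps (read, add, write), not one atomic
        // operation, so two threads can read the same value and one
        // increment is lost whenever their steps interleave.
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++;
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        // The result is at most 200000, and typically less, because the
        // exact value depends on thread scheduling -- the hallmark of a
        // race condition is that subtle timing changes alter the outcome.
        System.out.println(counter);
    }
}
```

Fixing this particular toy case is as simple as using java.util.concurrent.atomic.AtomicInteger, but races hidden in larger systems are much harder to find, which is why they can survive in mature software for years.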
We want to reassure you that the last reload job was successful on one of
our servers. The data still needs to be copied over to all of the WDQS
servers, which will take a couple of weeks, but this should not introduce
any additional issues. However, reloading the full data from dumps is becoming
more complex as the data size grows, and we wanted to let you know why the
process took longer than expected. We understand that data inconsistencies
can be problematic, and we appreciate your patience and understanding while
we work to ensure the quality and consistency of the data on WDQS.
Thank you for your continued support and understanding!
Guillaume
[1] https://phabricator.wikimedia.org/T244590
[2] https://phabricator.wikimedia.org/T323239
[3] https://phabricator.wikimedia.org/T322869
[4] https://phabricator.wikimedia.org/T323096
[5] https://en.wikipedia.org/wiki/Race_condition#In_software
[6] https://phabricator.wikimedia.org/T263110
--
*Guillaume Lederrey* (he/him)
Engineering Manager
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikidata mailing list -- wikidata(a)lists.wikimedia.org
Public archives at
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/me…
To unsubscribe send an email to wikidata-leave(a)lists.wikimedia.org