On Thu, 23 Feb 2023 at 16:39, Kingsley Idehen <kidehen(a)openlinksw.com> wrote:
On 2/22/23 3:28 AM, Guillaume Lederrey wrote:
On Wed, 22 Feb 2023 at 00:03, Kingsley Idehen via Wikidata <wikidata(a)lists.wikimedia.org> wrote:
On 2/21/23 4:05 PM, Guillaume Lederrey wrote:
Hello all!
TL;DR: We expect to successfully complete the recent data reload on
Wikidata Query Service soon, but we've encountered multiple failures
related to the size of the graph, and anticipate that this issue may
worsen in the future. Although we succeeded this time, we cannot
guarantee that future reload attempts will be successful given the
current trend of the data reload process. Thank you for your
understanding and patience.
Longer version:
WDQS is updated from a stream of recent changes on Wikidata, with a
maximum delay of ~2 minutes. This process was improved as part of the
WDQS Streaming Updater project to ensure data coherence[1]. However,
the update process is still imperfect and can lead to data
inconsistencies in some cases[2][3]. To address this, we reload the
data from dumps a few times per year to reinitialize the system from a
known good state.
The recent reload of data from dumps started in mid-December and was
initially met with some issues related to download and instabilities
in Blazegraph, the database used by WDQS[4]. Loading the data into
Blazegraph takes a couple of weeks due to the size of the graph, and
we had multiple attempts where the reload failed after 90% of the
data had been loaded. Our understanding of the issue is that a "race
condition" in Blazegraph[5], where subtle timing changes lead to
corruption of the journal in some rare cases, is to blame.[6]
We want to reassure you that the last reload job was successful on one
of our servers. The data still needs to be copied over to all of the
WDQS servers, which will take a couple of weeks, but should not bring
any additional issues. However, reloading the full data from dumps is
becoming more complex as the data size grows, and we wanted to let you
know why the process took longer than expected. We understand that
data inconsistencies can be problematic, and we appreciate your
patience and understanding while we work to ensure the quality and
https://en.wikipedia.org/wiki/Race_condition#In_software
Hi Guillaume,
Are there plans to decouple WDQS from the back-end database? Doing that
would provide a more resilient architecture for Wikidata as a whole,
since you would be able to swap and interchange SPARQL-compliant
backends.
It depends what you mean by decoupling. The coupling points
as I see them are:
* update process
* UI
* exposed SPARQL endpoint
The update process is mostly decoupled from the backend. It
produces a stream of RDF updates that is backend-independent,
with a very thin Blazegraph-specific adapter to load the data
into Blazegraph.
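
As a rough illustration of that split (not the actual WDQS Streaming
Updater code), the sketch below keeps the update itself backend-neutral
and confines anything backend-specific to a small adapter. All names
here (RdfUpdate, BackendAdapter, to_sparql_update, RecordingAdapter) are
hypothetical, and rendering updates as standard SPARQL 1.1 Update text
is an assumption made for the example, not a description of how
Blazegraph is actually fed.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple

# (subject, predicate, object), each already serialized as an RDF term.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class RdfUpdate:
    """One backend-independent change: triples to remove and triples to add."""
    deleted: List[Triple] = field(default_factory=list)
    added: List[Triple] = field(default_factory=list)


class BackendAdapter(Protocol):
    """Hypothetical interface: the only backend-specific piece."""
    def apply(self, update: RdfUpdate) -> None: ...


def to_sparql_update(update: RdfUpdate) -> str:
    """Render an update as SPARQL 1.1 Update text (DELETE DATA / INSERT DATA)."""
    def block(keyword: str, triples: List[Triple]) -> str:
        body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
        return f"{keyword} DATA {{\n{body}\n}}"

    parts = []
    if update.deleted:
        parts.append(block("DELETE", update.deleted))
    if update.added:
        parts.append(block("INSERT", update.added))
    return " ;\n".join(parts)


class RecordingAdapter:
    """Stand-in adapter that records statements instead of sending them."""
    def __init__(self) -> None:
        self.sent: List[str] = []

    def apply(self, update: RdfUpdate) -> None:
        self.sent.append(to_sparql_update(update))
```

Under these assumptions, supporting another store would mean writing
another small adapter class, while the code producing RdfUpdate objects
stays unchanged.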
Does that mean that we could integrate the RDF stream into
our setup re keeping our Wikidata instance up to date, for
instance?
That data stream isn't exposed publicly. There are a few tricky
parts about the stream needing to be synchronized with a specific
Wikidata dump that make it not entirely trivial to reuse outside
of our internal use case. But if there is enough interest, we
could potentially work on making that stream public.
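
To make the synchronization constraint concrete, here is a minimal
sketch of one way a consumer might line up a stream with a dump. The
thread does not describe the real stream format or how the alignment is
actually done, so everything below is an assumption: the StreamEvent
type, its event_time and payload fields, and the idea that comparing a
timestamp against the dump time is sufficient.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class StreamEvent(NamedTuple):
    event_time: datetime  # hypothetical field: when the edit happened
    payload: str          # hypothetical field: the serialized RDF update


def events_after_dump(
    events: Iterable[StreamEvent], dump_time: datetime
) -> Iterator[StreamEvent]:
    """Yield only events newer than the dump, preserving stream order.

    Events at or before dump_time are assumed to be already reflected
    in the dump, so replaying them would be redundant.
    """
    for event in events:
        if event.event_time > dump_time:
            yield event
```

A store loaded from a dump taken at dump_time would then apply only the
events this generator yields. Starting from a different dump means
choosing a different dump_time, which is why the stream cannot be
consumed independently of the dump it is paired with.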