I’d like to provide a bit more background and summarize our work on
the new WDQS updater from a technical perspective.
It has been common knowledge that the old updater had its issues. The main
ones were:
* Low throughput, which often caused huge spikes of lag that were very hard
  to recover from (this is a nice example - ).
* Reliance on Blazegraph to reconcile the data - Blazegraph’s reads
  affect writes and vice versa, which quite often caused cascading failures
  in both update latency and query performance.
* Ineffective handling of eventual consistency - this was one of the
  reasons for missing data in WDQS. What is worse, we had very little
  visibility into what went missing.
We’ll be publishing a series of blog posts that will provide a more
in-depth description of the architecture and the challenges during
development - stay tuned!
In the meantime, I want to explain a few things about the new updater:
* Higher best-case lag is the result of the decision to trade low
  latency for high consistency - considering the data we lost with the old
  updater, we think this approach is better in our situation. We would rather
  have a complete data set than a faster, incomplete one. To make sure that
  we’re keeping the lag manageable, we introduced an SLO and will
  introduce alerting aimed at keeping the lag under 10 minutes.
* Data is reconciled within the pipeline, which has a dramatically lower
  effect on Blazegraph. This should help with the updates, which was the
  goal, but it also positively affects query engine stability.
* As we previously mentioned in the general announcement, the difference
  in throughput is substantial (10 edits/sec vs 88 edits/sec), which means
  a much faster catch-up and more room to grow for Wikidata. The new updater
  can be scaled even further if necessary.
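
To put the throughput numbers in perspective, here is a rough back-of-the-envelope sketch of catch-up time. It reads 10 edits/sec as the old updater and 88 edits/sec as the new one. The incoming edit rate and the backlog size are made-up values for illustration only, not measurements from production.

```python
def catch_up_seconds(backlog_edits: float, throughput: float, incoming_rate: float) -> float:
    """Seconds needed to drain a backlog while new edits keep arriving."""
    net_rate = throughput - incoming_rate  # edits/sec actually removed from the backlog
    if net_rate <= 0:
        # The updater cannot keep up at all, so the lag never recovers.
        return float("inf")
    return backlog_edits / net_rate


OLD_THROUGHPUT = 10.0  # edits/sec, figure quoted in this message
NEW_THROUGHPUT = 88.0  # edits/sec, figure quoted in this message
INCOMING_RATE = 8.0    # edits/sec, hypothetical value
BACKLOG = 36_000       # edits waiting to be applied, hypothetical value

old_minutes = catch_up_seconds(BACKLOG, OLD_THROUGHPUT, INCOMING_RATE) / 60
new_minutes = catch_up_seconds(BACKLOG, NEW_THROUGHPUT, INCOMING_RATE) / 60
print(f"old updater: {old_minutes:.1f} min, new updater: {new_minutes:.1f} min")
# prints "old updater: 300.0 min, new updater: 7.5 min"
```

Under these assumptions the headroom above the incoming rate is what matters: 2 edits/sec of spare capacity versus 80 edits/sec, which is why lag spikes were so hard to recover from before.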
The new Streaming Updater didn’t magically resolve all the issues, and there
are still two main ones that we need to address:
* Data loss - the reconciliation mechanism works better than in the old
  updater, where we literally lost updates without any way of knowing about
  it other than user feedback - . That is a really bad way of finding
  out about issues. The new Streaming Updater can still miss data, especially
  due to late events or eventual consistency, as mentioned before. One thing
  that changed, however, is that the new updater has better
  inconsistency/late event reporting, which allows us to build a subsystem
  around it to reconcile the data. More information here - .
* Blazegraph instability - no matter how fast and stable the new updater
  might be, Blazegraph is still the last node in the process. That means that
  the whole update process will be affected by Blazegraph’s instability,
  which will in turn produce lag. One of the most common reasons for that
  instability is a so-called “GC death spiral”. A server in that state won’t
  answer any queries (which is a problem in itself), and even after a restart
  the lag will be high for some time. We are investigating a solution that
  can help us with this - .
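
To illustrate the idea behind late event reporting mentioned under data loss, here is a minimal sketch. It is not the actual Streaming Updater code: the `RevisionEvent` and `LateEventReporter` names, and the rule of treating any revision at or below the latest one seen as late, are assumptions made for this example. The point is only that late events are recorded for a later reconciliation step rather than silently dropped.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class RevisionEvent:
    entity_id: str  # e.g. "Q1"
    revision: int   # revision number of the edit


@dataclass
class LateEventReporter:
    """Tracks the latest revision seen per entity and records late events."""
    latest: Dict[str, int] = field(default_factory=dict)
    late_events: List[RevisionEvent] = field(default_factory=list)

    def process(self, event: RevisionEvent) -> bool:
        """Return True if the event is applied, False if it is reported as late."""
        seen = self.latest.get(event.entity_id)
        if seen is not None and event.revision <= seen:
            # Arrived after a newer revision: keep it so that a
            # reconciliation step can look at the entity again.
            self.late_events.append(event)
            return False
        self.latest[event.entity_id] = event.revision
        return True


reporter = LateEventReporter()
results = [
    reporter.process(RevisionEvent("Q1", 5)),
    reporter.process(RevisionEvent("Q1", 7)),
    reporter.process(RevisionEvent("Q1", 6)),  # out of order, reported as late
]
print(results, reporter.late_events)
```

The recorded `late_events` list stands in for whatever channel a real pipeline would use to feed a reconciliation subsystem.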
I hope that answers at least some of the concerns already raised. Rest
assured that we are working on many more things than the updater to improve
the experience, all of which are, as always, available to see on our
backlog board () and workboard ().
Any and all feedback welcome!
Zbyszko Papierski (He/Him)
Senior Software Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>