Follow-up on the new WDQS Updater - Wikidata-tech

27 Oct 2021

Hello!

I’d like to provide a bit more background on our work on the new WDQS
updater and summarize it from a technical perspective.

It has been common knowledge that the old updater had its issues. The main
ones were:

   - Low throughput, which often caused huge spikes of lag that were very
     hard to recover from (this is a nice example - [1]).

   - Reliance on Blazegraph to reconcile the data - Blazegraph’s reads
     affect writes and vice versa, which quite often caused a cascading
     failure for both update latency and query performance.

   - Ineffective handling of eventual consistency - this was one of the
     reasons for missing data in WDQS. What is worse, we had very low
     visibility into what went missing.

We’ll be publishing a series of blog posts that will provide a more
in-depth description of the architecture and the challenges during
development - stay tuned!

In the meantime, I want to explain a few things about the new updater:

   - The higher best-case lag is the result of a decision to trade low
     latency for high consistency - considering the data we lost with the
     old updater, we think this approach is better in our situation. We
     would rather have a complete data set than a faster incomplete one. To
     make sure that we’re keeping the lag manageable, we introduced an SLO
     [2] and will introduce alerting aimed at keeping the lag under 10
     minutes.

   - Data is reconciled within the pipeline, which has a dramatically lower
     impact on Blazegraph. This should help with the updates, which was the
     goal, but it also improves query engine stability.

   - As we previously mentioned in the general announcement, the difference
     in throughput is substantial (10 edits/sec for the old updater vs 88
     edits/sec for the new one), which means a much faster catch-up and
     more room for Wikidata to grow. The new updater can be scaled even
     further if necessary.
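
To make the throughput difference a bit more concrete, here is a rough
back-of-the-envelope sketch in Python. Only the 10 and 88 edits/sec figures
come from the announcement. The backlog size and the incoming edit rate are
made-up numbers for illustration, and a constant incoming rate is a
simplifying assumption, so treat the result as an order-of-magnitude
picture rather than a measurement.

```python
def catch_up_seconds(backlog_edits, throughput, incoming_rate):
    """Seconds needed to drain a backlog while new edits keep arriving.

    Returns None when the updater is not faster than the incoming stream,
    i.e. when the backlog would never shrink.
    """
    spare_capacity = throughput - incoming_rate
    if spare_capacity <= 0:
        return None
    return backlog_edits / spare_capacity


# Hypothetical values, chosen only to illustrate the arithmetic.
BACKLOG_EDITS = 36_000   # assumed number of edits waiting to be applied
INCOMING_RATE = 8.0      # assumed steady edit rate on Wikidata, edits/sec

old = catch_up_seconds(BACKLOG_EDITS, 10.0, INCOMING_RATE)  # old updater
new = catch_up_seconds(BACKLOG_EDITS, 88.0, INCOMING_RATE)  # new updater

print(f"old updater: {old / 60:.1f} min")  # old updater: 300.0 min
print(f"new updater: {new / 60:.1f} min")  # new updater: 7.5 min
```

The point of the sketch is that catch-up time depends on the spare capacity
(throughput minus incoming edits), not on throughput alone, which is why an
updater running close to the incoming rate takes so long to recover from a
spike.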

The new Streaming Updater didn’t magically resolve all the issues, and
there are still two main ones that we need to address:

   - Data loss - with the old updater, we literally lost updates without
     any way of knowing about it other than user feedback - [3] [4]. This
     is a really bad way of finding out about issues. The reconciliation
     mechanism in the new Streaming Updater works better, but it can still
     miss data, especially due to late events or eventual consistency, as
     mentioned before. One thing that changed, however, is that the new
     updater has better inconsistency/late event reporting, which allows us
     to build a subsystem around it to reconcile the data. More information
     here - [5].

   - Blazegraph instability - no matter how fast and stable the new updater
     might be, Blazegraph is still the last node in the process. That means
     that the whole update process will be affected by Blazegraph’s
     instability, which will in turn produce lag. One of the most common
     reasons for that instability is a so-called “GC death spiral”. A
     server in that state won’t answer any queries (which is a problem in
     itself), and even after a restart the lag will be high for some time.
     We are investigating a solution that can help us with this - [6].
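
The idea behind the late event reporting mentioned in the data loss point
can be pictured with a small Python sketch. This is not the updater's
actual code, and it does not reflect how the real pipeline defines or
detects late events. It treats an event as late when its revision is not
newer than one already seen for the same entity, which is a deliberate
simplification. `Event`, `LateEventTracker` and `needs_reconcile` are
hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    entity_id: str   # e.g. "Q42"
    revision: int    # revision number carried by the edit event


@dataclass
class LateEventTracker:
    # Highest revision applied so far, per entity.
    latest: dict = field(default_factory=dict)
    # Entities that saw a late event and should be re-checked later.
    needs_reconcile: set = field(default_factory=set)

    def accept(self, event: Event) -> bool:
        """Apply the event if it is newer, otherwise report it as late."""
        seen = self.latest.get(event.entity_id, -1)
        if event.revision <= seen:
            # Record the problem instead of silently dropping it.
            self.needs_reconcile.add(event.entity_id)
            return False
        self.latest[event.entity_id] = event.revision
        return True


tracker = LateEventTracker()
stream = [Event("Q42", 100), Event("Q42", 102), Event("Q42", 101), Event("Q1", 7)]
for ev in stream:
    tracker.accept(ev)

print(sorted(tracker.needs_reconcile))  # ['Q42']
```

What matters here is the reporting part: the problematic event leaves a
trace, so a separate reconciliation step has something to work from,
instead of relying on users to notice missing data.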

I hope that answers at least some of the concerns already raised. Rest
assured that we are working on many more improvements to the experience
than just the updater, all of which are, as always, visible on our
backlog board ([7]) and workboard ([8]).

Any and all feedback welcome!

Regards,

Zbyszko

[1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=…

[2] https://grafana-rw.wikimedia.org/d/yCBd7Tdnk/wdqs-lag-slo

[3] https://phabricator.wikimedia.org/T272120

[4] https://phabricator.wikimedia.org/T291609

[5] https://phabricator.wikimedia.org/T279541

[6] https://phabricator.wikimedia.org/T293862

[7] https://phabricator.wikimedia.org/tag/wikidata-query-service/

[8] https://phabricator.wikimedia.org/project/view/1227/

-- 

Zbyszko Papierski (He/Him)

Senior Software Engineer

Wikimedia Foundation <https://wikimediafoundation.org/>