Hello!
I’d like to provide a bit more background and summarize our work on the new WDQS updater from a technical perspective.
It has been common knowledge that the old updater had its issues. The main ones among them:
Low throughput, which often caused huge spikes of lag that were very hard to recover from ([1] is a nice example).
Reliance on Blazegraph to reconcile the data - in Blazegraph, reads affect writes and vice versa, which quite often caused cascading failures in both update latency and query performance.
Ineffective handling of eventual consistency - this was one of the reasons for missing data in WDQS. What is worse, we had very low visibility into what went missing.
We’ll be publishing a series of blog posts that will provide a more in-depth description of the architecture and the challenges during development - stay tuned!
In the meantime, I want to explain a few things about the new updater:
Higher best-case lag is the result of our decision to trade low latency for high consistency - considering the data we lost with the old updater, we think this approach is better in our situation. We would rather have a complete data set than a faster, incomplete one. To make sure that we keep the lag manageable, we introduced an SLO [2] and will introduce alerting to keep the lag under 10 minutes.
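To make that threshold concrete, here’s a minimal sketch (in Python, with hypothetical names - the real alerting is defined in our monitoring stack, see the SLO dashboard [2]) of what the lag check boils down to:

    from datetime import datetime, timedelta, timezone

    # The 10-minute target mentioned above; everything else here is illustrative.
    SLO_THRESHOLD = timedelta(minutes=10)

    def lag_exceeds_slo(last_applied_event_time: datetime) -> bool:
        """Return True if the newest update applied to the triple store is
        older than the SLO threshold, i.e. WDQS is lagging behind Wikidata."""
        lag = datetime.now(timezone.utc) - last_applied_event_time
        return lag > SLO_THRESHOLD

    # An update applied 12 minutes ago breaches the 10-minute target.
    print(lag_exceeds_slo(datetime.now(timezone.utc) - timedelta(minutes=12)))  # True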
Data is reconciled within the pipeline, which puts dramatically less load on Blazegraph. This should help with updates, which was the goal, but it also positively affects query engine stability.
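To illustrate the idea in very simplified, hypothetical form (the actual pipeline is of course more involved): the diff between two revisions of an entity is computed upstream, so Blazegraph only receives a ready-made set of deletes and inserts and is never queried on the write path.

    # Hypothetical sketch: compute the patch that moves the store from the old
    # revision to the new one without ever consulting the store itself.
    def diff_revisions(old_triples: set, new_triples: set):
        to_delete = old_triples - new_triples
        to_insert = new_triples - old_triples
        return to_delete, to_insert

    old = {"Q42 P31 Q5", "Q42 P69 Q691283"}  # simplified triples of revision N
    new = {"Q42 P31 Q5", "Q42 P69 Q1077"}    # simplified triples of revision N+1
    deletes, inserts = diff_revisions(old, new)
    print(deletes)  # {'Q42 P69 Q691283'}
    print(inserts)  # {'Q42 P69 Q1077'}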
As we previously mentioned in the general announcement, the difference in throughput is substantial (10 edits/sec vs 88 edits/sec), which means much faster catch-up and more room for Wikidata to grow. The new updater can be scaled up even further if necessary.
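Some back-of-the-envelope math shows what this means for catching up after an outage (the incoming edit rate below is an assumed example, not a measured Wikidata figure):

    # Edits accumulated during a hypothetical 1-hour outage, assuming an
    # average incoming rate of 8 edits/sec.
    incoming_rate = 8
    backlog = 3600 * incoming_rate

    for throughput in (10, 88):  # old vs new updater, edits/sec
        drain_rate = throughput - incoming_rate  # net edits/sec removed from backlog
        print(f"{throughput} edits/sec -> caught up in {backlog / drain_rate / 60:.0f} min")

    # 10 edits/sec -> caught up in 240 min
    # 88 edits/sec -> caught up in 6 min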
The new Streaming Updater didn’t magically resolve all the issues, and there are still two main ones that we need to address:
Data loss - while the reconciliation mechanism works better than the old updater’s, we have lost updates without any way of knowing about it other than user feedback - [3] [4]. This is a really bad way of finding out about issues. The new Streaming Updater can still miss data, especially due to late events or eventual consistency, as mentioned before. What has changed, however, is that the new updater reports inconsistencies and late events much better, which allows us to build a subsystem around it to reconcile the data. More information here - [5].
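As a simplified, hypothetical sketch of that concept (names and structure are illustrative, not our actual Flink implementation): events arriving behind the stream’s watermark are routed to a side channel instead of disappearing, so a reconciliation job can later repair the affected entities.

    # Late events become visible and repairable instead of silently lost.
    def route_event(event, event_time, watermark, main_queue, late_queue):
        if event_time >= watermark:
            main_queue.append(event)   # normal update path
        else:
            late_queue.append(event)   # reported and handed to reconciliation

    main, late = [], []
    route_event("rev 1001 of Q42", event_time=105.0, watermark=100.0,
                main_queue=main, late_queue=late)  # on time
    route_event("rev 950 of Q42", event_time=42.0, watermark=100.0,
                main_queue=main, late_queue=late)  # late -> reconciliation
    print(late)  # ['rev 950 of Q42']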
Blazegraph instability - no matter how fast and stable the new updater is, Blazegraph is still the last node in the process. That means the whole update process is affected by Blazegraph’s instability, which in turn produces lag. One of the most common causes of that instability is the so-called “GC death spiral”. A server in that state won’t answer any queries (a problem in itself), and even after a restart its lag stays high for some time. We are investigating a solution that can help us with this - [6].
I hope that answers at least some of the concerns already raised. Rest assured that we are working on many more improvements than just the updater, all of which are, as always, visible on our backlog board ([7]) and workboard ([8]).
Any and all feedback welcome!
Regards,
Zbyszko
[2] https://grafana-rw.wikimedia.org/d/yCBd7Tdnk/wdqs-lag-slo
[3] https://phabricator.wikimedia.org/T272120
[4] https://phabricator.wikimedia.org/T291609
[5] https://phabricator.wikimedia.org/T279541
[6] https://phabricator.wikimedia.org/T293862
[7] https://phabricator.wikimedia.org/tag/wikidata-query-service/
[8] https://phabricator.wikimedia.org/project/view/1227/
Zbyszko Papierski (He/Him)
Senior Software Engineer