Hello all!
Here is a short update on what we have been doing around WDQS lately.
The update lag over the last 30 days has been slightly better [1]. We have not done anything specific to improve it, nor analyzed why it was less problematic lately. My guess is that it is a combination of luck and the self-throttling of edits based on the WDQS lag exposed through the Wikidata API.
We are now collecting more metrics from the WDQS updater [2] and exposing them through a new dashboard [3]. We are also collecting queries for analysis. Our hope is that digging into those queries (when we have time) will allow us to discover patterns of queries that might be better served by a solution other than Blazegraph.
We have loaded the Wikidata dumps into Hadoop. This allows us to run analyses that would not be possible with Blazegraph. For example, we ran an analysis of the usage of common qualifiers for “unknown value” [4].
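To give a feel for that kind of analysis: in the Wikidata RDF dumps, an “unknown value” shows up as a blank node object, so counting how often each qualifier property is used with an unknown value amounts to counting qualifier triples whose object is a blank node. This is only a conceptual sketch over a few hand-written N-Triples lines (the sample data and regex are illustrative); the real analysis ran over the full dump in Hadoop.

```python
# Conceptual sketch: count how often each qualifier property is used
# with an "unknown value" (represented as a blank node in the RDF dump).
# The sample lines below are illustrative, not taken from a real dump.
import re
from collections import Counter

SAMPLE = """\
<http://www.wikidata.org/entity/statement/s1> <http://www.wikidata.org/prop/qualifier/P1319> _:b0 .
<http://www.wikidata.org/entity/statement/s2> <http://www.wikidata.org/prop/qualifier/P582> _:b1 .
<http://www.wikidata.org/entity/statement/s3> <http://www.wikidata.org/prop/qualifier/P582> "2020-01-01"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
"""

# Match a subject, a qualifier predicate (pq:Pnnn), and an object.
TRIPLE = re.compile(
    r'^(\S+)\s+<http://www\.wikidata\.org/prop/qualifier/(P\d+)>\s+(\S+)')

def unknown_value_qualifiers(lines):
    counts = Counter()
    for line in lines:
        m = TRIPLE.match(line)
        # A blank-node object (_:...) on a qualifier predicate means
        # the qualifier's value is "unknown value".
        if m and m.group(3).startswith("_:"):
            counts[m.group(2)] += 1
    return counts

print(unknown_value_qualifiers(SAMPLE.splitlines()))
```

On the full dump the same counting logic would run as a distributed job rather than a single-process loop, but the per-triple test is the same.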
There is an ongoing discussion about the use of blank nodes [5]. Blank nodes are problematic for our updater, as finding them is by design a non-trivial operation. The discussion is still ongoing, but it is likely that we will need to introduce a breaking change in the way we use blank nodes. We will provide an update once we know more precisely what we need to do and have a migration path for the use cases that rely on them.
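To illustrate why blank nodes are awkward for an updater: a blank node has no stable identifier you can delete by name, so the triples attached to it have to be found by traversing the graph from a known node. Here is a minimal sketch of that traversal, with triples as plain (s, p, o) tuples and all identifiers hypothetical; it is not the actual updater logic.

```python
# Sketch of why blank nodes complicate updates: they have no stable,
# dereferenceable identifier, so the triples "belonging" to a statement
# must be found by traversal rather than deleted by name.
# Triples are plain (s, p, o) tuples; identifiers are hypothetical.

def is_blank(node):
    return node.startswith("_:")

def triples_to_delete(graph, subject):
    """Collect the subject's triples plus the closure of any blank
    nodes they reference (those triples would be orphaned otherwise)."""
    to_visit = [subject]
    seen = set()
    result = []
    while to_visit:
        node = to_visit.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in graph:
            if s == node:
                result.append((s, p, o))
                if is_blank(o):
                    to_visit.append(o)  # follow into the blank node
    return result

graph = [
    ("wd:Q42", "p:P569", "s:Q42-abc"),
    ("s:Q42-abc", "psv:P569", "_:b0"),   # "unknown value" blank node
    ("_:b0", "rdf:type", "wdno:P569"),
    ("wd:Q1", "p:P31", "s:Q1-xyz"),      # unrelated statement, untouched
]

for t in triples_to_delete(graph, "s:Q42-abc"):
    print(t)
```

Skolemizing blank nodes (giving them stable IRIs) is one way to avoid this traversal entirely, which is part of why a breaking change is on the table.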
We are now focused on a complete rewrite of the WDQS Updater [6]. We are investigating Flink [7] as a stream processing solution. This should allow us to both significantly simplify the update process and make it much more efficient. There is still a lot of work to be done before this is complete, but we think we have a good path forward.
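Conceptually, a streaming updater consumes an ordered stream of revision events, computes the RDF diff for each revision, and batches diffs into update operations. The sketch below shows that shape in plain Python; it is not the actual design, and all names and event shapes are hypothetical. In the real rewrite, this pipeline would run as a Flink job.

```python
# Conceptual sketch of a streaming updater: revision events in,
# RDF diffs out, diffs windowed into batches before being applied.
# Names and event shapes are hypothetical, not the actual design.
from dataclasses import dataclass, field

@dataclass
class Diff:
    entity: str
    added: set = field(default_factory=set)
    removed: set = field(default_factory=set)

def compute_diff(entity, old_triples, new_triples):
    """Diff two revisions of an entity's RDF representation."""
    return Diff(entity, new_triples - old_triples, old_triples - new_triples)

def batch(diffs, size):
    """Group diffs into fixed-size batches, roughly as a stream job
    might window them before issuing SPARQL updates."""
    chunk = []
    for d in diffs:
        chunk.append(d)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Simulated revision events: (entity, old triple set, new triple set).
events = [
    ("Q1", {("Q1", "P31", "Q5")},
           {("Q1", "P31", "Q5"), ("Q1", "P21", "Q6581097")}),
    ("Q2", {("Q2", "P31", "Q4167410")}, set()),   # entity deleted
    ("Q3", set(), {("Q3", "P31", "Q5")}),         # entity created
]

diffs = (compute_diff(e, old, new) for e, old, new in events)
for b in batch(diffs, 2):
    print([(d.entity, len(d.added), len(d.removed)) for d in b])
```

The win over the current updater is that diffs are computed from the event stream instead of re-fetching and re-materializing full entities on every change.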
Misc:
* some aliases for Wikidata have been deployed [8]
As always, thank you for your patience!
Guillaume
[1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&...
[2] https://phabricator.wikimedia.org/T239908
[3] https://grafana.wikimedia.org/d/dSksY08Zk/wikidata-query-service-updater?org...
[4] https://phabricator.wikimedia.org/T246238
[5] https://phabricator.wikimedia.org/T244341
[6] https://phabricator.wikimedia.org/T244590
[7] https://flink.apache.org/
[8] https://phabricator.wikimedia.org/T222321
You'll love Flink. I'd encourage using Apache Beam on top of Flink and taking advantage of its unified API. That way you can use Java, Python, AND Go (something that will be important for your teams).
https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-t...