Better update granularity would probably help and sounds like a good priority.
It is (still) unclear to me as a tool writer whether I can do anything.
For instance, it is not clear to me whether the parallel SPARQL queries
fired when a user visits a Scholia page matter for the load on
WDQS (not likely) or whether their contribution is minuscule (likely).
As far as I understand,
much of the query load comes via Magnus. I presume another big chunk is
from the Gene Wiki people.
If robotic queries are a source of problems, then tool writers/users can
do something. But fixing issues would require the WMF to tell us whether
they really are a problem and what the specific problems are.
best regards
Finn
On 07/02/2020 14:32, Guillaume Lederrey wrote:
Hello all!
First of all, my apologies for the long silence. We need to do better in
terms of communication. I'll try my best to send a monthly update from
now on. Keep me honest, remind me if I fail.
First, we had a security incident at the end of December, which forced
us to move from our Kafka-based update stream back to the RecentChanges
poller. The details are still private, but you will be able to get the
full story soon on Phabricator [1]. The RecentChanges poller is less
efficient, and this is leading to high update lag again (just when we
thought we had things slightly under control). We tried to mitigate this
by improving the parallelism in the updater [2], which helped a bit, but
not as much as we need.
Another attempt to get update lag under control is to apply back
pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
[6]. This is obviously less than ideal (at least as long as WDQS updates
are lagging as often as they are), but it does allow the service to
recover from time to time. We probably need to iterate on this, provide
better granularity, and differentiate better between operations that
have an impact on update lag and those that don't.
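To make this concrete for tool writers: here is a minimal sketch of a
bot that respects maxlag, assuming the standard MediaWiki API behaviour,
where requests carrying a maxlag parameter are rejected with error code
"maxlag" (plus a Retry-After header) whenever the reported lag exceeds
the given value:

    import time
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def api_post(session, params, max_retries=5):
        # POST to the API, backing off whenever the servers report lag.
        params = {**params, "format": "json", "maxlag": 5}
        for _ in range(max_retries):
            resp = session.post(API, data=params)
            data = resp.json()
            if data.get("error", {}).get("code") != "maxlag":
                return data
            # Lagged: wait as advised by the server, then retry.
            time.sleep(int(resp.headers.get("Retry-After", 5)))
        raise RuntimeError("servers still lagged after several retries")

With WDQS lag feeding into maxlag, a well-behaved bot like this slows
its edits down automatically whenever the query service falls behind.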
On the slightly better news side, we now have a much better
understanding of the update process and of its shortcomings. The current
process does a full diff between each updated entity and what we have in
Blazegraph. Even if a single triple needs to change, we still read tons
of data from Blazegraph. While this approach is simple and robust, it is
obviously not efficient. We need to rewrite the updater to take a more
event-streaming / reactive approach, and only work on the actual
changes. This is a big chunk of work, almost a complete rewrite of the
updater, and we need a new solution to stream changes with guaranteed
ordering (something that our Kafka queues don't offer). This is where we
are focusing our energy at the moment, as this looks like the best
option to improve the situation in the medium term. This change will
probably have some functional impacts [3].
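A rough sketch of the difference between the two strategies (the
wdqs_* and render_* helpers are hypothetical, not the actual updater's
API; this only illustrates the idea):

    def update_by_full_diff(entity_id):
        # Current approach: read *all* triples for the entity from
        # Blazegraph, diff against freshly rendered RDF, apply the delta.
        old = wdqs_read_all_triples(entity_id)  # expensive for big entities
        new = render_entity_rdf(entity_id)
        wdqs_delete(old - new)
        wdqs_insert(new - old)

    def update_from_change_event(event):
        # Streaming approach: an ordered change event already carries the
        # added/removed triples, so no full read of the entity is needed.
        wdqs_delete(event.removed_triples)
        wdqs_insert(event.added_triples)

The catch is the ordering guarantee: applied out of order, the deltas in
the second approach would corrupt the graph, which is why a new change
streaming solution is needed.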
Some misc things:
We have done some work to get better metrics and a better understanding
of what's going on: collecting more metrics during the update [4],
loading RDF dumps into Hadoop for further analysis [5], and better
logging of SPARQL requests. We are deferring the deeper analysis until
we are in a more stable situation regarding update lag.
We have a new team member working on WDQS. He is still ramping up, but
we should have a bit more capacity from now on.
Some longer term thoughts:
Keeping all of Wikidata in a single graph is most probably not going to
work in the long term. We have not found examples of public SPARQL
endpoints with > 10 B triples, and there is probably a good reason for
that. We will probably need to split the graphs at some point. We don't
know how yet (that's why we loaded the dumps into Hadoop; that might
give us some more insight). We might expose a subgraph with only truthy
statements. Or have language-specific graphs, with only
language-specific labels. Or something completely different.
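To make the "truthy" option concrete: in the Wikidata RDF model, truthy
statements use the wdt: prefix, while fully qualified statements go
through p:/ps: and carry ranks, qualifiers and references. A truthy-only
subgraph could answer the first query below but not the second (purely
illustrative; nothing about a split is decided):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    # Truthy: direct entity-to-value triples (capital of Germany).
    TRUTHY = "SELECT ?capital WHERE { wd:Q183 wdt:P36 ?capital }"

    # Full statements: needed for ranks, qualifiers or references.
    FULL = """SELECT ?capital ?rank WHERE {
      wd:Q183 p:P36 ?stmt .
      ?stmt ps:P36 ?capital ; wikibase:rank ?rank .
    }"""

    for query in (TRUTHY, FULL):
        r = requests.get(ENDPOINT,
                         params={"query": query, "format": "json"})
        print(r.json()["results"]["bindings"])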
Keeping WDQS / Wikidata as open as they are at the moment might not be
possible in the long term. We need to think about if / how we want to
implement some form of authentication and quotas, potentially increasing
quotas for some use cases, but keeping them strict for others. Again, we
don't know what this will look like, but we're thinking about it.
What you can do to help:
Again, we're not sure. Of course, reducing the load (both in terms of
edits on Wikidata and of reads on WDQS) will help. But not using those
services makes them useless.
We suspect that some use cases are more expensive than others (a single
property change to a large entity requires a comparatively insane
amount of work on the WDQS side, since the whole entity is diffed).
We'd like to have real data on the cost of various operations, but we
only have guesses at this point.
If you've read this far, thanks a lot for your engagement!
Have fun!
Guillaume
[1] https://phabricator.wikimedia.org/T241410
[2] https://phabricator.wikimedia.org/T238045
[3] https://phabricator.wikimedia.org/T244341
[4] https://phabricator.wikimedia.org/T239908
[5] https://phabricator.wikimedia.org/T241125
[6] https://phabricator.wikimedia.org/T221774
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET