Re: [Wikidata] Status of Wikidata Query Service

7 Feb 2020

On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann <marco.neumann(a)gmail.com>
wrote:

...
  Thank you Guillaume, when do you expect a public update on the security
 incident [1]? Is any of our personal and private data (email, password etc.)
 affected?

It should be made public in the next few days. I'm not going to go into any
more details until this is made public, but overall, don't worry too much.

...
  best,
 Marco

 [1] https://phabricator.wikimedia.org/T241410

 On Fri, Feb 7, 2020 at 1:33 PM Guillaume Lederrey <glederrey(a)wikimedia.org>
 wrote:

  Hello all!

 First of all, my apologies for the long silence. We need to do better in
 terms of communication. I'll try my best to send a monthly update from now
 on. Keep me honest, remind me if I fail.

 First, we had a security incident at the end of December, which forced us
 to move from our Kafka-based update stream back to the RecentChanges
 poller. The details are still private, but you will be able to get the full
 story soon on Phabricator [1]. The RecentChanges poller is less efficient
 and this is leading to high update lag again (just when we thought we had
 things slightly under control). We tried to mitigate this by improving the
 parallelism in the updater [2], which helped a bit, but not as much as we
 need.

 Another attempt to get update lag under control is to apply back pressure
 on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is
 obviously less than ideal (at least as long as WDQS updates are lagging as
 often as they are), but does allow the service to recover from time to
 time. We probably need to iterate on this, provide better granularity, and
 differentiate better between operations that have an impact on update lag
 and those which don't.
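
For bot authors wondering what this back pressure means on the client side,
here is a minimal sketch of an edit loop that honours maxlag. It assumes the
MediaWiki API convention of sending a maxlag parameter and getting back an
error with code "maxlag" when the servers are too lagged. The `send` callable,
the function name and the delay values are hypothetical, chosen only for
illustration; this is not the code of any particular bot framework.

```python
import time
from typing import Any, Callable, Dict


def call_with_maxlag(
    send: Callable[[Dict[str, Any]], Dict[str, Any]],
    params: Dict[str, Any],
    maxlag: int = 5,
    max_attempts: int = 5,
    base_delay: float = 5,
    max_delay: float = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Send an API request and back off while the server reports maxlag.

    `send` is a hypothetical callable that performs the HTTP request and
    returns the decoded JSON response as a dict.
    """
    # Ask the server to refuse the request if its lag exceeds `maxlag` seconds.
    request = dict(params, maxlag=maxlag)
    for attempt in range(max_attempts):
        response = send(request)
        error = response.get("error", {})
        if error.get("code") != "maxlag":
            # Either success or an unrelated error: hand it back to the caller.
            return response
        # The server is lagged: wait with exponential backoff, then retry.
        sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError(f"server still lagged after {max_attempts} attempts")
```

With the defaults above, a client that hits the maxlag error twice waits 5 and
then 10 seconds before its third attempt. A real client would probably also
look at the Retry-After header of the response rather than rely only on fixed
delays.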

 On the slightly better news side, we now have a much better understanding
 of the update process and of its shortcomings. The current process does a
 full diff between each updated entity and what we have in Blazegraph. Even
 if a single triple needs to change, we still read tons of data from
 Blazegraph. While this approach is simple and robust, it is obviously not
 efficient. We need to rewrite the updater to take a more event streaming /
 reactive approach, and only work on the actual changes. This is a big chunk
 of work, almost a complete rewrite of the updater, and we need a new
 solution to stream changes with guaranteed ordering (something that our
 Kafka queues don't offer). This is where we are focusing our energy at the
 moment, as this looks like the best option to improve the situation in the
 medium term. This change will probably have some functional impacts [3].
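
To make the difference between the two approaches concrete, here is a toy
sketch that models a graph as a Python set of (subject, predicate, object)
tuples. It is an illustration only, not the actual updater code: the function
names and the event format are hypothetical. The full diff has to read every
stored triple of the entity, while the event-based version only touches the
triples that changed, but depends on the events arriving in order.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def full_diff(stored: Set[Triple], current: Set[Triple]):
    """Compare everything stored for an entity with its new state.

    Cost grows with the size of the entity, even if one triple changed.
    Returns (triples to add, triples to remove).
    """
    return current - stored, stored - current


def apply_events(graph: Set[Triple], events: Iterable[Tuple[str, Triple]]) -> None:
    """Apply a stream of ("add" | "remove", triple) events in place.

    Cost grows with the number of changes only. Events must be applied in
    the order they were produced, otherwise an add and a remove of the same
    triple can end up in the wrong final state.
    """
    for op, triple in events:
        if op == "add":
            graph.add(triple)
        elif op == "remove":
            graph.discard(triple)
        else:
            raise ValueError(f"unknown operation: {op!r}")
```

For a label change on an entity, both approaches end up with the same graph,
but the second one never has to read the unchanged triples back from the
store.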

 Some misc things:

 We have done some work to get better metrics and a better understanding of
 what's going on, from collecting more metrics during the update [4] to
 loading RDF dumps into Hadoop for further analysis [5] and better logging
 of SPARQL requests. We are not focusing on this analysis until we are in a
 more stable situation regarding update lag.

 We have a new team member working on WDQS. He is still ramping up, but we
 should have a bit more capacity from now on.

 Some longer term thoughts:

 Keeping all of Wikidata in a single graph is most probably not going to
 work long term. We have not found examples of public SPARQL endpoints
 with > 10 B triples and there is probably a good reason for that. We will
 probably need to split the graphs at some point. We don't know how yet
 (that's why we loaded the dumps into Hadoop, that might give us some more
 insight). We might expose a subgraph with only truthy statements. Or have
 language specific graphs, with only language specific labels. Or something
 completely different.

 Keeping WDQS / Wikidata as open as they are at the moment might not be
 possible in the long term. We need to think if / how we want to implement
 some form of authentication and quotas. Potentially increasing quotas for
 some use cases, but keeping them strict for others. Again, we don't know
 what this will look like, but we're thinking about it.

 What you can do to help:

 Again, we're not sure. Of course, reducing the load (both in terms of
 edits on Wikidata and of reads on WDQS) will help. But not using those
 services makes them useless.

 We suspect that some use cases are more expensive than others (a single
 property change to a large entity will require a comparatively insane
 amount of work to update it on the WDQS side). We'd like to have real data
 on the cost of various operations, but we only have guesses at this point.

 If you've read this far, thanks a lot for your engagement!

   Have fun!

       Guillaume

 [1] https://phabricator.wikimedia.org/T241410
 [2] https://phabricator.wikimedia.org/T238045
 [3] https://phabricator.wikimedia.org/T244341
 [4] https://phabricator.wikimedia.org/T239908
 [5] https://phabricator.wikimedia.org/T241125
 [6] https://phabricator.wikimedia.org/T221774

 --
 Guillaume Lederrey
 Engineering Manager, Search Platform
 Wikimedia Foundation
 UTC+1 / CET
 _______________________________________________
 Wikidata mailing list
 Wikidata(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata

 --

 ---
 Marco Neumann
 KONA


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
