Hoi,
I find it interesting that some do not need to see the result of what they
do. What it tells me is that they deal with collections, large amounts of
data that, like stamp collections, are dumped into Wikidata. What is the
point of that? I consider them prime examples of what can be set aside.
I do need to know what the effects are of what I do. I add single items,
link them to other items like awards and papers, and use tools like Scholia
to consider the effects. I blog regularly, and typically it is based on the
results that I see of what I do. It is of profound importance to people who
edit like me that there is no lag.
Another thing to consider is that, given the bias in our projects, the
worst thing we can do is make ghettos of everything non-English. It also
totally destroys my approach, where I have Listeria lists about Africa so
that we can follow what is known about Africa in Wikidata. [1]
Again, what I notice is that the underperformance, the stagnation of
Wikidata, is considered only a technical issue. It has a huge effect on how
Wikidata may be used; it is detrimental to all Wikimedia projects, and
therefore it deserves a reaction from the board and the director of the
Wikimedia Foundation.
Thanks,
GerardM
[1]
On Mon, 10 Feb 2020 at 17:11, Amirouche Boubekki <
amirouche.boubekki(a)gmail.com> wrote:
Hello Guillaume,
On Fri, 7 Feb 2020 at 14:33, Guillaume Lederrey
<glederrey(a)wikimedia.org> wrote:
> Hello all!
>
> First of all, my apologies for the long silence. We need to do better in
> terms of communication. I'll try my best to send a monthly update from
> now on. Keep me honest, remind me if I fail.
It would be nice to have some feedback on my grant request at:
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
or in one of the other threads on this very mailing list.
> Another attempt to get the update lag under control is to apply back
> pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
> [6]. This is obviously less than ideal (at least as long as WDQS updates
> are lagging as often as they are), but it does allow the service to
> recover from time to time. We probably need to iterate on this, provide
> better granularity, and differentiate better between operations that have
> an impact on update lag and those which don't.
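For bot authors reading along: respecting maxlag boils down to something
like the sketch below (Python; the lag threshold and retry budget are
illustrative values, not recommendations).

    import time
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def api_get(params, max_retries=5):
        # Send maxlag with every request; when the lag (which now folds in
        # the WDQS update lag) is too high, the API returns a "maxlag"
        # error instead of serving the request.
        params = dict(params, format="json", maxlag=5)
        for _ in range(max_retries):
            response = requests.get(API, params=params)
            data = response.json()
            if data.get("error", {}).get("code") != "maxlag":
                return data
            # The server is asking us to back off; honor Retry-After.
            time.sleep(int(response.headers.get("Retry-After", 5)))
        raise RuntimeError("lag never dropped, giving up")

    # Example: api_get({"action": "wbgetentities", "ids": "Q42"})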
> On the slightly better news side, we now have a much better
> understanding of the update process and of its shortcomings. The current
> process does a full diff between each updated entity and what we have in
> Blazegraph. Even if a single triple needs to change, we still read tons
> of data from Blazegraph. While this approach is simple and robust, it is
> obviously not efficient. We need to rewrite the updater to take a more
> event streaming / reactive approach, and only work on the actual changes.
Even when it is done, it will still be a short-term solution.
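To make the difference concrete, here is a toy sketch of the two
strategies (all names are hypothetical; this is not the actual updater
code):

    # Triples are (subject, predicate, object) tuples; sets make diffs easy.

    def full_diff_update(store, entity_id, new_triples):
        # Current approach: read back *everything* the store holds for the
        # entity, diff against the freshly rendered triples, apply the
        # delta. The read-back is what hurts on large entities.
        old_triples = store.triples_for(entity_id)  # hypothetical API
        store.delete(old_triples - new_triples)
        store.insert(new_triples - old_triples)

    def streaming_update(store, event):
        # Reactive approach: the change event itself carries the delta, so
        # nothing has to be read back just to compute a diff.
        store.delete(event.removed_triples)
        store.insert(event.added_triples)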
> This is a big chunk of work, almost a complete rewrite of the updater,
> and we need a new solution to stream changes with guaranteed ordering
> (something that our Kafka queues don't offer). This is where we are
> focusing our energy at the moment; this looks like the best option to
> improve the situation in the medium term. This change will probably have
> some functional impacts [3].
Guaranteed ordering in a multi-party distributed setting has no easy
solution, and apparently it is not provided by Kafka. For non-technical
readers, see
https://en.wikipedia.org/wiki/Two_Generals%27_Problem
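One nuance: Kafka does guarantee ordering within a single partition; what
is missing is a total order across partitions and topics. A common partial
workaround is to key events by entity ID, so that all edits to one entity
at least stay ordered relative to each other. A sketch using the
kafka-python package (the topic name is made up):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def publish_change(entity_id, payload):
        # Messages with the same key land in the same partition, and Kafka
        # preserves order within a partition, so per-entity ordering holds.
        # There is still no global order across entities.
        producer.send("wdqs-updates", key=entity_id.encode(), value=payload)

    # publish_change("Q42", b"...serialized diff...")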
> Some longer term thoughts:
>
> Keeping all of Wikidata in a single graph is most probably not going to
> work long term.
:(
> We have not found examples of public SPARQL endpoints with > 10 B
> triples, and there is probably a good reason for that.
Because Wikimedia is the only non-profit in the field?
> We will probably need to split the graphs at some point.
:(
> We don't know how yet
:(
> (that's why we loaded the dumps into Hadoop, that might give us some
> more insight).
:(
> We might expose a subgraph with only truthy statements. Or have
> language-specific graphs, with only language-specific labels.
:(
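For readers wondering what "truthy" means in practice: WDQS already serves
both models from one graph. Truthy statements use the wdt: prefix; the
full model goes through p: / ps: statement nodes and carries qualifiers
and references. A truthy-only graph would only answer queries of the first
kind below (a runnable sketch; both prefixes are predefined on WDQS):

    import requests

    WDQS = "https://query.wikidata.org/sparql"

    # Truthy: one triple per "best" statement, no qualifiers.
    truthy = "SELECT ?dob WHERE { wd:Q42 wdt:P569 ?dob }"

    # Full model: statement nodes, required for qualifiers and references.
    full = """SELECT ?dob WHERE {
        wd:Q42 p:P569 ?statement .
        ?statement ps:P569 ?dob .
    }"""

    for query in (truthy, full):
        r = requests.get(WDQS, params={"query": query},
                         headers={"Accept": "application/sparql-results+json"})
        print(r.json()["results"]["bindings"])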
> Or something completely different.
:)
> Keeping WDQS / Wikidata as open as they are at the moment might not be
> possible in the long term. We need to think if / how we want to
> implement some form of authentication and quotas.
With blacklists and whitelists, perhaps, but that is a huge undertaking
anyway.
> Potentially increasing quotas for some use cases, but keeping them
> strict for others. Again, we don't know what this will look like, but
> we're thinking about it.
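For what it's worth, one classic shape for such quotas is a per-client
token bucket, with bucket sizes varied by use case. A minimal sketch (the
client names and numbers are made up):

    import time

    class TokenBucket:
        """Allow `rate` requests per second, with bursts up to `capacity`."""

        def __init__(self, rate, capacity):
            self.rate, self.capacity = rate, capacity
            self.tokens, self.stamp = capacity, time.monotonic()

        def allow(self):
            # Refill proportionally to elapsed time, then spend one token.
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.stamp) * self.rate)
            self.stamp = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller would answer HTTP 429 / ask for back-off

    # e.g. generous for a vetted research tool, strict by default:
    quotas = {"research-tool": TokenBucket(50, 200),
              "default": TokenBucket(5, 10)}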
> What you can do to help:
>
> Again, we're not sure. Of course, reducing the load (both in terms of
> edits on Wikidata and of reads on WDQS) will help. But not using those
> services makes them useless.
What about making the lag part of the service? I mean, you could reload
WDQS periodically, for instance daily, and drop the updater altogether.
Who needs to see edits appear in WDQS the moment they are made in
Wikidata?
> We suspect that some use cases are more expensive than others (a single
> property change to a large entity will require a comparatively insane
> amount of work to update it on the WDQS side). We'd like to have real
> data on the cost of various operations, but we only have guesses at this
> point.
> If you've read this far, thanks a lot for your engagement!
>
> Have fun!
Will do.
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata