Why all the sad faces? The Semantic Web will be distributed after all, and
there is no need to stuff everything into one graph. It just requires us,
as an RDF community, to spend more time developing ideas around efficient
query distribution, and to focus on relationships and links in Wikidata
rather than building a monolithic database for humongous arbitrary joins
and table scans as a free-for-all. The slogan "sum of all human knowledge"
in one place should not be taken too literally.
I believe that is what Wikidata as a project already does in any event.
In fact, the SPARQL endpoint, as an extension to the Wikidata architecture
around Wikibase, should be used more proactively to connect multiple RDF
data providers for search. I would think that this is already a common use
case for Wikidata users who enrich their remote queries with Wikidata
data.
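To make the "enrich a remote query with Wikidata data" case concrete, here
is a minimal sketch that only builds the text of a federated SPARQL 1.1
query using the SERVICE keyword. The function name, the example class IRI
and the use of owl:sameAs as the linking property are assumptions for
illustration; a real dataset may link to Wikidata differently.

```python
# Sketch only: builds the text of a federated query; it does not send it.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public WDQS endpoint


def build_federated_query(local_class_iri: str, limit: int = 10) -> str:
    """Return a query meant to run on a remote endpoint and call out to WDQS.

    Assumption: the remote dataset links its resources to Wikidata items
    with owl:sameAs (hypothetical; adapt to the dataset at hand).
    """
    return f"""PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?wikidataItem ?label WHERE {{
  ?item a <{local_class_iri}> ;
        owl:sameAs ?wikidataItem .
  SERVICE <{WDQS_ENDPOINT}> {{
    ?wikidataItem rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }}
}}
LIMIT {limit}"""


if __name__ == "__main__":
    # http://example.org/Person is a placeholder class IRI.
    print(build_federated_query("http://example.org/Person"))
```

The local triple patterns narrow down the items first, so only the label
lookup is delegated to the Wikidata endpoint.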
All that said, it is quite an achievement to scale the Wikidata SPARQL
endpoint to where it is now. Congratulations to the team; I look forward
to seeing more of it in the future.
On Mon, Feb 10, 2020 at 4:11 PM Amirouche Boubekki <amirouche.boubekki(a)gmail.com> wrote:
Hello Guillaume,
On Fri, Feb 7, 2020 at 14:33, Guillaume Lederrey <glederrey(a)wikimedia.org> wrote:
Hello all!
First of all, my apologies for the long silence. We need to do better in
terms of communication. I'll try my best to send a monthly update from now
on. Keep me honest, remind me if I fail.
It would be nice to have some feedback on my grant request at:
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
or in one of the other threads on this same mailing list.
Another attempt to get update lag under control is to apply back pressure
on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This
is obviously less than ideal (at least as long as WDQS updates are lagging
as often as they are), but does allow the service to recover from time to
time. We probably need to iterate on this, provide better granularity, and
differentiate better between operations that have an impact on update lag
and those which don't.
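From the editing client's side, this back pressure could be handled
roughly as follows. This is a sketch under assumptions: the `send`
callable and the `retry_after` key are hypothetical stand-ins for a real
HTTP layer. The MediaWiki API does accept a `maxlag` parameter and answers
with the error code "maxlag" when the lag is too high, but the suggested
wait actually arrives in the Retry-After HTTP header.

```python
import time
from typing import Any, Callable, Dict


def call_with_maxlag(
    send: Callable[[Dict[str, Any]], Dict[str, Any]],
    params: Dict[str, Any],
    maxlag_seconds: int = 5,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Send an API request with maxlag set and back off while the server lags.

    `send` is a hypothetical transport: it takes the request parameters and
    returns the parsed JSON response, plus a "retry_after" value (seconds)
    copied from the Retry-After header when one was present.
    """
    request = dict(params, maxlag=maxlag_seconds)
    for _ in range(max_attempts):
        response = send(request)
        if response.get("error", {}).get("code") != "maxlag":
            return response  # success, or an error unrelated to lag
        # The server asked us to wait; default to 5 seconds if no hint given.
        sleep(response.get("retry_after", 5))
    raise RuntimeError("server still lagged after %d attempts" % max_attempts)
```

Well-behaved bots that wait like this are what lets the service recover
during a lag spike.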
> On the slightly better news side, we now have a much better
understanding of the update process and of its shortcomings. The current
process does a full diff between each updated entity and what we have in
Blazegraph. Even if a single triple needs to change, we still read tons of
data from Blazegraph. While this approach is simple and robust, it is
obviously not efficient. We need to rewrite the updater to take a more
event streaming / reactive approach, and only work on the actual changes.
Once it is done, it will still be only a short-term solution.
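The difference between the current full-diff approach and an event-driven
updater, as described in the quoted paragraph, can be sketched with plain
Python sets. This is an illustration only, not the actual WDQS updater:
the triple representation and both function names are assumptions.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), simplified


def full_diff(stored: Set[Triple], updated: Set[Triple]):
    """Full-diff style: needs ALL stored triples of the entity as input."""
    to_add = updated - stored
    to_remove = stored - updated
    return to_add, to_remove


def apply_change_event(
    store: Set[Triple], added: Iterable[Triple], removed: Iterable[Triple]
) -> None:
    """Event style: touches only the triples named in the change event."""
    store.difference_update(removed)
    store.update(added)


if __name__ == "__main__":
    stored = {("wd:Q1", "rdfs:label", "old"), ("wd:Q1", "wdt:P31", "wd:Q2")}
    updated = {("wd:Q1", "rdfs:label", "new"), ("wd:Q1", "wdt:P31", "wd:Q2")}
    print(full_diff(stored, updated))
```

Both paths end in the same state; the cost differs because `full_diff`
has to read every stored triple of the entity first, while
`apply_change_event` only needs the change itself.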
This is a big chunk of work, almost a complete rewrite of the updater, and
we need a new solution to stream changes with guaranteed ordering
(something that our Kafka queues don't offer). This is where we are
focusing our energy at the moment; this looks like the best option to
improve the situation in the medium term. This change will probably have
some functional impacts [3].
Guaranteed ordering in a multi-party distributed setting has no easy
solution, and apparently it is not provided by Kafka. For a
non-technical introduction, you can read
https://en.wikipedia.org/wiki/Two_Generals%27_Problem
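One common workaround on the consumer side is to buffer events until the
next expected one arrives. The sketch below assumes each event carries a
gap-free sequence number per stream, which is a hypothetical property and
not something the quoted message says the current pipeline provides. It
restores order locally; it does not solve the general problem.

```python
from typing import Any, Dict, List


class ReorderBuffer:
    """Release events strictly in sequence order, holding back early ones."""

    def __init__(self, first_seq: int = 0) -> None:
        self.next_seq = first_seq          # next sequence number to release
        self.pending: Dict[int, Any] = {}  # early arrivals, keyed by sequence

    def push(self, seq: int, event: Any) -> List[Any]:
        """Accept one event and return the events that are now deliverable."""
        if seq < self.next_seq:
            return []  # duplicate or already delivered: ignore it
        self.pending[seq] = event
        ready = []
        # Drain every consecutive event starting at the expected number.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

If a sequence number never arrives, everything after it stays in
`pending`, so a real consumer would also need a timeout or a way to
re-request the missing event.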
> Some longer term thoughts:
> Keeping all of Wikidata in a single graph is most probably not going to
work long term.
:(
We have not found examples of public SPARQL endpoints with > 10 B triples
and there is probably a good reason for that.
Because Wikimedia is the only non-profit in the field?
We will probably need to split the graphs at some point.
:(
We don't know how yet
:(
(that's why we loaded the dumps into Hadoop, that might give us some more
insight).
:(
We might expose a subgraph with only truthy statements. Or have
language-specific graphs, with only language-specific labels.
:(
Or something completely different.
:)
Keeping WDQS / Wikidata as open as they are at the moment might not be
possible in the long term. We need to think if / how we want to implement
some form of authentication and quotas.
With blacklists and whitelists, but this is huge anyway.
Potentially increasing quotas for some use cases, but keeping them strict
for others. Again, we don't know what this will look like, but we're
thinking about it.
> What you can do to help:
> Again, we're not sure. Of course, reducing the load (both in terms of
edits on Wikidata and of reads on WDQS) will help. But not using those
services makes them useless.
What about making the lag part of the service? I mean, you could
reload WDQS periodically, for instance daily, and drop the updater
altogether. Who needs to see the updates live in WDQS as soon as edits
are made in Wikidata?
We suspect that some use cases are more expensive than others (a single
property change to a large entity will require a comparatively insane
amount of work to update it on the WDQS side). We'd like to have real data
on the cost of various operations, but we only have guesses at this point.
> If you've read this far, thanks a lot for your engagement!
> Have fun!
Will do.
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata