The earth is not flat :)  

I appreciate all of your thoughts in this thread, Amirouche.
A Linked Data Fragments approach with a thick client seems to me to be one useful option to explore and benchmark.

Other possible thoughts: 
~ have some core, highly-used subgraphs within which queries are lightning fast.
~ give queriers the option to search only those fast subgraphs, and the option to set a quick query timeout.
~ give queriers quick estimates of how much load a query will impose.
~ set the default query timeout to be quite quick (while letting any user raise their default to some cap, just as we can set how many results we want to see on RC / history pages); a rough client-side sketch follows below.
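A rough sketch of the "quick default timeout" idea from the client side, assuming the public WDQS SPARQL endpoint at https://query.wikidata.org/sparql; the 5-second cutoff and the LIMIT are illustrative values, not anything the service enforces today:

    import requests

    WDQS = "https://query.wikidata.org/sparql"

    query = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .                               # instance of: house cat
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 100
    """

    try:
        # A strict client-side timeout stands in for the quick default timeout above.
        response = requests.get(
            WDQS,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "timeout-sketch/0.1 (example)"},
            timeout=5,
        )
        response.raise_for_status()
        for row in response.json()["results"]["bindings"]:
            print(row["item"]["value"], row.get("itemLabel", {}).get("value", ""))
    except requests.Timeout:
        print("Query exceeded the quick timeout; narrow it or raise your personal cap.")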

//S

On Mon, Feb 10, 2020 at 1:53 PM Amirouche Boubekki <amirouche.boubekki@gmail.com> wrote:
On Mon, Feb 10, 2020 at 6:23 PM, Marco Neumann <marco.neumann@gmail.com> wrote:
>
> why all the sad faces?

> the Semantic Web will be distributed after all

The Semantic Web is already distributed.

> and there is no need to stuff everything into one graph.

Putting everything into one graph, or if you prefer in one place, is the gist
of the idea of a library or an encyclopedia.

> it just requires us as an RDF community to spend more time developing ideas around efficient query distribution

Maybe. But that does not preclude the aggregation, or sum, of knowledge from happening.

> and focus on relationships and links in wikidata

As I wrote above, a distributed knowledge base is already the state
of things. I am not sure how to understand that part of the
sentence.

> rather than building a monolithic database

That is the gist of my proposal.  Without the ability to run Wikidata
at a small scale, WMF will fail at knowledge equity.

> for humongous arbitrary joins and table scans

I proposed something along the lines of
https://linkeddatafragments.org, also known as "thin server, thick
client", but I had no feedback :(

> as a free for all.

With that, I heartily agree.  Being able to downscale the Wikidata
infrastructure, and to make companies and institutions pay for the stream
of changes to apply to their local instances, would make things much
easier.

> The slogan "sum of all human knowledge" in one place should not be taken too literally.

I disagree.

>
> It is, I believe, what Wikidata as a project already does in any event. Actually, the SPARQL endpoint, as an extension to the Wikidata architecture around Wikibase, should be used more pro-actively to connect multiple RDF data providers for search.

Read my proposal at
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS

The title is misleading; I intended to change it to "Future-proof
Wikidata".  WDQS, or querying in general, is an integral part of Wikidata
and must not be merely an add-on.

> I would think that this is already a common use case for wikidata users who enrich their remote queries with wikidata data.

I do not understand.  Yes, people enrich Wikidata queries with their data. And?

> All that said it's quite an achievement to scale the wikidata SPARQL endpoint to where it is now.
> Congratulations to the team and I look forward to seeing more of it in the future.

Yes, I agree with that.  Congratulations!  I am very proud to be part
of the Wikimedia community.

The current WMF proposal is the one called "sharding"; see details at:

  https://en.wikipedia.org/wiki/Shard_(database_architecture)

It is not future-proof.  I have not done any analysis, but I bet that
most of the 2 TB of Wikidata is English, so even if you shard by
language, you will still end up with a gigantic graph.  Also, most of
the data is not specific to a natural language, so one cannot
possibly split the data by language.
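To turn that bet into a number, here is a quick sketch of how one could measure the language distribution of labels and descriptions in an N-Triples dump; the file name is hypothetical, and only language-tagged literals are counted, which is exactly the slice of the data that a split by language could touch:

    import collections
    import gzip
    import re

    DUMP = "wikidata-truthy.nt.gz"  # hypothetical local copy of a Wikidata N-Triples dump

    # Matches the language tag of literals such as "Douglas Adams"@en at the end of a triple.
    LANG_TAG = re.compile(r'"@([a-zA-Z-]+) \.\s*$')

    counts = collections.Counter()
    with gzip.open(DUMP, "rt", encoding="utf-8") as dump:
        for line in dump:
            match = LANG_TAG.search(line)
            if match:
                counts[match.group(1)] += 1

    # Share of language-tagged literals per language; identifiers, numbers, dates
    # and other untagged data are untouched by a split along language lines.
    total = sum(counts.values()) or 1
    for lang, count in counts.most_common(10):
        print(f"{lang}: {count} ({100 * count / total:.1f}%)")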

If WMF comes up with another sharding strategy, how will edits that
span multiple regions happen?

How will it make entering the wikidata party easier?

I dare to write in the open: it seems to me that we are witnessing
an "Earth is flat vs. Earth is not flat" kind of event.


Thanks for the reply!


> On Mon, Feb 10, 2020 at 4:11 PM Amirouche Boubekki <amirouche.boubekki@gmail.com> wrote:
>>
>> Hello Guillaume,
>>
>> On Fri, Feb 7, 2020 at 2:33 PM, Guillaume Lederrey
>> <glederrey@wikimedia.org> wrote:
>> >
>> > Hello all!
>> >
>> > First of all, my apologies for the long silence. We need to do better in terms of communication. I'll try my best to send a monthly update from now on. Keep me honest, remind me if I fail.
>> >
>>
>> It would be nice to have some feedback on my grant request at:
>>
>>   https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
>>
>> Or one of the other threads on the very same mailing list.
>>
>> > Another attempt to get update lag under control is to apply back pressure on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is obviously less than ideal (at least as long as WDQS updates are lagging as often as they are), but does allow the service to recover from time to time. We probably need to iterate on this, provide better granularity, differentiate better between operations that have an impact on update lag and those which don't.
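For readers who have not met maxlag: edit clients pass a maxlag parameter to the API and are expected to back off when the reported lag exceeds it, so folding the WDQS update lag into that number pushes back on bots automatically. A minimal sketch of a well-behaved client, assuming the standard MediaWiki maxlag behaviour; the edit payload itself is omitted:

    import time
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def call_with_maxlag(params, max_lag=5, retries=5):
        """Send an API request, backing off whenever the server reports too much lag."""
        params = {**params, "maxlag": max_lag, "format": "json"}
        for attempt in range(retries):
            response = requests.get(API, params=params, timeout=30)
            data = response.json()
            if data.get("error", {}).get("code") == "maxlag":
                # The server is lagged (now including WDQS lag); wait and retry.
                time.sleep(int(response.headers.get("Retry-After", 5)))
                continue
            return data
        raise RuntimeError("gave up: lag stayed above the threshold")

    # Read-only example; a real bot would send its edit parameters here instead.
    print(call_with_maxlag({"action": "query", "meta": "siteinfo"}))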
>> >
>> > On the slightly better news side, we now have a much better understanding of the update process and of its shortcomings. The current process does a full diff between each updated entity and what we have in blazegraph. Even if a single triple needs to change, we still read tons of data from Blazegraph. While this approach is simple and robust, it is obviously not efficient. We need to rewrite the updater to take a more event streaming / reactive approach, and only work on the actual changes.
>>
>> When it is done, it will still be a short-term solution.
>>
>> > This is a big chunk of work, almost a complete rewrite of the updater,
>>
>> > and we need a new solution to stream changes with guaranteed ordering (something that our kafka queues don't offer). This is where we are focusing our energy at the moment, this looks like the best option to improve the situation in the medium term. This change will probably have some functional impacts [3].
>>
>> Guaranteed ordering in a multi-party distributed setting has no easy
>> solution, and apparently it is not provided by Kafka.  For a
>> non-technical introduction, see
>> https://en.wikipedia.org/wiki/Two_Generals%27_Problem
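To make the ordering point concrete: Kafka only guarantees order within a single partition, so the usual workaround is to key every change by its entity ID, which gives per-entity ordering but still no global order across entities or data centres. A minimal sketch with the kafka-python client; the broker address and topic name are made up:

    from kafka import KafkaProducer

    # Hypothetical broker and topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # All revisions of Q42 share a key, hence a partition, hence arrive in order;
    # revisions of different entities may still interleave arbitrarily.
    for revision in (b"rev-1", b"rev-2", b"rev-3"):
        producer.send("wdqs-updates", key=b"Q42", value=revision)

    producer.flush()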
>>
>> > Some longer term thoughts:
>> >
>> > Keeping all of Wikidata in a single graph is most probably not going to work long term.
>>
>> :(
>>
>> > We have not found examples of public SPARQL endpoints with > 10 B triples and there is probably a good reason for that.
>>
>> Because Wikimedia is the only non-profit in the field?
>>
>> > We will probably need to split the graphs at some point.
>>
>> :(
>>
>> > We don't know how yet
>>
>> :(
>>
>> > (that's why we loaded the dumps into Hadoop, that might give us some more insight).
>>
>> :(
>>
>> > We might expose a subgraph with only truthy statements. Or have language-specific graphs, with only language-specific labels.
>>
>> :(
>>
>> > Or something completely different.
>>
>> :)
>>
>> > Keeping WDQS / Wikidata as open as they are at the moment might not be possible in the long term. We need to think if / how we want to implement some form of authentication and quotas.
>>
>> With blacklists and whitelists, but this is huge anyway.
>>
>> > Potentially increasing quotas for some use cases, but keeping them strict for others. Again, we don't know what this will look like, but we're thinking about it.
>>
>> > What you can do to help:
>> >
>> > Again, we're not sure. Of course, reducing the load (both in terms of edits on Wikidata and of reads on WDQS) will help. But not using those services makes them useless.
>>
>> What about making the lag part of the service?  I mean, you could
>> reload WDQS periodically, for instance daily, and drop the updater
>> altogether. Who needs to see updates live in WDQS as soon as edits
>> are made in Wikidata?
>>
>> > We suspect that some use cases are more expensive than others (a single property change to a large entity will require a comparatively insane amount of work to update it on the WDQS side). We'd like to have real data on the cost of various operations, but we only have guesses at this point.
>> >
>> > If you've read this far, thanks a lot for your engagement!
>> >
>> >   Have fun!
>> >
>>
>> Will do.
>>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>



--
Amirouche ~ https://hyper.dev

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


--
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266