On Wed, Jun 12, 2019 at 19:11, Stas Malyshev <smalyshev@wikimedia.org> wrote:
> Hi!

>>> So there needs to be some smarter solution, one that we'd be unlikely to
>>> develop in-house.
>
>> Big cat, small fish. As Wikidata continues to grow, it will have specific
>> needs, needs that are unlikely to be solved by off-the-shelf solutions.

> Here I think it's a good place to remind that we're not Google, and
> developing a new database engine in-house is probably a bit beyond our
> resources and budgets.

Today, the problem is not the same as the one MySQL, PostgreSQL, Blazegraph,
and OpenLink faced when they started working on their respective databases. See
below.
 
> Fitting an existing solution to our goals - sure, but developing something new
> of that scale is probably not going to happen.

It will.

>> FoundationDB and WiredTiger are used respectively at Apple (among other
>> companies) and in MongoDB since 3.2, all over the world. WiredTiger is also
>> used at Amazon.

> I believe they are, but I think for our particular goals we have to
> limit ourselves to a set of solutions that are a proven good match for
> our case.

See the other mail I just sent. We are at a turning point in database
engineering history. The most recent database systems are all built on top of
an ordered key-value store; see the Google Spanner paper [0].

Thanks to WiredTiger/MongoDB and Apple, such stores are readily available, in
widespread use, and fully open source. Only a few pieces are missing to make
one work in a fully backward-compatible way with WDQS (at scale).

[0] https://ai.google/research/pubs/pub39966
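To make the ordered key-value store argument concrete, here is a toy sketch of how a triple store can sit on top of an ordered key space: each triple is written under three key permutations (SPO, POS, OSP), so the common lookup patterns become prefix range scans. All names and the layout are illustrative only, not how FoundationDB, WiredTiger, or any real engine actually stores data:

```python
# Toy illustration: backing a triple store with an ordered key-value store.
# Each triple is indexed under three permutations so that any lookup with
# one unknown becomes a cheap prefix range scan over sorted keys.
import bisect

class ToyOKVS:
    """In-memory stand-in for an ordered key-value store (keys kept sorted)."""
    def __init__(self):
        self.keys = []

    def put(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            self.keys.insert(i, key)

    def scan_prefix(self, prefix):
        # Range scan: start at the first key >= prefix, stop when it no longer matches.
        i = bisect.bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i][:len(prefix)] == prefix:
            yield self.keys[i]
            i += 1

def add_triple(db, s, p, o):
    # Index the triple three times; the leading tag selects the permutation.
    db.put(("spo", s, p, o))
    db.put(("pos", p, o, s))
    db.put(("osp", o, s, p))

def objects(db, s, p):
    # ?o for a fixed subject and predicate: prefix scan on the SPO index.
    return [k[3] for k in db.scan_prefix(("spo", s, p))]

db = ToyOKVS()
add_triple(db, "Q42", "instance-of", "human")
add_triple(db, "Q42", "occupation", "writer")
print(objects(db, "Q42", "instance-of"))  # → ['human']
```

Real ordered key-value stores expose essentially the `put`/range-scan interface above (over byte strings rather than tuples); the missing pieces alluded to are things like SPARQL query translation and planning on top of these scans.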
 
>> That will be vendor lock-in for Wikidata and Wikimedia, along with all the
>> poor souls that try to interoperate with it.

> Since Virtuoso uses standard SPARQL, it won't be too much of a vendor
> lock-in, though of course the standard does not cover everything, so
> some corners differ between SPARQL engines.

There is a big chance that the same thing that happened with the web will
happen with RDF: one big player ends up owning all the implementations.
 
> This is why even migration between SPARQL engines, even excluding
> operational aspects, is non-trivial.

I agree.
 
> Of course, migration to any non-SPARQL engine would be an order of magnitude
> more disruptive, so right now we do not seriously consider doing that.

I also agree.
 
> As I already mentioned, there's a difference between "you can do it" and
> "you can do it efficiently". [...] The tricky part starts when you need to run
> millions of queries on a 10B-triple database. If your backend is not optimal
> for that task, it's not going to perform.

I already ran small benchmarks against Blazegraph. I will do more intensive
benchmarks using Wikidata (and reduce the requirements in terms of SSD).


Thanks for the reply.