Hi Amirouche,
On 12.06.19 14:07, Amirouche Boubekki wrote:
> So there needs to be some smarter solution, one that we'd be unlikely to develop in-house
Big cat, small fish. As Wikidata continues to grow, it will have specific needs,
needs that are unlikely to be solved by off-the-shelf solutions.
Are you suggesting to develop the database in-house?
Yes! At least part of it: the domain-specific part.
even MediaWiki uses MySQL
Yes, but not because of its technical merits. The same goes for PHP.
Historically, PHP and MySQL were easy to set up and easy to use
but otherwise difficult to work with. This is/was painful enough that
nowadays the go-to RDBMS is PostgreSQL, even if MySQL is still
very popular [0][1]. Those are technical reasons. Also, I agree that
the fact that MySQL had no ACID guarantees when it started in
1995 does not make it a bad choice nowadays.
> but one that has already been verified by industry experience and other deployments.
FoundationDB is used at Apple (among other companies), and WiredTiger
has been used by MongoDB since 3.2, all over the world. WiredTiger is also
used at Amazon.
Let's not talk about MongoDB, it is irrelevant and very mixed.
I am giving an example deployment of WiredTiger. WiredTiger is an ordered key-value
store that has been the storage engine of MongoDB since 3.2. It was created by an independent
company that MongoDB later acquired. It is still GPLv2 or v3. Among the founders
is one of the engineers who created bsddb, which Oracle bought. Also, I am not saying that
WiredTiger solves all the problems of MongoDB. I am just saying that, because WiredTiger has been
the storage backend of MongoDB since 3.2, it has seen widespread usage and testing.
Some say it is THE solution for scalability, others have said it was the biggest disappointment.
Some people gave warnings about the technical issues of MongoDB before 3.2.
Also, caveat emptor. The situation is better than a few years back. After all, it
has been open source / free software / source-available software since the beginning.
Like I said above, WiredTiger is not the solution to all problems. I cited WiredTiger
as a possible tool for building a cluster similar to the current one, where machines have full
copies of the data. The advantage of WiredTiger is that it is easier to set up (compared
to a distributed database), but it still requires fine-tuning / configuration. Also, there are
many other ordered key-value stores in the wild. I have documented those in the document:
In particular, if WDQS doesn't want to use ACID transactions, there might be a better
solution. Other popular options are LMDB (used in OpenLDAP) and RocksDB by
Facebook (which is a LevelDB fork). But again, that is ONE possibility; my design / database
works with any of the libraries described at the above libraries.md URL.
My recommendation for a production cluster is to use FoundationDB,
because it can scale horizontally and provides single / double / triple replication. If a
node is down, writes and reads can still continue as long as enough machines are up.
WiredTiger would be better suited for a single machine (and my database (can) support
both WiredTiger and FoundationDB with the same code base).
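
To illustrate what "the same code base" on top of several ordered key-value stores can look like,
here is a minimal sketch. It is NOT the actual code of my database: the names MemoryOKVS,
TripleStore, pack, unpack and the NUL separator are made up for this example, and the in-memory
backend only stands in for a real library. The assumption is that any backend that keeps keys
sorted and can scan a key prefix is enough to answer simple triple lookups.

```python
import bisect
from typing import Iterator, List, Optional, Tuple

SEP = b"\x00"  # hypothetical separator; assumes items never contain a NUL byte


def pack(items: Tuple[str, ...]) -> bytes:
    """Encode a tuple of strings as one byte string that sorts item by item."""
    return b"".join(item.encode("utf-8") + SEP for item in items)


def unpack(key: bytes) -> Tuple[str, ...]:
    """Inverse of pack(): split on the separator and drop the trailing empty part."""
    return tuple(part.decode("utf-8") for part in key.split(SEP)[:-1])


class MemoryOKVS:
    """Stand-in backend (assumption): keys are kept sorted, as in an ordered key-value store."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []

    def add(self, key: bytes) -> None:
        index = bisect.bisect_left(self._keys, key)
        if index == len(self._keys) or self._keys[index] != key:
            self._keys.insert(index, key)

    def scan_prefix(self, prefix: bytes) -> Iterator[bytes]:
        # Jump to the first key >= prefix, then walk forward while the prefix matches.
        index = bisect.bisect_left(self._keys, prefix)
        while index < len(self._keys) and self._keys[index].startswith(prefix):
            yield self._keys[index]
            index += 1


class TripleStore:
    """Stores (subject, predicate, object) in any backend exposing add() and scan_prefix()."""

    def __init__(self, backend) -> None:
        self.backend = backend

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.backend.add(pack((subject, predicate, obj)))

    def about(self, subject: str, predicate: Optional[str] = None) -> List[Tuple[str, ...]]:
        prefix = (subject,) if predicate is None else (subject, predicate)
        return [unpack(key) for key in self.backend.scan_prefix(pack(prefix))]


# Illustrative identifiers only.
store = TripleStore(MemoryOKVS())
store.add("Q42", "P31", "Q5")
store.add("Q42", "P106", "Q36180")
store.add("Q64", "P31", "Q515")
print(store.about("Q42"))
print(store.about("Q42", "P31"))
```

Swapping the backend would then mean writing another class with the same two methods on top
of the chosen library, while TripleStore stays unchanged. A single key order only serves lookups
that start with the subject; other query patterns would need additional orderings of the same tuples.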
Do FoundationDB and WiredTiger have any track record for hosting
open data projects or being chosen by open data projects?
tl;dr: I don't know.
Like I said previously, WiredTiger is used in many contexts; among others, it is used at Amazon Web Services (AWS).
FoundationDB is used at Apple. I don't remember which services rely on it, but at least the Data Science team
relies on it. The main contributor did a lightning talk about it:
That is the use case that looks the most like data.
More on the popularity contest: it is used at Wavefront (owned by VMware), which is an analytics tool.
Here is a talk:
JanusGraph has an FDB backend, see the talk:
It is also used at Snowflake, which is apparently a data warehouse, here is the talk:
It is also used at SkuVault as a multi-model database, see the forum topic:
Again, I think the popularity of a tool is only a hint. For instance, LevelDB is very popular but it is far
from the best in terms of speed. Similarly, I would not recommend Oracle BerkeleyDB, even
if it is owned by Oracle. That said, database configuration and fine-tuning is an art, so I probably
did something wrong with bsddb and LevelDB. Maybe. But at least, in my proposal, it is possible
to benchmark several open source vendors.
PostgreSQL and MySQL are widely used, e.g. by OpenStreetMap.
Virtuoso by DBpedia, the LOD Cloud cache and UniProt.
Yes, I know. I am waiting for a proposal to run WDQS on top of MySQL or PostgreSQL.
I don't know FoundationDB or WiredTiger, but in the past there were
often these OS projects published by large corporations that worked in-house,
Those are details I cannot know. There are a few hints in the case of WiredTiger, in the sense that there is a
specific branch that is public.
For FoundationDB, like I said previously, setting up a cluster is more demanding as it is a distributed database.
But it is also more future-proof.
> that worked in-house,
I would like to note that IF WDQS is hosted at OpenLink, the problem is the same, if not worse.
but not the OS variant. Apache UIMA was one such
example. Maybe Blazegraph works much better if you move to
Neptune, that could be a sales hook.
Any open data projects that are running open databases with
FoundationDB and WiredTiger? Where can I query them?
Thanks for asking. I will set up a WiredTiger instance of Wikidata. I need a few days, maybe a week (or two :)).
I could set up FoundationDB on a single machine instead, but it will require more time (maybe one more week).
Also, it will not support geo-queries. I will try to make labelling work, but with a custom syntax (inspired by SPARQL).
> "Evaluation of Metadata
Representations in RDF stores"
I don't understand how this is related to the scaling issues.
Not 100% pertinent, but do you have a better paper?
I have skimmed the paper. On the topic of provenance, I would argue for relying on n+1 tuple items.
Otherwise, off-topic, but I can cite Bernstein's MVCC paper, which says that read operations don't
block write operations and write operations don't block read operations, which is what WT and FDB
use internally.
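
To make the MVCC point concrete, here is a toy sketch of the idea that readers and writers
do not block each other. It is a single-threaded illustration under my own assumptions, not
how WT or FDB are implemented: the class MVCCStore and its methods are hypothetical, and a
real engine also has to deal with concurrency, conflict detection and cleanup of old versions.

```python
from typing import Dict, List, Optional, Tuple


class MVCCStore:
    """Toy multi-version store: a write adds a new version, nothing is overwritten."""

    def __init__(self) -> None:
        self._version = 0
        # key -> list of (commit_version, value), in increasing commit_version order
        self._history: Dict[str, List[Tuple[int, str]]] = {}

    def snapshot(self) -> int:
        """A reader remembers the version at which it started."""
        return self._version

    def write(self, key: str, value: str) -> int:
        """Commit a new version of key; older versions stay readable."""
        self._version += 1
        self._history.setdefault(key, []).append((self._version, value))
        return self._version

    def read(self, key: str, snapshot: int) -> Optional[str]:
        """Return the newest value committed at or before the given snapshot."""
        for version, value in reversed(self._history.get(key, [])):
            if version <= snapshot:
                return value
        return None


store = MVCCStore()
store.write("Q42:label", "Douglas Adams")
reader = store.snapshot()  # a long-running read starts here
store.write("Q42:label", "Douglas Noel Adams")  # a write commits in the meantime

print(store.read("Q42:label", reader))  # the reader still sees its own snapshot
print(store.read("Q42:label", store.snapshot()))  # a new reader sees the new value
```

The reader never waits for the writer because it only looks at versions that existed when it
started, and the writer never waits for the reader because it only appends.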
> [About proprietary version Virtuoso], I dare say [it must have] enormous advantage for us
> to consider running it in production.
That will be vendor lock-in for Wikidata and Wikimedia, along with
all the poor souls that try to interop with it.
Actually Uniprot and Kingsley suggested to host the OS version.
Sounded like this will hold for 5 more years, which is probably
the average lifecycle. There is also SPARQL, which normally
doesn't do vendor lock-ins. Maybe you mean that nobody can rent 15
servers and install the same setup as WMF for Wikidata. That would
be true. Switching always seems possible though.
Reproducibility is a key topic in engineering. That's why there is so much As-A-Service fu around.
Delegating the infrastructure to another party would be a risk and my last choice as far as my values
as a human being are concerned. That is what I mean by "portable" Wikidata. This is somewhat
off-topic, but I think that making Wikimedia infrastructure reproducible _and_ portable should be
some kind of priority for the organisation. That's why I value projects like Kiwix.
Again, thanks for the reply.