Hello Sebastian,
First, thanks a lot for the reply. I had started to believe that what I was saying was complete nonsense.
On Wed, 12 June 2019 at 16:51, Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Hi Amirouche, On 12.06.19 14:07, Amirouche Boubekki wrote:
So there needs to be some smarter solution, one that we're unlikely to
develop in-house
Big cat, small fish. As Wikidata continues to grow, it will have specific needs, needs that are unlikely to be solved by off-the-shelf solutions.
Are you suggesting to develop the database in-house?
Yes! At least part of it. The domain specific part.
even MediaWiki uses MySQL
Yes, but it is not because of its technical merits. The same goes for PHP. Historically, PHP and MySQL were easy to set up and easy to use but otherwise difficult to work with. This is/was painful enough that nowadays the go-to RDBMS is PostgreSQL, even if MySQL is still very popular [0][1]. Those are technical reasons. Also, I agree that the lack of ACID guarantees when MySQL started in 1995 does not make it a bad choice nowadays.
[0] https://trends.google.com/trends/explore?q=MySQL,PostgreSQL
[1] https://stackshare.io/stackups/mysql-vs-postgresql
but one that has already been verified by industry experience and other deployments.
FoundationDB is used at Apple (among other companies), and WiredTiger has been used in MongoDB deployments all over the world since 3.2. WiredTiger is also used at Amazon.
Let's not talk about MongoDB, it is irrelevant and very mixed.
I am giving an example deployment of WiredTiger. WiredTiger is an ordered key-value store that has been the storage engine of MongoDB since 3.2. It was created by an independent company that MongoDB later acquired. It is still licensed under GPLv2 or v3. Among its founders is one of the engineers who created bsddb, which Oracle later bought. Also, I am not saying WiredTiger solves all the problems of MongoDB. I am just saying that, because WiredTiger has been the storage backend of MongoDB since 3.2, it has seen widespread usage and testing.
Some say it is THE solution for scalability, others have said it was the
biggest disappointment.
Some people gave warnings about the technical issues of MongoDB before 3.2. Also, caveat emptor. The situation is better than a few years back. After all, it has been open source / free software / source-available software since the beginning.
Like I said above, WiredTiger is not the solution to all problems. I cited WiredTiger as a possible tool for building a cluster similar to the current one, where machines have full copies of the data. The advantage of WiredTiger is that it is easier to set up (compared to a distributed database), but it still requires fine-tuning / configuration. Also, there are many other ordered key-value stores in the wild. I have documented those in the document:
https://github.com/scheme-requests-for-implementation/srfi-167/blob/master/l...
In particular, if WDQS doesn't want to use ACID transactions, there might be a better solution. Other popular options are LMDB (used in OpenLDAP) and RocksDB by Facebook (which is a LevelDB fork). But again, that is ONE possibility: my design / database works with any of the libraries described at the libraries.md URL above.
My recommendation for a production cluster is to use FoundationDB, because it can scale horizontally and provides single / double / triple replication. If a node is down, writes and reads can still continue as long as enough machines are up.
WiredTiger would be better suited for a single machine (and my database can support both WiredTiger and FoundationDB with the same code base).
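To illustrate what "the same code base" could mean, here is a minimal sketch, not my actual implementation. It assumes the storage layer is hidden behind a tiny ordered key-value interface; `MemoryKV`, `add_triple` and `triples_with_subject` are hypothetical names, and the in-memory class is only a stand-in for a wrapper around WiredTiger or FoundationDB (none of this is the API of either library).

```python
import bisect


class MemoryKV:
    """In-memory stand-in for an ordered key-value store (hypothetical).

    A real backend would expose the same two methods on top of
    WiredTiger or FoundationDB.
    """

    def __init__(self):
        self._keys = []    # keys kept sorted in byte order
        self._values = {}  # key -> value

    def set(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


SEP = "\x00"  # assumed separator; items must not contain it


def add_triple(kv, subject, predicate, obj):
    """Store one triple as a single key; the value is unused."""
    kv.set(SEP.join((subject, predicate, obj)).encode("utf-8"), b"")


def triples_with_subject(kv, subject):
    """Prefix scan: all triples whose first item is `subject`."""
    start = (subject + SEP).encode("utf-8")
    end = subject.encode("utf-8") + b"\x01"  # first key after the prefix
    for key, _ in kv.range(start, end):
        yield tuple(key.decode("utf-8").split(SEP))


kv = MemoryKV()
add_triple(kv, "Q42", "P31", "Q5")
add_triple(kv, "Q421", "P31", "Q5")
add_triple(kv, "Q42", "P106", "Q36180")
print(list(triples_with_subject(kv, "Q42")))
```

The query code only ever calls `set` and `range`, so swapping the backend means swapping the class that provides those two methods.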
Do FoundationDB and WiredTiger have any track record for hosting open data
projects or being chosen by open data projects?
tl;dr: I don't know.
Like I said previously, WiredTiger is used in many contexts; among others, it is used at Amazon Web Services (AWS).
FoundationDB is used at Apple. I don't remember which services rely on it, but at least the Data Science team relies on it. The main contributor gave a lightning talk about it:
Entity Store: A FoundationDB Layer for Versioned Entities with Fine Grained https://youtu.be/16uU_Aaxp9Y
That is the use case that looks the most like a data project.
More on the popularity contest: it is used at Wavefront (owned by VMware), which is an analytics tool. Here is a talk:
Running FDB at scale https://youtu.be/M438R4SlTFE
JanusGraph has an FDB backend, see the talk:
The JanusGraph FoundationDB Storage Adapter https://youtu.be/rQM_ZPZy8Ck
It is also used at Snowflake https://www.snowflake.com/, which is apparently a data warehouse. Here is the talk:
How FoundationDB powers SnowflakeDB's metadata https://youtu.be/KkeyjFMmIf8
It is also used at SkuVault as a multi-model database, see the forum topic:
https://forums.foundationdb.org/t/success-story-foundationdb-at-skuvault/336 https://youtu.be/KkeyjFMmIf8
Again, I think the popularity of a tool is only a hint. For instance, LevelDB is very popular but it is far from the best in terms of speed. Similarly, I would not recommend Oracle BerkeleyDB, even if it is owned by Oracle. That said, database configuration and fine-tuning is an art, so I probably did something wrong with bsddb and LevelDB. Maybe. But at least, in my proposal it is possible to benchmark several open source vendors.
PostgreSQL and MySQL are widely used, e.g. by OpenStreetMap. Virtuoso is used by
DBpedia, LODCloud cache and Uniprot.
Yes, I know. I am waiting for a proposal to run WDQS on top of MySQL or PostgreSQL.
I don't know FoundationDB or WiredTiger, but in the past there were often
these OS projects published by large corporations that worked in-house,
Those are details I do not have access to. There are a few hints in the case of WiredTiger: there are branches named after MongoDB releases, e.g. https://github.com/wiredtiger/wiredtiger/tree/mongodb-4.0, so it seems MongoDB uses a specific branch that is public.
For FoundationDB, like I said previously, setting up a cluster is more demanding as it is a distributed database. But it is also more future-proof.
that worked in-house,
I would like to note that IF WDQS is hosted at OpenLink, the problem is the same, if not worse.
but not the OS variant. Apache UIMA was one such example. Maybe Blazegraph works much better if you move to Neptune, that could be a sales hook.
Any open data projects that are running open databases with FoundationDB and WiredTiger? Where can I query them?
Thanks for asking. I will set up a WiredTiger instance of Wikidata. I need a few days, maybe a week (or two :)).
I could set up FoundationDB on a single machine instead, but it will require more time (maybe one more week).
Also, it will not support geo-queries. I will try to make labelling work, but with a custom syntax (inspired by SPARQL).
"Evaluation of Metadata Representations in RDF stores"
I don't understand how this is related to the scaling issues.
Not 100% pertinent, but do you have a better paper?
I have skimmed the paper. On the topic of provenance, I would argue for relying on tuples of n+1 items. Otherwise, off-topic, but I can cite Bernstein's MVCC paper, which says that read operations don't block write operations and write operations don't block read operations, which is what WT and FDB use internally.
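To make the "n+1 items" idea concrete, here is a minimal sketch under my own assumptions: a triple is extended with one extra item that records provenance, and a query is a pattern where `None` is a wildcard. The `match` helper, the `source:A` / `source:B` labels and the sample data are hypothetical, for illustration only.

```python
def match(store, pattern):
    """Yield the tuples of `store` that agree with `pattern`.

    `None` in the pattern matches any value; tuples of a different
    length than the pattern are ignored.
    """
    for item in store:
        if len(item) != len(pattern):
            continue
        if all(p is None or p == v for p, v in zip(pattern, item)):
            yield item


# Triples (n = 3) plus one provenance item, so 4-tuples.
store = {
    ("Q42", "P31", "Q5", "source:A"),
    ("Q42", "P31", "Q5", "source:B"),
    ("Q42", "P106", "Q36180", "source:A"),
}

# Everything asserted by source A, whatever the triple.
from_a = sorted(match(store, (None, None, None, "source:A")))
print(from_a)
```

The same statement can then be stored once per source, and provenance is queried like any other item of the tuple.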
[About proprietary version Virtuoso], I dare say [it must have] enormous
advantage for us to consider running it in production.
That will be vendor lock-in for Wikidata and Wikimedia, along with all the poor souls that try to interop with it.
Actually Uniprot and Kingsley suggested to host the OS version. Sounded like this will hold for 5 more years, which is probably the average lifecycle. There is also SPARQL, which normally doesn't do vendor lock-ins. Maybe you mean that nobody can rent 15 servers and install the same setup as WMF for Wikidata. That would be true. Switching always seems possible though.
Reproducibility is a key topic in engineering. That's why there is so much As-A-Service fu around.
Delegating the infrastructure to another party would be a risk, and my last choice as far as my values as a human being are concerned. That is what I mean by "portable" Wikidata. This is somewhat off-topic, but I think that making the Wikimedia infrastructure reproducible _and_ portable should be some kind of priority for the organisation. That's why I value projects like Kiwix.
Again, thanks for the reply.