Re: [Wikidata] Scaling Wikidata Query Service

12 Jun 2019

Hello Sebastian,

First, thanks a lot for the reply. I was starting to believe that what I was
saying was complete nonsense.

On Wed, 12 Jun 2019 at 16:51, Sebastian Hellmann <
hellmann(a)informatik.uni-leipzig.de> wrote:

...
  Hi Amirouche,
 On 12.06.19 14:07, Amirouche Boubekki wrote:

  So there needs to be some smarter solution, one
that we'd be unlikely to develop in-house

 Big cat, small fish. As Wikidata continues to grow, it will have specific
 needs, needs that are unlikely to be solved by off-the-shelf solutions.

 Are you suggesting to develop the database in-house?
 Yes! At least part of it: the domain-specific part.

...
  even MediaWiki uses MySQL
 Yes, but not because of its technical merits. The same goes for PHP.
Historically, PHP and MySQL were easy to set up and easy to use
but otherwise difficult to work with. This is/was painful enough that
nowadays the go-to RDBMS is PostgreSQL, even if MySQL is still
very popular [0][1]. Those are technical reasons. Also, I agree that
the fact that MySQL had no ACID guarantees when it started in 1995
does not make it a bad choice nowadays.

[0] https://trends.google.com/trends/explore?q=MySQL,PostgreSQL
[1] https://stackshare.io/stackups/mysql-vs-postgresql

...
  but one that has already been verified by industry
experience and other
 deployments.

 FoundationDB and WiredTiger are used, respectively, at Apple (among other
 companies) and in MongoDB since 3.2, all over the world. WiredTiger is also
 used at Amazon.

 Let's not talk about MongoDB, it is irrelevant and very mixed.

I am giving an example deployment of WiredTiger. WiredTiger is an ordered
key-value store that has been the storage engine of MongoDB since 3.2. It
was created by an independent company that MongoDB later acquired. It is
still licensed under GPLv2 or v3. Among its founders is one of the engineers
who created bsddb, which Oracle bought. Also, I am not saying that WiredTiger
solves all the problems of MongoDB. I am just saying that because WiredTiger
has been the storage backend of MongoDB since 3.2, it has seen widespread
usage and testing.

Some say it is THE solution for scalability, others have said it was the
biggest disappointment.

Some people gave warnings about the technical issues of MongoDB before 3.2.
Also, caveat emptor. The situation is better than a few years back. After
all, it has been open source / free software / source-available software
since the beginning.

Like I said above, WiredTiger is not the solution to all problems. I cited
WiredTiger as a possible tool for building a cluster similar to the current
one, where machines have full copies of the data. The advantage of WiredTiger
is that it is easier to set up (compared to a distributed database), but it
still requires fine-tuning / configuration. Also, there are many other
ordered key-value stores in the wild. I have documented those in the
document:

https://github.com/scheme-requests-for-implementation/srfi-167/blob/master/…

In particular, if WDQS doesn't want to use ACID transactions, there might
be a better solution. Other popular options are LMDB (used in OpenLDAP) and
RocksDB by Facebook (a LevelDB fork). But again, that is ONE possibility: my
design / database works with any of the libraries described at the above
libraries.md URL.
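
To make the "ordered key-value store" idea concrete, here is a minimal sketch of how triples could be laid out in such a store so that a lookup becomes a prefix scan over sorted keys. Everything in it is an assumption made for illustration: `MemoryOKVS`, `pack`, `add_triple` and the three index names are hypothetical, the in-memory sorted list only stands in for a real library, and this is not the API of WiredTiger, LMDB or RocksDB, nor the design proposed in this thread.

```python
import bisect
from typing import Iterator, List, Tuple

SEP = b"\x00"  # assumption: key components never contain a NUL byte


class MemoryOKVS:
    """Toy in-memory stand-in for an ordered key-value store (hypothetical)."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []  # keys kept in sorted (byte) order
        self._values = {}             # key -> value

    def set(self, key: bytes, value: bytes = b"") -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def prefix(self, prefix: bytes) -> Iterator[Tuple[bytes, bytes]]:
        """Yield (key, value) pairs whose key starts with `prefix`, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so nothing further can match
            yield key, self._values[key]


def pack(*parts: str) -> bytes:
    """Join components into one key; sort order follows the components."""
    return SEP.join(p.encode("utf-8") for p in parts)


def unpack(key: bytes) -> Tuple[str, ...]:
    return tuple(p.decode("utf-8") for p in key.split(SEP))


# Each triple is stored once per ordering, so that different query
# shapes can all be answered by a prefix scan.
INDEXES = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


def add_triple(db: MemoryOKVS, s: str, p: str, o: str) -> None:
    triple = (s, p, o)
    for name, order in INDEXES.items():
        db.set(pack(name, *(triple[i] for i in order)))


def match_subject(db: MemoryOKVS, s: str) -> List[Tuple[str, ...]]:
    """Return every (predicate, object) stored for subject `s`."""
    prefix = pack("spo", s) + SEP  # trailing SEP keeps Q42 from matching Q420
    return [unpack(key)[2:] for key, _ in db.prefix(prefix)]


def match_predicate_object(db: MemoryOKVS, p: str, o: str) -> List[str]:
    """Return every subject that has predicate `p` with object `o`."""
    prefix = pack("pos", p, o) + SEP
    return [unpack(key)[3] for key, _ in db.prefix(prefix)]
```

With triples such as `("Q42", "P31", "Q5")` loaded through `add_triple`, `match_subject(db, "Q42")` reads a contiguous slice of the "spo" index and `match_predicate_object(db, "P31", "Q5")` reads a slice of the "pos" index. The only operations required from the underlying library are "set a key" and "scan keys in order from a starting point".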

My recommendation for a production cluster is to use FoundationDB, because
it can scale horizontally and provides single / double / triple replication.
If a node is down, writes and reads can still continue as long as enough
machines are up.

WiredTiger would be better suited for a single machine (and my database
can support both WiredTiger and FoundationDB with the same code base).
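
One way a single code base can sit on top of several storage engines is to write the application code against a small ordered key-value interface and plug a backend in behind it. The sketch below illustrates that idea under stated assumptions: `OrderedKV`, `SortedListBackend`, `SortOnReadBackend`, `load` and `scan_prefix` are hypothetical names, and the two toy backends are plain Python, not bindings for WiredTiger or FoundationDB.

```python
import bisect
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple


class OrderedKV(Protocol):
    """Minimal interface the application code relies on (hypothetical)."""

    def set(self, key: bytes, value: bytes) -> None: ...

    def range(self, start: bytes, end: bytes) -> Iterator[Tuple[bytes, bytes]]: ...


class SortedListBackend:
    """Toy backend that keeps keys sorted at write time."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._data: Dict[bytes, bytes] = {}

    def set(self, key: bytes, value: bytes) -> None:
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def range(self, start: bytes, end: bytes) -> Iterator[Tuple[bytes, bytes]]:
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]


class SortOnReadBackend:
    """Toy backend with a different implementation: it sorts at read time."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def set(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def range(self, start: bytes, end: bytes) -> Iterator[Tuple[bytes, bytes]]:
        for key in sorted(self._data):
            if start <= key < end:
                yield key, self._data[key]


def load(db: OrderedKV, items: Iterable[Tuple[bytes, bytes]]) -> None:
    """Application code: it only uses the OrderedKV interface."""
    for key, value in items:
        db.set(key, value)


def scan_prefix(db: OrderedKV, prefix: bytes) -> List[Tuple[bytes, bytes]]:
    """Return all pairs under `prefix`, in key order.

    Assumption: keys are UTF-8 text, so the byte 0xFF never occurs in them
    and `prefix + b"\\xff"` is a valid exclusive upper bound.
    """
    return list(db.range(prefix, prefix + b"\xff"))
```

Because `load` and `scan_prefix` never touch a backend's internals, the same calls return the same results on either backend, and comparing engines comes down to swapping the object passed in.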

Do FoundationDB and WiredTiger have any track record for hosting open data
projects or being chosen by open data projects?
 tl;dr: I don't know.

Like I said previously, WiredTiger is used in many contexts; among others, it
is used at Amazon Web Services (AWS).

FoundationDB is used at Apple. I don't remember which services rely on it,
but at least the Data Science team relies on it. The main contributor gave a
lightning talk about it:

   Entity Store: A FoundationDB Layer for Versioned Entities with Fine
Grained <https://youtu.be/16uU_Aaxp9Y>

That is the use case that looks the most like data.

More on the popularity contest: it is used at Wavefront (owned by VMware),
which is an analytics tool. Here is a talk:

  Running FDB at scale <https://youtu.be/M438R4SlTFE>

JanusGraph has an FDB backend; see the talk:

  The JanusGraph FoundationDB Storage Adapter <https://youtu.be/rQM_ZPZy8Ck>

It is also used at Snowflake <https://www.snowflake.com/>, which is
apparently a data warehouse; here is the talk:

   How FoundationDB powers SnowflakeDB's metadata
<https://youtu.be/KkeyjFMmIf8>

It is also used at SkuVault as a multi-model database; see the forum topic:

https://forums.foundationdb.org/t/success-story-foundationdb-at-skuvault/336
<https://youtu.be/KkeyjFMmIf8>

Again, I think the popularity of a tool is only a hint. For instance,
LevelDB is very popular, but it is far from the best in terms of speed.
Similarly, I would not recommend Oracle BerkeleyDB, even if it is owned by
Oracle. That said, database configuration and fine-tuning is an art, so I
probably did something wrong with bsddb and LevelDB. Maybe. But at least, in
my proposal it is possible to benchmark several open source vendors.

PostgreSQL and MySQL are widely used, e.g. OpenStreetMaps. Virtuoso by
DBpedia, LODCloud cache and Uniprot.

Yes, I know. I am waiting for a proposal to run WDQS on top of MySQL or
PostgreSQL.

I don't know FoundationDB or WiredTiger, but in the past there were often
these OS projects published by large corporations that worked in-house,

Those are details I cannot know. There are a few hints in the case of
WiredTiger, in the sense that there are branches named after MongoDB, e.g.
https://github.com/wiredtiger/wiredtiger/tree/mongodb-4.0 so it seems
MongoDB uses a specific branch that is public.

For FoundationDB, like I said previously, setting up a cluster is more
demanding as it is a distributed database.
But it is also more future-proof.

...
  that worked in-house, 
I would like to note that IF WDQS is hosted at OpenLink, the problem is
the same, if not worse.

...
  but not the OS variant. Apache UIMA was one such
example. Maybe Blazegraph
 works much better if you move to Neptune, that could be a sales hook.

 Any open data projects that are running open databases with FoundationDB
 and WiredTiger? Where can I query them?

Thanks for asking. I will set up a WiredTiger instance of Wikidata. I need
a few days, maybe a week (or two :)).

I could set up FoundationDB on a single machine instead, but it will require
more time (maybe one more week).

Also, it will not support geo-queries. I will try to make labelling work,
but with a custom syntax (inspired by SPARQL).

...

"Evaluation of Metadata Representations in RDF stores" 
 I don't understand how this is related to the scaling issues.

 Not 100% pertinent, but do you have a better paper?

I have only skimmed the paper. On the topic of provenance, I would argue
for relying on n+1 tuple items, that is, tuples with one extra element.
Otherwise, off-topic, but I can cite Bernstein's MVCC paper, which says that
read operations don't block write operations and write operations don't
block read operations, which is what WT and FDB use internally.
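
As an illustration of the "n+1 tuple" idea for provenance, the sketch below stores 4-tuples (subject, predicate, object, source) instead of triples, so that the extra element records where a statement comes from and can be queried like any other position. It is only a sketch under assumptions: `TupleStore` and the `source:A` / `source:B` labels are hypothetical, and a plain in-memory set stands in for a real storage engine.

```python
from typing import List, Optional, Set, Tuple


class TupleStore:
    """Toy store for fixed-arity tuples (hypothetical, in-memory only)."""

    def __init__(self, arity: int) -> None:
        self.arity = arity
        self._items: Set[Tuple[str, ...]] = set()

    def add(self, *item: str) -> None:
        if len(item) != self.arity:
            raise ValueError(f"expected {self.arity} elements, got {len(item)}")
        self._items.add(item)

    def match(self, *pattern: Optional[str]) -> List[Tuple[str, ...]]:
        """Return stored tuples matching `pattern`; None acts as a wildcard."""
        if len(pattern) != self.arity:
            raise ValueError(f"expected {self.arity} elements, got {len(pattern)}")
        return sorted(
            item
            for item in self._items
            if all(p is None or p == v for p, v in zip(pattern, item))
        )


# Triples (n = 3) become 4-tuples: the fourth element carries provenance.
store = TupleStore(arity=4)
store.add("Q42", "P569", "1952-03-11", "source:A")
store.add("Q42", "P569", "1952-03-11", "source:B")
store.add("Q42", "P31", "Q5", "source:A")

# Which sources state a date of birth for Q42?
sources = [item[3] for item in store.match("Q42", "P569", None, None)]

# Everything that a given source states.
from_b = store.match(None, None, None, "source:B")
```

The same statement can appear once per source, and a query that ignores provenance simply leaves the last position as a wildcard.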

...

  [About proprietary version Virtuoso], I dare say
[it must have] enormous  advantage for us to consider running it in production.

 That will be vendor lock-in for Wikidata and Wikimedia, along with all the
 poor souls that try to interop with it.

 Actually Uniprot and Kingsley suggested to host the OS version. Sounded
 like this will hold for 5 more years, which is probably the average
 lifecycle. There is also SPARQL, which normally doesn't do vendor lock-ins.
 Maybe you mean that nobody can rent 15 servers and install the same setup
 as WMF for Wikidata. That would be true. Switching always seems possible
 though.
 Reproducibility is a key topic in engineering. That's why there is so much
As-A-Service fu around.

Delegating the infrastructure to another party would be a risk, and my last
choice as far as my values as a human being are concerned. That is what I
mean by "portable" Wikidata. This is somewhat off-topic, but I think that
making the Wikimedia infrastructure reproducible _and_ portable should be
some kind of priority for the organisation. That's why I value projects
like Kiwix.

Again, thanks for the reply.
