On Thu, 6 Jun 2019 at 21:33, Guillaume Lederrey glederrey@wikimedia.org wrote:
Hello all!
There have been a number of concerns raised about the performance and scaling of the Wikidata Query Service. We share those concerns and we are doing our best to address them. Here is some info about what is going on:
In an ideal world, WDQS should:
- scale in terms of data size
- scale in terms of number of edits
- have low update latency
- expose a SPARQL endpoint for queries
- allow anyone to run any queries on the public WDQS endpoint
- provide great query performance
- provide a high level of availability
I will add that, in an ideal world, setting up Wikidata (i.e. the interface that allows edits, the entity search service, and WDQS) and the Wikidata tools should be (more) accessible.
Scaling graph databases is a "known hard problem", and we are reaching a scale where there are no obvious easy solutions to address all the above constraints. At this point, just "throwing hardware at the problem" is not an option anymore.
Reasonably, addressing all of the above constraints is unlikely to ever happen.
never say never ;-)
For example, the update process is asynchronous. It is by nature expected to lag. In the best case, this lag is measured in minutes, but can climb to hours occasionally. This is a case of prioritizing stability and correctness (ingesting all edits) over update latency. And while we can work to reduce the maximum latency, this will still be an asynchronous process and needs to be considered as such.
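To make the consequence of that asynchrony concrete, here is a minimal Python sketch of how a client could decide whether the data it is about to query is fresh enough. It assumes the client has already obtained, by some means not shown here, an ISO 8601 timestamp of the last edit applied to the query service. The names `update_lag_seconds` and `is_fresh_enough` and the 300-second threshold are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def update_lag_seconds(last_update_iso: str, now: datetime) -> float:
    """Return the seconds elapsed between the last applied edit and `now`.

    `last_update_iso` is assumed to be an ISO 8601 UTC timestamp such as
    "2019-06-06T21:30:00Z" (hypothetical input, obtained elsewhere).
    """
    # fromisoformat() does not accept a trailing "Z" before Python 3.11,
    # so normalise it to an explicit UTC offset first.
    last_update = datetime.fromisoformat(last_update_iso.replace("Z", "+00:00"))
    return (now - last_update).total_seconds()


def is_fresh_enough(last_update_iso: str, now: datetime,
                    max_lag_seconds: float = 300.0) -> bool:
    """True if the lag is within the tolerance (threshold is an assumption)."""
    return update_lag_seconds(last_update_iso, now) <= max_lag_seconds


now = datetime(2019, 6, 6, 21, 33, tzinfo=timezone.utc)
print(update_lag_seconds("2019-06-06T21:30:00Z", now))  # lag of three minutes
print(is_fresh_enough("2019-06-06T19:00:00Z", now))     # lag of several hours
```

A tool that edits an item and then immediately queries it would check something like this and retry later instead of assuming the edit is already visible.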
We currently have one Blazegraph expert working with us to address a number of performance and stability issues. We are planning to hire an additional engineer to help us support the service in the long term. You can follow our current work in Phabricator [2].
If anyone has experience with scaling large graph databases, please reach out to us, we're always happy to share ideas!
Good luck!
Thanks all for your patience!
Guillaume
[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
Here is my point of view regarding some discussion happening in the talk page:
Giving up on SPARQL.
There is an ongoing effort to draft version 1.2 of SPARQL: https://github.com/w3c/sparql-12. Now is the right time to give some feedback.
Also, look at https://github.com/w3c/EasierRDF/
JanusGraph http://janusgraph.org/ (successor of Titan, now part of DataStax) - Written in Java, using scalable data storage (Cassandra/HBase) and indexing engines (Elasticsearch/Solr), queryable
That would make Wikidata much less accessible. Even though JanusGraph has an Oracle Berkeley DB backend, the full-text search and geospatial indices live in yet another process.
I can't think of any other way than transforming the Wikidata RDF representation to a more suitable one for property-graph engines
FWIW, OpenCog's AtomSpace has a Neo4j backend, but they do not use it.
Also, property-graph engines make it slow to represent things like:
("wikidata", "used-by", "opencog") ("wikidata", "used-by", "google")
That is, one has to create a hyper-edge to be able to query those facts.
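For comparison, here is a minimal Python sketch of how the two facts above look in a plain triple representation, where querying is just pattern matching with wildcards. It is a toy in-memory illustration, not the API of any real engine; `FACTS` and `match` are hypothetical names.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# The two facts from above, stored as (subject, predicate, object) triples.
FACTS = [
    ("wikidata", "used-by", "opencog"),
    ("wikidata", "used-by", "google"),
]


def match(facts: Iterable[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for s, p, o in facts:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# "Who uses wikidata?" is one pattern with a wildcard in the object position.
users = [o for _, _, o in match(FACTS, subject="wikidata", predicate="used-by")]
print(users)
```

Each fact stays a first-class, directly queryable item, with no intermediate node or hyper-edge needed.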
Best regards,
Amirouche ~ amz3