Hi Aidan!
Thank you for this very interesting research!
Query performance was of course on of the key factors for selecting the
technology to use for the query services. However, it was only one among several
more. The Wikidata use case is different from most common scenarios in some
ways, for instance:
* We cannot optimize for specific queries, since users are free to submit any
query they like.
* The data representation needs to be intuitive enough for (thenically inclined)
casual users to grasp and write queries.
* The data doesn't hold still, it needs to be updated continuously, mutliple
times per second.
* Our data types are more complex than usual - for instance, we suppor tmultiple
calendar models fro dates, and not only values but also different accuracies up
to billions of years; we use "quantities" with unit and uncertainty instead of
plain numbers, etc.
My point is that, if we had a static data set and a handful of known queries to
optimize for, we could have set up a relational or graph database that would be
far more performant than what we have now. The big advantage of Blazegraph is
its felxibility, not raw performance.
It might be interesting to you to know that we initially started to implement
the query service against a graph database, Titan - which was discontinued while
we were still getting up to speed. Luckily this happened early on, it would have
been quite painful to switch after we had gone live.
-- daniel
Am 06.08.2016 um 18:19 schrieb Aidan Hogan:
Hey all,
Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in Postgres
(a relational database), Neo4J (a graph database), Virtuoso (a SPARQL database)
and BlazeGraph (the SPARQL database currently in use) for a set of equivalent
benchmark queries.
The paper was recently accepted for presentation at the International Semantic
Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Of course there are some caveats with these results in the sense that perhaps
other engines would perform better on different hardware, or different styles of
queries: for this reason we tried to use the most general types of queries
possible and tried to test different representations in different engines (we
did not vary the hardware). Also in the discussion of results, we tried to give
a more general explanation of the trends, highlighting some strengths/weaknesses
for each engine independently of the particular queries/data.
I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.
Cheers,
Aidan
P.S., the paper above is a follow-up to a previous work with Markus Krötzsch
that focussed purely on RDF/SPARQL:
http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf
(I'm not sure if it was previously mentioned on the list.)
P.P.S., as someone who's somewhat of an outsider but who's been watching on for
a few years now, I'd like to congratulate the community for making Wikidata what
it is today. It's awesome work. Keep going. :)
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.