Re: [Wikidata] Wikidata query performance paper

7 Aug 2016

Hey Daniel,

On 07-08-2016 7:03, Daniel Kinzler wrote:
...
  Hi Aidan!

 Thank you for this very interesting research!

 Query performance was of course one of the key factors for selecting the
 technology to use for the query services. However, it was only one among
 several. The Wikidata use case is different from most common scenarios in some
 ways, for instance:

 * We cannot optimize for specific queries, since users are free to submit any
 query they like.
 * The data representation needs to be intuitive enough for (technically inclined)
 casual users to grasp and write queries.
 * The data doesn't hold still; it needs to be updated continuously, multiple
 times per second.
 * Our data types are more complex than usual - for instance, we support multiple
 calendar models for dates, and not only values but also different accuracies up
 to billions of years; we use "quantities" with unit and uncertainty instead of
 plain numbers, etc.

 My point is that, if we had a static data set and a handful of known queries to
 optimize for, we could have set up a relational or graph database that would be
 far more performant than what we have now. The big advantage of Blazegraph is
 its flexibility, not raw performance.

Understood. :) Taking everything into account as mentioned above, and 
based on our own experiences with various experiments in the context of 
Wikidata and other works, I think the choice to use RDF/SPARQL was the 
right one (though I would be biased on this issue since I've worked in 
the area for a long time). I guess the more difficult question then, is, 
which RDF/SPARQL implementation to choose (since any such implementation 
should cover at least points 1, 2 and 4 in a similar way), which in turn 
reduces down to the distinguishing questions of performance, licensing, 
distribution, maturity, tech support, development community, and 
non-standard features (keyword search), etc.

In terms of raw query performance, and based on what I have personally seen, I 
think Virtuoso probably has the lead at the moment in that it has 
consistently outperformed other SPARQL engines, not only in our Wikidata 
experiments, but in other benchmarks by other authors. However, taking 
all the other points into account, particularly in terms of licensing, 
Blazegraph does seem to have been a sound choice. And the current query 
service does seem to be a sound base to work forward from.

...
  It might be interesting to you to know that we initially started to implement
 the query service against a graph database, Titan - which was discontinued while
 we were still getting up to speed. Luckily this happened early on; it would have
 been quite painful to switch after we had gone live.

This is indeed good to know! (We considered other graph database 
engines, but we did not think Gremlin was a good fit with what Wikidata 
was trying to achieve in the sense of being too "imperative": though one 
can indeed do something like BGPs (basic graph patterns) with the language, it's not 
particularly easy, nor intuitive.)

Cheers,
Aidan

...
  Am 06.08.2016 um 18:19 schrieb Aidan Hogan:
  Hey all,

 Recently we wrote a paper discussing the query performance for Wikidata,
 comparing different possible representations of the knowledge-base in Postgres
 (a relational database), Neo4J (a graph database), Virtuoso (a SPARQL database)
 and BlazeGraph (the SPARQL database currently in use) for a set of equivalent
 benchmark queries.

 The paper was recently accepted for presentation at the International Semantic
 Web Conference (ISWC) 2016. A pre-print is available here:

 http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

 Of course there are some caveats with these results in the sense that perhaps
 other engines would perform better on different hardware, or different styles of
 queries: for this reason we tried to use the most general types of queries
 possible and tried to test different representations in different engines (we
 did not vary the hardware). Also in the discussion of results, we tried to give
 a more general explanation of the trends, highlighting some strengths/weaknesses
 for each engine independently of the particular queries/data.

 I think it's worth a glance for anyone who is interested in the
 technology/techniques needed to query Wikidata.

 Cheers,
 Aidan

 P.S., the paper above is a follow-up to a previous work with Markus Krötzsch
 that focussed purely on RDF/SPARQL:

 http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf

 (I'm not sure if it was previously mentioned on the list.)

 P.P.S., as someone who's somewhat of an outsider but who's been watching on for
 a few years now, I'd like to congratulate the community for making Wikidata what
 it is today. It's awesome work. Keep going. :)

 _______________________________________________
 Wikidata mailing list
 Wikidata(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata 
