Hey Markus,
On 06-08-2016 15:29, Markus Kroetzsch wrote:
Hi Aidan,
Thanks, very interesting, though I have not read the details yet.
I wonder if you have compared the actual query results you got from the
different stores. As far as I know, Neo4J actually uses a very
idiosyncratic query semantics that is neither compatible with SPARQL
(not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN
queries). So it is difficult to compare it to engines that use SQL or
SPARQL (or any other standard query language, for that matter). In this
sense, it may not be meaningful to benchmark it against such systems.
Yes, SPARQL has a homomorphism-based semantics (where a single result
can repeat an edge or node an arbitrary number of times without
problem), whereas I believe that Neo4J has a sort of
pseudo-isomorphism/no-repeated-edge semantics in its evaluation (where a
result cannot use the same edge twice, but can match the same node to
multiple variables). Our queries were generated in such a way that no
edges would be repeated. We also applied a distinct (set) semantics in
all cases. For queries that repeat edges, there would indeed be a problem.
In terms of checking answers, we cross-referenced the number of results
returned in each case. Where there were no errors (exceptions or
timeouts), the result sizes were verified to be almost identical overall
(something like 99.99% agreement). The small differences were caused by
things like BlazeGraph rejecting dates such as February 30th that other
engines accepted. We considered this close enough, as it would not
affect the performance results.
Our results and experiences were, in general, quite negative with
respect to using Neo4J at the moment. This was somewhat counter to our
initial expectations in that we thought that Wikidata would fit
naturally with the property graph model that Neo4J uses, and also more
generally in terms of the relative popularity of Neo4J [1].
We encountered a lot of issues, not only in terms of performance, but
also in terms of indexing and representation (limited support for
lookups on edge information), query language features (no RPQs: only
star on simple labels), query planning (poor selectivity decisions when
processing BGPs), etc. Our general impression is that Neo4J started with
a specific use-case in mind (traversing nodes following paths) for which
it is specialised, but does not currently work well for general basic
graph pattern matching, and hence does not match well with the Wikidata
use-case.
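On the RPQ point: to illustrate what "only star on simple labels" rules
out, here is a small Python sketch (my own toy example, with an invented
edge-labelled graph) contrasting reachability under a star over a single
label with reachability under the regular path query (p/q)*, which
SPARQL 1.1 property paths can express but a star over one simple label
cannot.

```python
from collections import deque

# Toy edge-labelled graph: (source, label, target) triples.
edges = [("a", "p", "b"), ("b", "q", "c"), ("c", "p", "d"), ("d", "q", "e")]

def reachable_by_label_star(graph, start, label):
    """Nodes reachable via zero or more edges with one fixed label:
    the kind of path a star over a simple label can express."""
    seen, queue = {start}, deque([start])
    while queue:
        n = queue.popleft()
        for s, l, t in graph:
            if s == n and l == label and t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def reachable_by_pq_star(graph, start):
    """Nodes reachable via paths matching the RPQ (p/q)*: each hop is a
    p-edge followed by a q-edge. Needs a concatenation under the star,
    so it is not a star over any single label."""
    adj = {}
    for s, l, t in graph:
        adj.setdefault((s, l), []).append(t)
    seen, queue = {start}, deque([start])
    while queue:
        n = queue.popleft()
        for mid in adj.get((n, "p"), []):      # take a p-edge...
            for t in adj.get((mid, "q"), []):  # ...then a q-edge
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen

print(sorted(reachable_by_label_star(edges, "a", "p")))  # ['a', 'b']
print(sorted(reachable_by_pq_star(edges, "a")))          # ['a', 'c', 'e']
```

The p-star from "a" stops after one hop, while (p/q)* reaches "c" and
"e"; a language limited to starring simple labels simply cannot pose the
second query.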
Regarding Virtuoso, the reason for not picking it for
Wikidata was the
lack of load-balancing support in the open source version, not the
performance of a single instance.
This is good to know! We were admittedly curious about this.
On a side note, the results we presented for BlazeGraph could improve
dramatically if one could isolate queries that timed out. Once one query
in a sequence timed out (we used server-side timeouts), we observed that
subsequent queries in the run would then also time out, possibly due to
a locking problem or resource leak. Also, Daniel mentioned that, from
discussion with the devs, they claim the current implementation works
best on SSDs; our experiments were on a standard SATA hard disk.
Best,
Aidan
[1]
http://db-engines.com/en/ranking (anecdotal of course)
On 06.08.2016 18:19, Aidan Hogan wrote:
Hey all,
Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in
Postgres (a relational database), Neo4J (a graph database), Virtuoso (a
SPARQL database) and BlazeGraph (the SPARQL database currently in use)
for a set of equivalent benchmark queries.
The paper was recently accepted for presentation at the International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Of course there are some caveats with these results in the sense that
perhaps other engines would perform better on different hardware, or
different styles of queries: for this reason we tried to use the most
general types of queries possible and tried to test different
representations in different engines (we did not vary the hardware).
Also in the discussion of results, we tried to give a more general
explanation of the trends, highlighting some strengths/weaknesses for
each engine independently of the particular queries/data.
I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.
Cheers,
Aidan
P.S., the paper above is a follow-up to a previous work with Markus
Krötzsch that focussed purely on RDF/SPARQL:
http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf
(I'm not sure if it was previously mentioned on the list.)
P.P.S., as someone who's somewhat of an outsider but who's been watching
on for a few years now, I'd like to congratulate the community for
making Wikidata what it is today. It's awesome work. Keep going. :)
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata