Hey Markus,
On 06-08-2016 15:29, Markus Kroetzsch wrote:
Hi Aidan,
Thanks, very interesting, though I have not read the details yet.
I wonder if you have compared the actual query results you got from the different stores. As far as I know, Neo4J actually uses a very idiosyncratic query semantics that is neither compatible with SPARQL (not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN queries). So it is difficult to compare it to engines that use SQL or SPARQL (or any other standard query language, for that matter). In this sense, it may not be meaningful to benchmark it against such systems.
Yes, SPARQL has a homomorphism-based semantics (where a single result can repeat an edge or node an arbitrary number of times without problem), whereas I believe that Neo4J uses a sort of pseudo-isomorphism, no-repeated-edge semantics in its evaluation (where a result cannot use the same edge twice, but can match the same node to multiple variables). Our queries were generated in such a way that no edges would be repeated. We also applied a distinct (set) semantics in all cases. For queries that do repeat edges, there would indeed be a problem.
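To make the difference concrete, here is a minimal sketch (my own toy matcher, not any engine's actual implementation) of evaluating a two-edge pattern over a one-edge graph under both semantics: homomorphism-based matching can bind both pattern edges to the same graph edge, while a no-repeated-edge semantics cannot.

```python
from itertools import product

# Tiny directed graph: a single edge a -> b
edges = [("a", "b")]

# Pattern: two independent edge triples (?x)-->(?y) and (?z)-->(?w)
pattern = [("x", "y"), ("z", "w")]

def matches(edges, pattern, distinct_edges):
    """Enumerate variable bindings that satisfy the pattern.
    If distinct_edges is True, no graph edge may be bound by two
    pattern edges (no-repeated-edge, Cypher-style); otherwise edges
    may be reused freely (homomorphism, SPARQL-style)."""
    results = []
    for assignment in product(range(len(edges)), repeat=len(pattern)):
        if distinct_edges and len(set(assignment)) != len(assignment):
            continue  # the same graph edge was used twice: skip
        binding, ok = {}, True
        for (src_var, tgt_var), edge_idx in zip(pattern, assignment):
            src, tgt = edges[edge_idx]
            for var, val in ((src_var, src), (tgt_var, tgt)):
                if binding.setdefault(var, val) != val:
                    ok = False  # conflicting binding for this variable
        if ok:
            results.append(binding)
    return results

# Homomorphism semantics: one result (both pattern edges map to a->b)
print(len(matches(edges, pattern, distinct_edges=False)))
# No-repeated-edge semantics: zero results
print(len(matches(edges, pattern, distinct_edges=True)))
```

So the two semantics can disagree even on result counts, which is why queries that never need to repeat an edge (as in our benchmark) are the safe common ground for comparison.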
In terms of checking answers, we cross-referenced the number of results returned in each case. Where there were no errors (exceptions or timeouts), the result sizes overall were verified to be almost the same (something like 99.99% agreement). The small differences were caused by things like BlazeGraph rejecting dates such as February 30th that other engines accepted. We considered this close enough that it would not affect the performance results.
Our results and experiences were, in general, quite negative with respect to using Neo4J at the moment. This was somewhat counter to our initial expectations in that we thought that Wikidata would fit naturally with the property graph model that Neo4J uses, and also more generally in terms of the relative popularity of Neo4J [1].
We encountered a lot of issues, not only in terms of performance, but also in terms of indexing and representation (limited support for lookups on edge information), query language features (no regular path queries (RPQs): only star over simple labels), query planning (poor selectivity decisions when processing basic graph patterns), etc. Our general impression is that Neo4J started with a specific use-case in mind (traversing nodes following paths) for which it is specialised, but does not currently work well for general basic graph pattern matching, and hence does not match well with the Wikidata use-case.
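As an illustration of the RPQ point: a SPARQL 1.1 property path can freely combine concatenation and star over different properties, as in the following (hypothetical) query for instances of cities using Wikidata's P31 ("instance of") and P279 ("subclass of"); Cypher's variable-length patterns at the time only allowed a star over a single relationship type, so paths like this had no direct equivalent.

```sparql
# All items that are instances of (some transitive subclass of) city (Q515):
# concatenation and star mix two different properties in one path.
SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q515 .
}
```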
Regarding Virtuoso, the reason for not picking it for Wikidata was the lack of load-balancing support in the open source version, not the performance of a single instance.
This is good to know! We were admittedly curious about this.
On a side note, the results we presented for BlazeGraph could improve dramatically if one could isolate queries that timed out. Once one query in a sequence timed out (we used server-side timeouts), we observed that subsequent queries in the run would then also time out, possibly due to a locking problem or resource leak. Also, Daniel mentioned that, from discussion with the devs, they claim the current implementation works best on SSDs; our experiments were on a standard SATA hard disk.
Best, Aidan
[1] http://db-engines.com/en/ranking (anecdotal of course)
On 06.08.2016 18:19, Aidan Hogan wrote:
Hey all,
Recently we wrote a paper discussing query performance for Wikidata, comparing different possible representations of the knowledge base in Postgres (a relational database), Neo4J (a graph database), Virtuoso (a SPARQL database) and BlazeGraph (the SPARQL database currently in use) for a set of equivalent benchmark queries.
The paper was recently accepted for presentation at the International Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Of course there are some caveats with these results, in the sense that other engines might perform better on different hardware or on different styles of queries: for this reason we tried to use the most general types of queries possible and tried to test different representations in different engines (we did not vary the hardware). Also, in the discussion of results, we tried to give a more general explanation of the trends, highlighting some strengths/weaknesses of each engine independently of the particular queries/data.
I think it's worth a glance for anyone who is interested in the technology/techniques needed to query Wikidata.
Cheers, Aidan
P.S., the paper above is a follow-up to a previous work with Markus Krötzsch that focussed purely on RDF/SPARQL:
http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf
(I'm not sure if it was previously mentioned on the list.)
P.P.S., as someone who's somewhat of an outsider but who's been watching on for a few years now, I'd like to congratulate the community for making Wikidata what it is today. It's awesome work. Keep going. :)
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata