Hey Markus,
On 06-08-2016 15:29, Markus Kroetzsch wrote:
Hi Aidan,
Thanks, very interesting, though I have not read the details yet.
I wonder if you have compared the actual query results you got from the
different stores. As far as I know, Neo4J actually uses a very
idiosyncratic query semantics that is neither compatible with SPARQL
(not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN
queries). So it is difficult to compare it to engines that use SQL or
SPARQL (or any other standard query language, for that matter). In this
sense, it may not be meaningful to benchmark it against such systems.
Yes, SPARQL has a homomorphism-based semantics (where a single result
can repeat an edge or node an arbitrary number of times without
problem), whereas I believe that Neo4J has a sort of
pseudo-isomorphism/no-repeated-edge semantics in its evaluation (where a
result cannot use the same edge twice, but can match the same node to
multiple variables). Our queries were generated in such a way that no
edges would be repeated. We also applied a distinct (set) semantics in
all cases. For queries that repeat edges, there would indeed be a problem.
In terms of checking answers, we cross-referenced the number of results
returned in each case. Where there were no errors (exceptions or
timeouts), the result sizes were verified to be almost identical overall
(something like 99.99% agreement). The small differences were caused by
things like BlazeGraph rejecting dates such as February 30th that other
engines accepted. We considered this close enough, as it would not
affect the performance results.
Our results and experiences were, in general, quite negative with
respect to using Neo4J at the moment. This was somewhat counter to our
initial expectations in that we thought that Wikidata would fit
naturally with the property graph model that Neo4J uses, and also more
generally in terms of the relative popularity of Neo4J [1].
We encountered a lot of issues, not only in terms of performance, but
also in terms of indexing and representation (limited support for
lookups on edge information), query language features (no RPQs: only
star on simple labels), query planning (poor selectivity decisions when
processing BGPs), etc. Our general impression is that Neo4J started with
a specific use-case in mind (traversing nodes following paths) for which
it is specialised, but does not currently work well for general basic
graph pattern matching, and hence does not match well with the Wikidata
use-case.
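On the RPQ point: to illustrate what "only star on simple labels" rules
out, here is a small Python sketch (my own toy example, with an invented
edge-labelled graph) contrasting reachability under a star over a single
label with reachability under the regular path query (p/q)*, which
SPARQL 1.1 property paths can express but a star over one simple label
cannot.

```python
from collections import deque

# Toy edge-labelled graph: (source, label, target) triples.
edges = [("a", "p", "b"), ("b", "q", "c"), ("c", "p", "d"), ("d", "q", "e")]

def reachable_by_label_star(graph, start, label):
    """Nodes reachable via zero or more edges with one fixed label:
    the kind of path a star over a simple label can express."""
    seen, queue = {start}, deque([start])
    while queue:
        n = queue.popleft()
        for s, l, t in graph:
            if s == n and l == label and t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def reachable_by_pq_star(graph, start):
    """Nodes reachable via paths matching the RPQ (p/q)*: each hop is a
    p-edge followed by a q-edge. Needs a concatenation under the star,
    so it is not a star over any single label."""
    adj = {}
    for s, l, t in graph:
        adj.setdefault((s, l), []).append(t)
    seen, queue = {start}, deque([start])
    while queue:
        n = queue.popleft()
        for mid in adj.get((n, "p"), []):      # take a p-edge...
            for t in adj.get((mid, "q"), []):  # ...then a q-edge
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
    return seen

print(sorted(reachable_by_label_star(edges, "a", "p")))  # ['a', 'b']
print(sorted(reachable_by_pq_star(edges, "a")))          # ['a', 'c', 'e']
```

The p-star from "a" stops after one hop, while (p/q)* reaches "c" and
"e"; a language limited to starring simple labels simply cannot pose the
second query.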
Regarding Virtuoso, the reason for not picking it for
Wikidata was the
lack of load-balancing support in the open source version, not the
performance of a single instance.
This is good to know! We were admittedly curious about this.
On a side note, the results we presented for BlazeGraph could improve
dramatically if one could isolate queries that timed out. Once one query
in a sequence timed out (we used server-side timeouts), we observed that
subsequent queries in the run would then also time out, possibly due to
a locking problem or resource leak. Also, Daniel mentioned that, from
discussion with the devs, they claim the current implementation works
best on SSDs; our experiments were on a standard SATA hard disk.
Best,
Aidan
[1]
http://db-engines.com/en/ranking (anecdotal of course)
On 06.08.2016 18:19, Aidan Hogan wrote:
Hey all,
Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in
Postgres (a relational database), Neo4J (a graph database), Virtuoso (a
SPARQL database) and BlazeGraph (the SPARQL database currently in use)
for a set of equivalent benchmark queries.
The paper was recently accepted for presentation at the International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Of course there are some caveats with these results in the sense that
perhaps other engines would perform better on different hardware, or
different styles of queries: for this reason we tried to use the most
general types of queries possible and tried to test different
representations in different engines (we did not vary the hardware).
Also in the discussion of results, we tried to give a more general
explanation of the trends, highlighting some strengths/weaknesses for
each engine independently of the particular queries/data.
I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.
Cheers,
Aidan
P.S., the paper above is a follow-up to a previous work with Markus
Krötzsch that focussed purely on RDF/SPARQL:
http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf
(I'm not sure if it was previously mentioned on the list.)
P.P.S., as someone who's somewhat of an outsider but who's been watching
on for a few years now, I'd like to congratulate the community for
making Wikidata what it is today. It's awesome work. Keep going. :)
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata