Hi Stas,
[I'm sorry, I just realised this email was mysteriously sent before it
was finished. I'll respond in a moment to your other mail.]
On 06-08-2016 17:38, Stas Malyshev wrote:
Hi!
The paper was recently accepted for presentation at the International Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Thank you for the link!
It would be interesting to see actual data representations used for RDF
(e.g. examples of the data or more detailed description). I notice that
they differ substantially from what we use in the Wikidata Query service
implementation, used with Blazegraph, and also some of the performance
features we have implemented are probably not part of your
implementation. In any case, it would be interesting to know the details
of which RDF representations were used.
There's a brief summary in the paper of the models used. As for all the "gory" details of how everything was generated, (hopefully) everything relevant supporting the paper should be available here:
http://users.dcc.uchile.cl/~dhernand/wquery/
The RDF representations are summarised in Figure 2. The code we used to
generate those representations is mentioned here:
http://users.dcc.uchile.cl/~dhernand/wquery/#download-the-code
http://users.dcc.uchile.cl/~dhernand/wquery/#translate-the-data-to-rdf
Note we did not consider any "direct triples" in the representations
since we felt this would effectively be "covered" by the Named Graphs
representation. Rather than mixing direct triples and reified
representations (like in the current service), we chose to keep them
separate.
I also note that only statements and qualifiers are mentioned in most of the text, with very little mention of sitelinks and references. Were they part of the model too?
We just generalised sitelinks and references as a special type of
qualifier (actually I don't think the paper mentions sitelinks but we
mention this in the context of references).
Due to the different RDF semantics, it would also be interesting to get more details about how the example queries were translated to the RDF representation(s) used in the article. Was it an automatic process or were they translated manually? Is it possible to see them?
I guess that depends on what you mean by "automatic" or "manual". :)
Automatic scripts were manually coded to convert from the JSON dump to
each representation. The code is linked above.
We didn't put the datasets up (since the raw data and the code are provided and can be used to regenerate them, and the RDF datasets are obviously large), but if you want a copy of the raw RDF data we generated, let me know.
When working on the Query Service implementation, we considered a number of possible representations, with regard to both performance and semantic completeness. One of the conclusions was that achieving adequate semantic completeness and performance on a relational database, while allowing people to (relatively) easily write complex queries, is not possible, due to relational engines not being a good match for the hierarchical graph-like structures in Wikidata.
I'm not sure I follow on this part, in particular on the part of
"semantic completeness" and why this is hard to achieve in the context
of relational databases. (I get the gist but don't understand enough to
respond directly ... but perhaps below I can answer indirectly?)
It would be interesting to look at the Postgres implementation of the data model and queries to see whether your conclusions were different in this case.
A sketch of the relational schema is given in Figure 3 of the paper (which is not too dissimilar to the Named Graphs representation for RDF), and some more low-level details, including code and details on indexing, are given in the link above. This was something we admittedly had to play around with quite a bit.
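For a rough flavour of the shape of it (note: the table and column names here are invented for this sketch, not taken verbatim from the paper, so please check the link for the real thing):

  -- One row per statement.
  CREATE TABLE claims (
    claim_id TEXT PRIMARY KEY,  -- statement identifier
    entity   TEXT,              -- subject item, e.g. 'Q42'
    property TEXT,              -- e.g. 'P69'
    value    TEXT               -- object value (a JSON string in our case)
  );

  -- Qualifiers (and, as above, references treated as a special type of
  -- qualifier) hang off the statement identifier.
  CREATE TABLE qualifiers (
    claim_id TEXT REFERENCES claims (claim_id),
    property TEXT,
    value    TEXT
  );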
Our general experiences of using Postgres were:
* It's very good for simple queries that involve a single join through a primary/foreign key (a caveat here: we used the "direct client" of Postgres since we could not find an HTTP client like the other engines offer).
* It's not so good when there are a lot of "self-joins" in the query (compared with Virtuoso), like for "bushy queries" (or what we call "snowflake queries"; see the sketch after this list), or when multiple values for a tuple are given (i.e., a single pattern contains multiple constants) but none of them is particularly selective on its own. We figure that perhaps Virtuoso has special optimisations for such self-joins since they would be much more common in an RDF/SPARQL scenario than a relational/SQL scenario.
* Encoding object values with different datatypes (booleans, dates, etc.) was a pain. One option was to have separate tables/columns for each datatype, which would complicate queries and also leave the question of how to add calendars, precisions, etc. Another option was to use JSON strings to encode the values (the version of Postgres we used just considered these as strings, but I think newer versions have some JSONB(?) support that could help get around this; also sketched below).
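To make the last two points a bit more concrete, here's a rough sketch reusing the hypothetical tables above (the entity/property ids are real Wikidata ids, but the data shape is simplified, with object values shown as bare ids for readability):

  -- "Snowflake" query: one subject restricted by several of its own
  -- claims, i.e., repeated self-joins on the same table; e.g., humans
  -- (Q5) who are scientists (Q901) with citizenship of Chile (Q298):
  SELECT c1.entity
  FROM claims c1
  JOIN claims c2 ON c2.entity = c1.entity
  JOIN claims c3 ON c3.entity = c1.entity
  WHERE c1.property = 'P31'  AND c1.value = 'Q5'
    AND c2.property = 'P106' AND c2.value = 'Q901'
    AND c3.property = 'P27'  AND c3.value = 'Q298';

  -- For the datatypes: with JSON-encoded values, newer Postgres can
  -- parse them for typed access (or the column could be jsonb
  -- directly), e.g., date-of-birth claims with day precision (11):
  SELECT claim_id
  FROM claims
  WHERE property = 'P569'
    AND (value::jsonb->>'precision')::int >= 11;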
Probably some of these issues could be resolved by playing around with the schema and/or the indexing but, perhaps relating to what you were saying, the result would be a pretty "exceptional" schema that is difficult to write queries for.
A more general problem we encountered:
* SQL has poor support for arbitrary-length path queries (RPQs/property paths). You can express something similar using the WITH RECURSIVE feature, but this is a much more general construct, and it did not work well in our initial experiments with Postgres. We don't really report the details of this in the paper, but our experience is that Postgres would not support these well. A lot of the examples we saw on the query service use the * or + feature of SPARQL property paths (esp. for types). This would be an issue in Postgres (perhaps it could be partially solved by materialising some transitive closures, e.g., on types, but something as flexible as property paths didn't seem feasible to us).
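For illustration, over the hypothetical claims table sketched above, a query-service-style SPARQL pattern like ?x wdt:P31/wdt:P279* wd:Q515 ("?x is an instance of city, or of some transitive subclass of city") would come out roughly as:

  -- Transitive closure over 'subclass of' (P279), seeded with Q515
  -- (city); UNION (rather than UNION ALL) deduplicates, which also
  -- guards against cycles in the class hierarchy.
  WITH RECURSIVE subclasses(id) AS (
      VALUES ('Q515')
    UNION
      SELECT c.entity
      FROM claims c JOIN subclasses s ON c.value = s.id
      WHERE c.property = 'P279'
  )
  SELECT c.entity
  FROM claims c JOIN subclasses s ON c.value = s.id
  WHERE c.property = 'P31';

This works for one fixed path, but hand-writing (and optimising) something like this for every */+ pattern is where it stops being practical.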
Cheers,
Aidan