Hi Stas,
[I'm sorry, I just realised this email was mysteriously sent before it
was finished. I'll respond in a moment to your other mail.]
On 06-08-2016 17:38, Stas Malyshev wrote:
Hi!
The paper was recently accepted for presentation at the International Semantic Web Conference (ISWC) 2016. A pre-print is available here:
http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf
Thank you for the link!
It would be interesting to see actual data representations used for RDF
(e.g. examples of the data or more detailed description). I notice that
they differ substantially from what we use in the Wikidata Query service
implementation, used with Blazegraph, and also some of the performance
features we have implemented are probably not part of your
implementation. In any case, it would be interesting to know the details
of which RDF representations were used.
There's a brief summary in the paper of the models used. As for all the "gory" details of how everything was generated, (hopefully) everything relevant supporting the paper should be available here:
http://users.dcc.uchile.cl/~dhernand/wquery/
The RDF representations are summarised in Figure 2. The code we used to
generate those representations is mentioned here:
http://users.dcc.uchile.cl/~dhernand/wquery/#download-the-code
http://users.dcc.uchile.cl/~dhernand/wquery/#translate-the-data-to-rdf
Note we did not consider any "direct triples" in the representations
since we felt this would effectively be "covered" by the Named Graphs
representation. Rather than mixing direct triples and reified
representations (like in the current service), we chose to keep them
separate.
I also note that only statements and qualifiers are mentioned in most of the text, with very little mention of sitelinks and references. Were they part of the model too?
We just generalised sitelinks and references as a special type of
qualifier (actually I don't think the paper mentions sitelinks but we
mention this in the context of references).
Due to the different RDF semantics, it would also be interesting to get more details about how the example queries were translated to the RDF representation(s) used in the article. Was it an automatic process or were they translated manually? Is it possible to see them?
I guess that depends on what you mean by "automatic" or "manual". :)
Automatic scripts were manually coded to convert from the JSON dump to
each representation. The code is linked above.
We didn't put the datasets up (since the raw data and the code are provided and can be used to regenerate them, and the RDF datasets are obviously large), but if you want a copy of the raw RDF data we generated, let me know.
When working on the Query Service implementation, we considered a number of possible representations, with regard to both performance and semantic completeness. One of the conclusions was that achieving adequate semantic completeness and performance on a relational database, while allowing people to (relatively) easily write complex queries, is not possible, due to relational engines not being a good match for the hierarchical graph-like structures in Wikidata.
I'm not sure I follow on this part, in particular on the part of
"semantic completeness" and why this is hard to achieve in the context
of relational databases. (I get the gist but don't understand enough to
respond directly ... but perhaps below I can answer indirectly?)
It would be interesting to look at the Postgres implementation of the data model and queries to see whether your conclusions were different in this case.
A sketch of the relational schema is given in Figure 3 of the paper (which is not too dissimilar to the Named Graphs representation for RDF), and some more low-level details, including code and details on indexing, are given in the link above. This was something we admittedly had to play around with quite a bit.
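For a rough flavour of the shape of it (note: the table and column names here are invented for this sketch, not taken verbatim from the paper, so please check the link for the real thing):

  -- One row per statement.
  CREATE TABLE claims (
    claim_id TEXT PRIMARY KEY,  -- statement identifier
    entity   TEXT,              -- subject item, e.g. 'Q42'
    property TEXT,              -- e.g. 'P69'
    value    TEXT               -- object value (a JSON string in our case)
  );

  -- Qualifiers (and, as above, references treated as a special type of
  -- qualifier) hang off the statement identifier.
  CREATE TABLE qualifiers (
    claim_id TEXT REFERENCES claims (claim_id),
    property TEXT,
    value    TEXT
  );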
Our general experiences of using Postgres were:
* It's very good for simple queries that involve a single join through a primary/foreign key (a caveat here: we used the "direct client" of Postgres since we could not find an HTTP client like the other engines offer).
* It's not so good when there are a lot of "self-joins" in the query (compared with Virtuoso), like for "bushy queries" (or what we call "snowflake queries"; see the sketch after this list), or when multiple values for a tuple are given (i.e., a single pattern contains multiple constants) but none of them is particularly selective on its own. We figure that perhaps Virtuoso has special optimisations for such self-joins since they would be much more common in an RDF/SPARQL scenario than a relational/SQL scenario.
* Encoding object values with different datatypes (booleans, dates, etc.) was a pain. One option was to have separate tables/columns for each datatype, which would complicate queries and also leave the question of how to add calendars, precisions, etc. Another option was to use JSON strings to encode the values (the version of Postgres we used just considered these as strings, but I think newer versions have some JSONB(?) support that could help get around this; also sketched below).
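To make the last two points a bit more concrete, here's a rough sketch reusing the hypothetical tables above (the entity/property ids are real Wikidata ids, but the data shape is simplified, with object values shown as bare ids for readability):

  -- "Snowflake" query: one subject restricted by several of its own
  -- claims, i.e., repeated self-joins on the same table; e.g., humans
  -- (Q5) who are scientists (Q901) with citizenship of Chile (Q298):
  SELECT c1.entity
  FROM claims c1
  JOIN claims c2 ON c2.entity = c1.entity
  JOIN claims c3 ON c3.entity = c1.entity
  WHERE c1.property = 'P31'  AND c1.value = 'Q5'
    AND c2.property = 'P106' AND c2.value = 'Q901'
    AND c3.property = 'P27'  AND c3.value = 'Q298';

  -- For the datatypes: with JSON-encoded values, newer Postgres can
  -- parse them for typed access (or the column could be jsonb
  -- directly), e.g., date-of-birth claims with day precision (11):
  SELECT claim_id
  FROM claims
  WHERE property = 'P569'
    AND (value::jsonb->>'precision')::int >= 11;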
Probably some of these issues could be resolved by playing around with the schema and/or the indexing but, perhaps relating to what you were saying, the result would be a pretty "exceptional" schema that is difficult to write queries for.
A more general problem we encountered:
* SQL has poor support for arbitrary-length path queries (RPQs/property paths). You can express something similar using the WITH RECURSIVE feature, but this is a much more general construct, and it did not work well in our initial experiments with Postgres. We don't really report the details of this in the paper, but our experience is that Postgres would not support these well. A lot of the examples we saw on the query service use the * or + feature of SPARQL property paths (esp. for types). This would be an issue in Postgres (perhaps it could be partially solved by materialising some transitive closures, e.g., on types, but something as flexible as property paths didn't seem feasible to us).
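For illustration, over the hypothetical claims table sketched above, a query-service-style SPARQL pattern like ?x wdt:P31/wdt:P279* wd:Q515 ("?x is an instance of city, or of some transitive subclass of city") would come out roughly as:

  -- Transitive closure over 'subclass of' (P279), seeded with Q515
  -- (city); UNION (rather than UNION ALL) deduplicates, which also
  -- guards against cycles in the class hierarchy.
  WITH RECURSIVE subclasses(id) AS (
      VALUES ('Q515')
    UNION
      SELECT c.entity
      FROM claims c JOIN subclasses s ON c.value = s.id
      WHERE c.property = 'P279'
  )
  SELECT c.entity
  FROM claims c JOIN subclasses s ON c.value = s.id
  WHERE c.property = 'P31';

This works for one fixed path, but hand-writing (and optimising) something like this for every */+ pattern is where it stops being practical.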
Cheers,
Aidan