Re: [Wikidata] SPARQL CONSTRUCT results truncated

12 Feb 2016

It's great how this discussion evolves - thanks to everybody!

Technically, I completely agree that in practice it may prove impossible to predict the
load a query will produce. Relational databases have invested years and years in query
optimization (e.g., Oracle's cost-based optimizer, which relies on extended statistics
gathered during runtime), and I can't see that similar investments are possible for
triple stores.

What I could imagine for public endpoints is the SPARQL engine monitoring and prioritizing
queries: the longer a query has been running, or the more resources it has already used,
the lower the priority at which it is re-scheduled (up to some final limit). But this is
just a theoretical consideration. I'm not aware of any system that implements anything
like this - and it could be implemented only in the engine itself.
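
To make the idea concrete, here is a minimal sketch of such usage-based demotion, written as a toy scheduler outside of any real engine. Everything in it is hypothetical: the Job class, the run function, the abstract "work units" and the hard_limit value are assumptions for illustration only, and a real engine would have to measure actual CPU time, memory and I/O.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: float                        # lower value = scheduled sooner
    seq: int                               # tie-breaker among equal priorities
    name: str = field(compare=False)
    remaining: int = field(compare=False)  # hypothetical work units still needed
    used: int = field(default=0, compare=False)


def run(jobs, slice_units=1, hard_limit=10):
    """Run jobs in slices, demoting each job by the resources it has used.

    jobs: iterable of (name, work_units). Returns (finished, killed) name lists.
    """
    counter = itertools.count()
    queue = [Job(0.0, next(counter), name, work) for name, work in jobs]
    heapq.heapify(queue)
    finished, killed = [], []
    while queue:
        job = heapq.heappop(queue)
        step = min(slice_units, job.remaining)
        job.remaining -= step
        job.used += step
        if job.remaining == 0:
            finished.append(job.name)
        elif job.used >= hard_limit:
            killed.append(job.name)        # the "final limit": give up on the job
        else:
            job.priority = float(job.used)  # more usage -> later in the queue
            job.seq = next(counter)
            heapq.heappush(queue, job)
    return finished, killed
```

With this policy, short queries submitted alongside an expensive one still finish early, while the expensive one keeps sinking in the queue until it either completes or reaches the limit.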

For ZBW's SPARQL endpoints, I've implemented a much simpler three-level strategy, which
does not involve the engine at all:

1. Endpoints which drive production-level services (e.g. autosuggest or retrieval
enhancement functions). These endpoints run on separate machines and offer completely
encapsulated services via a public API (http://zbw.eu/beta/econ-ws), without any direct
SPARQL access.

2. Public "beta" endpoints (http://zbw.eu/beta/sparql). These offer unrestricted
SPARQL access, but without any guarantees about performance or availability - though of
course I do my best to keep these up and running. They run on their own virtual machine, and
should not hurt any other parts of the infrastructure when getting overloaded or out of
control.

3. Public "experimental" endpoints. These include in particular an endpoint for
the GND dataset with 130 million triples. It was mainly created for internal use because
(to the best of my knowledge) no other public GND endpoint exists. The endpoint is not linked from the
GND pages of DNB, and I've advertised it very low-key on a few mailing lists. For
these experimental endpoints, we reserve the right to shut them down for the public if
they get flooded with more requests than they can handle.

It may be of interest that, up to now, we have not come across issues with attacks or
malicious queries on any of these public endpoints (which were a matter of concern when I
started this in 2009), nor with long-lasting massive access. Of course, that is different
for Wikidata, where the data is of interest to _many_ more people. But if at all
affordable, I'd like to encourage offering some kind of experimental access with really
wide limits in an "unstable" setting, in addition to the reliable services. For most
people who just want to check something out, it's not an option to download the whole
dataset and set up an infrastructure for it. For us, this was an issue even with the much
smaller GND set.

The Linked Data Fragments approach Osma mentioned is very interesting (particularly the
bit about setting it up on top of a regularly updated existing endpoint), and could
provide another alternative, but I have not yet experimented with it.

Have a fine weekend - Joachim

-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On Behalf Of Markus
Krötzsch
Sent: Friday, 12 February 2016 09:44
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

On 12.02.2016 00:04, Stas Malyshev wrote:
...
  Hi!

  We basically have two choices: either we offer a limited interface 
 that only allows for a narrow range of queries to be run at all. Or 
 we offer a very general interface that can run arbitrary queries, but 
 we impose limits on time and memory consumption. I would actually 
 prefer the first option, because it's more predictable, and doesn't get
 people's hopes up too far. What do you think?

 That would require implementing pretty smart SPARQL parser... I don't 
 think it worth the investment of time. I'd rather put caps on runtime 
 and maybe also on parallel queries per IP, to ensure fair access. We 
 may also have a way to run longer queries - in fact, we'll need it 
 anyway if we want to automate lists - but that is longer term, we'll 
 need to figure out infrastructure for that and how we allocate access.

+1

Restricting queries syntactically to be "simpler" is what we did in Semantic
MediaWiki (because MySQL did not support time/memory limits per query). It is a
workaround, but it will not prevent long-running queries unless you make the syntactic
restrictions really severe (and thereby forbid many simple queries, too). I would not do
it if there is support for time/memory limits instead.
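
As an illustration of the caps discussed above (a runtime limit plus a limit on parallel queries per IP), here is a minimal sketch of a guard that a front end could place around query execution. The QueryLimiter class, its parameters and the check_deadline helper are hypothetical names invented for this example and not part of any SPARQL engine. It also assumes cooperative checking: in practice the engine's own timeout support would be needed to actually interrupt a running query.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class QueryLimiter:
    """Hypothetical guard: caps parallel queries per IP and sets a deadline."""

    def __init__(self, max_parallel_per_ip=2, max_runtime_s=30.0):
        self.max_parallel_per_ip = max_parallel_per_ip
        self.max_runtime_s = max_runtime_s
        self._active = defaultdict(int)   # ip -> number of running queries
        self._lock = threading.Lock()

    @contextmanager
    def slot(self, ip):
        # Reject immediately if this client already uses all of its slots.
        with self._lock:
            if self._active[ip] >= self.max_parallel_per_ip:
                raise RuntimeError(f"too many parallel queries from {ip}")
            self._active[ip] += 1
        deadline = time.monotonic() + self.max_runtime_s
        try:
            yield deadline
        finally:
            # Always free the slot, even if the query failed or timed out.
            with self._lock:
                self._active[ip] -= 1


def check_deadline(deadline):
    """Call periodically during evaluation; raises once the cap is exceeded."""
    if time.monotonic() > deadline:
        raise TimeoutError("query exceeded its runtime cap")
```

Unlike syntactic restrictions, a guard of this kind does not need to understand the query at all. It only bounds what any single client can consume.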

In the end, even the SPARQL engines are not able to predict reliably how complicated a
query is going to be -- it's an important part of their work (for optimising query
execution), but it is also very difficult.

Markus

...

_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
