Hi Joachim,
I think the problem is not answering your query in 5 min or so (Wikidata
Toolkit on my laptop takes 27 min without a database, simply by parsing
the whole data file, so any database that already has the data should be
much faster). The bigger issue is that you would have to configure the
site to allow queries to run for 5 min before timing out. This would mean
that other queries that never terminate (because they are really hard)
could also run for at least this long. It seems that this could easily
cause the service to break down.
Maybe one could have an "unstable" service on a separate machine that
does the same as WDQS but with a much more liberal timeout and less
availability (if it's overloaded a lot, it will just be down more often,
but you would know when you use it that this is the deal).
Cheers,
Markus
On 11.02.2016 15:54, Neubert, Joachim wrote:
Hi Stas,
Thanks for your answer. You asked how long the query runs: 8.21 sec (having processed
6443 triples) in an example invocation. If roughly linear, that could mean 800-1500 sec
for the whole set. However, I would expect a clearly shorter runtime: I routinely use
queries of similar complexity and result sizes on ZBW's public endpoints. One
arbitrarily selected query which extracts data from GND runs for less than two minutes to
produce 1.2m triples.
Given the size of Wikidata, I wouldn't consider such a use abusive. Of course, if
you have lots of competing queries and resources are limited, it is completely legitimate
to implement some policy which formulates limits and enforces them technically (throttle
long-running queries, or limit the number of produced triples, or the execution time,
or whatever seems reasonable and can be implemented).
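A result-size cap, at least, can also be expressed in the query itself rather than in server policy; a minimal sketch (P227, Wikidata's GND ID property, used for illustration; the wdt: prefix is predefined on WDQS):

```sparql
# Cap the number of constructed triples client-side via LIMIT
CONSTRUCT { ?item wdt:P227 ?gnd . }
WHERE { ?item wdt:P227 ?gnd . }
LIMIT 10000
```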
Anyway, in this case (truncation in the middle of a statement), it looks much more like
a technical bug (or an obscure timeout somewhere along the way). The execution time and
the result size vary widely:
5.44s empty result
8.60s 2090 triples
5.44s empty result
22.70s 27352 triples
Can you reproduce this kind of results with the given query, or with other supposedly
longer-running queries?
Thanks again for looking into this.
Cheers, Joachim
PS. I plan to set up my own Wikidata SPARQL endpoint to do more complex things, but that
depends on a new machine which will be available in a few months. For now, I'd just like
to know which of "our" persons (economists and the like) have Wikipedia pages.
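The query under discussion is not quoted in full in this thread; a sketch of the kind of CONSTRUCT it describes, assuming P227 (GND ID) and the schema:about sitelink model that WDQS exposes, restricted here to English Wikipedia for illustration:

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>

CONSTRUCT {
  ?item wdt:P227 ?gnd .            # Wikidata-to-GND mapping
  ?article schema:about ?item .    # Wikipedia page for the item
}
WHERE {
  ?item wdt:P227 ?gnd .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
```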
PPS. From my side, I would much rather have built a query which asks for exactly
the GND IDs I'm interested in (about 430,000 out of millions of GNDs). This would have
led to a much smaller result - but I cannot squeeze that query into a GET request ...
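One way around the GET length limit, sketched under the assumption that the endpoint accepts SPARQL 1.1 Protocol POST requests (where the query travels in the request body, with no URL length constraint): batch the wanted IDs through a VALUES clause, a few thousand per request. The GND ID strings below are placeholders:

```sparql
CONSTRUCT {
  ?item wdt:P227 ?gnd .
  ?article schema:about ?item .
}
WHERE {
  # One batch of the wanted GND IDs; repeat the request per batch
  VALUES ?gnd { "118540238" "118505173" "11850553X" }
  ?item wdt:P227 ?gnd .
  OPTIONAL { ?article schema:about ?item . }
}
```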
-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On Behalf Of Stas Malyshev
Sent: Thursday, February 11, 2016 01:35
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi!
I try to extract all mappings from wikidata to
the GND authority file,
along with the according wikipedia pages, expecting roughly 500,000 to
1m triples as result.
As a starting note, I don't think extracting 1M triples is the best way to use the
query service. If you need to do processing that returns such big result sets - in the
millions - maybe processing the dump - e.g. with Wikidata Toolkit at
https://github.com/Wikidata/Wikidata-Toolkit - would be a better idea?
However, with various calls, I get much less
triples (about 2,000 to
10,000). The output seems to be truncated in the middle of a statement, e.g.
It may be some kind of timeout because of the quantity of the data being sent. How long
does such request take?
--
Stas Malyshev
smalyshev(a)wikimedia.org
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata