Hi Joachim,
I think the problem is not to answer your query in 5 minutes or so (Wikidata Toolkit on my laptop takes 27 minutes without a database, simply by parsing the whole data file, so any database that already has the data should be much faster). The bigger issue is that you would have to configure the site to let queries run for 5 minutes before timing out. This would mean that other queries that never terminate (because they are really hard) could also run for at least that long. It seems that this could easily cause the service to break down.
Maybe one could have an "unstable" service on a separate machine that does the same as WDQS but with a much more liberal timeout and less availability (if it's overloaded a lot, it will just be down more often, but you would know when you use it that this is the deal).
Cheers,
Markus
On 11.02.2016 15:54, Neubert, Joachim wrote:
Hi Stas,
Thanks for your answer. You asked how long the query runs: 8.21 sec (having processed 6443 triples) in an example invocation. If roughly linear, that could mean 800-1500 sec for the whole set. However, I would expect a clearly shorter runtime: I routinely use queries of similar complexity and result size on ZBW's public endpoints. One arbitrarily selected query which extracts data from GND runs for less than two minutes and produces 1.2m triples.
Given the size of Wikidata, I wouldn't consider such a use abusive. Of course, if you have lots of competing queries and resources are limited, it is completely legitimate to implement a policy which formulates limits and enforces them technically (throttle long-running queries, or limit the number of produced triples, or the execution time, or whatever seems reasonable and can be implemented).
Anyway, in this case (truncation in the middle of a statement), it looks much more like a technical bug (or an obscure timeout somewhere along the way). The execution time and the result size vary widely:
5.44s   empty result
8.60s   2090 triples
5.44s   empty result
22.70s  27352 triples
Can you reproduce this kind of result with the given query, or with other supposedly longer-running queries?
Thanks again for looking into this.
Cheers, Joachim
PS. I plan to set up my own Wikidata SPARQL endpoint to do more complex things, but that depends on a new machine which will be available in a few months. For now, I'd just like to know which of "our" persons (economists and the like) have Wikipedia pages.
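For illustration, the kind of query I have in mind looks roughly like the following (a minimal, untested sketch in Python against the public WDQS endpoint; P227 is the GND ID property, and the schema:about / schema:isPartOf pattern is how I understand WDQS models sitelinks):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    # Items with a GND ID and an English Wikipedia article (small LIMIT for testing).
    query = """
    SELECT ?item ?gnd ?article WHERE {
      ?item wdt:P227 ?gnd .
      ?article schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> .
    }
    LIMIT 100
    """

    r = requests.get(ENDPOINT,
                     params={"query": query},
                     headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    for row in r.json()["results"]["bindings"]:
        print(row["gnd"]["value"], row["article"]["value"])

For the full extraction I would of course drop the LIMIT, which brings me back to the result size and timeout questions above.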
PPS. From my side, I would much rather have built a query which asks for exactly the GND IDs I'm interested in (about 430,000 out of millions of GNDs). This would have led to a much smaller result - but I cannot squeeze that query into a GET request ...
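(For what it's worth, here is a rough, untested sketch of how I would try to work around the GET limit, assuming the endpoint also accepts POST as per the SPARQL protocol: split the GND IDs into chunks and send each chunk as a VALUES block. The input file name and chunk size are just placeholders.)

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    CHUNK = 500  # arbitrary chunk size

    # Hypothetical file with one GND ID per line.
    with open("gnd_ids.txt") as f:
        gnd_ids = [line.strip() for line in f if line.strip()]

    template = """
    SELECT ?gnd ?item ?article WHERE {{
      VALUES ?gnd {{ {values} }}
      ?item wdt:P227 ?gnd .
      OPTIONAL {{ ?article schema:about ?item ;
                           schema:isPartOf <https://en.wikipedia.org/> . }}
    }}
    """

    for i in range(0, len(gnd_ids), CHUNK):
        values = " ".join('"%s"' % g for g in gnd_ids[i:i + CHUNK])
        r = requests.post(ENDPOINT,
                          data={"query": template.format(values=values)},
                          headers={"Accept": "application/sparql-results+json"})
        r.raise_for_status()
        for row in r.json()["results"]["bindings"]:
            print(row["gnd"]["value"],
                  row["item"]["value"],
                  row.get("article", {}).get("value", ""))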
-----Original Message----- From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Stas Malyshev Sent: Thursday, 11 February 2016 01:35 To: Discussion list for the Wikidata project. Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi!
I try to extract all mappings from Wikidata to the GND authority file, along with the corresponding Wikipedia pages, expecting roughly 500,000 to 1m triples as result.
As a starting note, I don't think extracting 1M triples is the best way to use the query service. If you need processing that returns such big result sets - in the millions - maybe processing the dump, e.g. with Wikidata Toolkit at https://github.com/Wikidata/Wikidata-Toolkit, would be a better idea?
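For example (a very rough, untested sketch - not Wikidata Toolkit itself, but reading the entities JSON dump directly in Python; the dump file name is a placeholder for whatever local copy you have, and the dump contains one JSON entity per line):

    import bz2
    import json

    # Local copy of the Wikidata entities JSON dump (one entity per line).
    DUMP = "wikidata-all.json.bz2"

    with bz2.open(DUMP, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entity = json.loads(line)
            for claim in entity.get("claims", {}).get("P227", []):
                snak = claim.get("mainsnak", {})
                if snak.get("snaktype") != "value":
                    continue
                gnd = snak["datavalue"]["value"]  # GND identifier string
                title = entity.get("sitelinks", {}).get("enwiki", {}).get("title", "")
                print(entity["id"], gnd, title)

That scans the whole dump once, but it never hits a query timeout and you get the complete result.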
However, with various calls, I get far fewer triples (about 2,000 to 10,000). The output seems to be truncated in the middle of a statement, e.g.
It may be some kind of timeout because of the quantity of the data being sent. How long does such request take?
-- Stas Malyshev smalyshev@wikimedia.org
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata