I am trying to extract all mappings from Wikidata to the GND authority file, along with the corresponding Wikipedia pages, expecting roughly 500,000 to 1 million triples as a result.
However, with various calls, I get far fewer triples (about 2,000 to 10,000). The output seems to be truncated in the middle of a statement, e.g.
... <http://d-nb.info/gnd/121043053> <http://www.w3.org/2004/02/skos/core#exactMatch> <http://www.wikidata.org/entity/Q39963> .
<http://d-nb.info/gnd/121043053> <http://schema.org/about> <https://de.wikipedia.org/wiki/Park%20Kyung-ni> .
<http://d-nb.info/gnd/121043053> <http://schema.org/about> <https://en.wikipedia.org/wiki/Pa
The query (below) is called like this:
curl -X GET -H "Accept: text/plain" --silent https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=$ENCODED_QUERY -o /tmp/mappings.nt
Using Turtle or RDF/XML as the output format also results in syntactically invalid truncation in the middle of a statement. Adding "--no-buffer" to the curl command does not change anything.
Am I doing something wrong? Are there built-in limitations for the endpoint, which could result in arbitrary truncation?
Cheers, Joachim
# Get all GND mappings to persons
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
#
construct {
  ?gnd skos:exactMatch ?wd ;
       schema:about ?sitelink .
}
#select ?gndId ?wd ?wdLabel ?sitelink ?gnd
where {
  # get all wikidata items and labels linked to GND
  ?wd wdt:P227 ?gndId ;
      rdfs:label ?wdLabel ;
      # restrict to
      wdt:P31 wd:Q5 .    # instance of human
  # get site links (only from de/en wikipedia sites)
  ?sitelink schema:about ?wd ;
            schema:inLanguage ?language .
  filter (contains(str(?sitelink), 'wikipedia'))
  filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
  bind(uri(concat('http://d-nb.info/gnd/', ?gndId)) as ?gnd)
}
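As an aside, shell quoting and URL encoding of $ENCODED_QUERY are easy to get wrong; one way to rule that out is to let curl do the encoding. A minimal sketch, assuming the query above is saved in a file called mappings.rq (a hypothetical name) - this does not explain the truncation, but it removes encoding as a variable:

# Sketch: -G turns the --data-urlencode parameter into a GET query string,
# so curl performs the URL encoding; mappings.rq is a hypothetical file
# holding the CONSTRUCT query above.
curl -sG https://query.wikidata.org/bigdata/namespace/wdq/sparql \
  -H 'Accept: text/plain' \
  --data-urlencode "query=$(cat mappings.rq)" \
  -o /tmp/mappings.nt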
Hi!
I am trying to extract all mappings from Wikidata to the GND authority file, along with the corresponding Wikipedia pages, expecting roughly 500,000 to 1 million triples as a result.
As a starting note, I don't think extracting 1M triples may be the best way to use the query service. If you need to do processing that returns such big result sets - in the millions - maybe processing the dump - e.g. with the Wikidata Toolkit at https://github.com/Wikidata/Wikidata-Toolkit - would be a better idea?
However, with various calls, I get far fewer triples (about 2,000 to 10,000). The output seems to be truncated in the middle of a statement, e.g.
It may be some kind of timeout because of the quantity of data being sent. How long does such a request take?
Hoi, Two things. First, not everybody has the capacity to run an instance of the toolkit. And when there are other reasons for needing the toolkit as well - when the query service does not cope - it makes sense to have instances of the toolkit on Labs where queries like this can be run.
Your response is technical, and seriously: query is a tool and it should function for people. When the tool is not good enough, fix it. You cannot expect people to engage with the toolkit, because most queries are community incidentals and not part of a scientific endeavour. Thanks, GerardM
On 11.02.2016 10:17, Gerard Meijssen wrote:
Your response is technical, and seriously: query is a tool and it should function for people. When the tool is not good enough, fix it.
What I hear: "A hammer is a tool, it should work for people. Tearing down a building with it takes forever, so fix the hammer!"
The query service was never intended to run arbitrarily large or complex queries. Sure, would be nice, but that also means committing an arbitrary amount of resources to a single request. We don't have arbitrary amounts of resources.
We basically have two choices: either we offer a limited interface that only allows for a narrow range of queries to be run at all. Or we offer a very general interface that can run arbitrary queries, but we impose limits on time and memory consumption. I would actually prefer the first option, because it's more predictable, and doesn't get people's hopes up too far. What do you think?
Oh, and +1 for making it easy to use WDT on labs.
Hoi, What I hear is that the intentions were wrong, in that you did not anticipate people making actual, meaningful requests of it.
When you state "we have two choices", you imply that it is my choice as well. It is not. The answer that I am looking for is: yes, it does not function as we would like, we are working on it, and in the meantime we will ensure that the toolkit is available on Labs for the more complex queries.
Wikidata is a service and the service is in need of being better. Thanks, GerardM
On 11.02.2016 15:01, Gerard Meijssen wrote:
Hoi, What I hear is that the intentions were wrong, in that you did not anticipate people making actual, meaningful requests of it.
When you state "we have two choices", you imply that it is my choice as well. It is not. The answer that I am looking for is: yes, it does not function as we would like, we are working on it, and in the meantime we will ensure that the toolkit is available on Labs for the more complex queries.
Wikidata is a service and the service is in need of being better.
Gerard, do you realise how far away from technical reality your wishes are? We are far ahead of the state of the art in what we already have for Wikidata: two powerful live query services + a free toolkit for batch analyses + several Web APIs for live lookups. I know of no site of this scale that is anywhere near this in terms of functionality. You can always ask for more, but you should be a bit reasonable too, or people will just ignore you.
Markus
Hoi, Markus, when you read my reply to the original question you will see that my approach is different. The first thing that I pointed out was that a technical assumption has little to do with what people need. I indicated that when this is the approach, the answer is: fix it. The notion that a large number of results is outrageous is not of this time.
My approach was one where I even offered a possible solution, a crutch.
The approach Daniel took was to make me look ridiculous. His choice, not mine. I stayed polite and told him that his answers are not my answers, and why. The point that I make is that Wikidata is a service. It will increasingly be used for the most outrageous queries, and people will expect it to work - because why else do we put all this data in there? Why else is this the data hub for Wikipedia? Why else ...
Do appreciate that the aim of the WMF is to share in the sum of all available knowledge. When the current technology is what we have to make do with, fine for now. Say so, but do not ridicule me for saying that it is not good enough, it is not now and it will certainly not be in the future... Thanks, GerardM
On Thu, Feb 11, 2016 at 5:53 PM Gerard Meijssen gerard.meijssen@gmail.com wrote:
Gerard, it all boils down to using the right tool for the job. Nothing more - nothing less. Let's get back to making Wikidata rock.
Cheers Lydia
Hi Lydia,
I agree on using the right tool for the job. Yet, it isn’t always obvious what is right and what the limitations of a tool are.
For me, it's perfectly ok when a query runs for 20 minutes, when it spares me some hours of setting up a specific environment for one specific dataset (and doing it again when I need current data two months later). And it would be no issue if the query ran much longer in situations where it competes with several others. But of course, that's not what I want to experience when I use a Wikidata service to drive, e.g., an autosuggest function for selecting entities.
So, can you agree to Markus' suggestion that an experimental "unstable" endpoint could serve different use cases and expectations?
And do you think the policies and limitations of different access strategies could be documented? These could include a high-reliability interface for a narrow range of queries (as Daniel suggests as his preferred option). And on the other end of the spectrum, something that allows people to experiment freely. Finally, the latter kind of interface could allow new patterns of usage to evolve, with perhaps a few of them worthwhile to become part of an optimized, highly reliable query set.
I could imagine that such a documentation of (and perhaps discussion on) different options and access strategies, limitations and tradeoffs could address Gerard's call to give people what they need, or at least let them make informed choices when restrictions are unavoidable.
Cheers, Joachim
Hi Joachim,
Stas would be the right person to discuss service parameters and the possible setup of more servers with other parameters. He is part of the team at WMF who is in charge of the SPARQL ops.
You note that "it isn’t always obvious what is right and what the limitations of a tool are". I think this is the key point here. There is not enough experience with the SPARQL service yet to define very clear guidelines on what works and what doesn't. On this mailing list, we have frequently been reminded to use LIMIT in queries to make sure they terminate and don't overstress the server, but I guess this is not part of the official documentation you refer to. There was no decision against supporting bigger queries either -- it just did not come up as a major demand yet, since typical applications that use SPARQL so far require 10s to 1000s of results but not 100,000s to millions. To be honest, I would not have expected this to work so well in practice that it could be considered here. It is interesting to learn that you are already using SPARQL for generating custom data exports. It's probably not the most typical use of a query service, but at least the query language could support this usage in principle.
Cheers,
Markus
Hi!
For me, it's perfectly ok when a query runs for 20 minutes, when it spares me some hours of setting up a specific environment for one specific dataset (and doing it again when I need current data two months later). And it would be no issue if the query ran much longer in situations where it competes with several others. But of course, that's not what I want to experience when I use a Wikidata service to drive, e.g., an autosuggest function for selecting entities.
I understand that, but this is a shared server which is supposed to serve many users, and if we allowed 20-minute queries to run on this service, it would soon become unusable. This is why we have a 30-second limit on the server.
Now, we have considered having an option for the server or a setup that allows running longer queries, but currently we don't have one. It would require some budget allocation and work to make it happen, so it's not something we can have right now. There are use cases for very long queries and very large results; the current public service endpoint is just not good at serving them, because that's not what it was meant for.
And do you think the policies and limitations of different access strategies could be documented? These could include a high-reliability
I agree that the limitations had better be documented; the problem is we don't know everything we may need to document, such as "what are queries that may be bad". When I see something like "I want to download a million-row dataset" I know it's probably a bit too much. But I can't have a hard rule that says 1M-1 is ok, but 1M is too much.
preferred option). And on the other end of the spectrum, something that allows people to experiment freely. Finally, the latter kind of
I'm not sure how I could maintain an endpoint that would allow people to do anything they want and still provide an adequate experience for everybody. Maybe if we had infinite hardware resources... but we do not.
Otherwise, it is possible - and should not be extremely hard - to set up one's own instance of the Query Service and use it for experimenting with heavy lifting. Of course, that would require resources - but there's no magic here, it'd require resources from us too, both in terms of hardware and of people who would maintain it. So some things we can do now, some things we will be able to do later, and some things we probably will not be able to offer with any adequate quality.
Markus and others interested in this matter,
What about using OFFSET and LIMIT to address this problem? That's what we advise users of the DBpedia endpoint (and others we publish) to do.
We have to educate people about query implications and options. Beyond that, there is the issue of timeouts (which aren't part of the SPARQL spec), which can be used to produce partial results (signalled via HTTP headers), but that's something that comes after the basic scrolling functionality of OFFSET and LIMIT is understood.
[1] http://stackoverflow.com/questions/20937556/how-to-get-all-companies-from-db... [2] https://sourceforge.net/p/dbpedia/mailman/message/29172307/
On 13.02.2016 23:50, Kingsley Idehen wrote: ...
I think this does not help here. If I only ask for part of the data (see my previous email), I can get all 300K results in 9.3sec. The size of the result does not seem to be the issue. If I add further joins to the query, the time needed seems to go above 10sec (timeout) even with a LIMIT. Note that you need to order results for using LIMIT in a reliable way, since the data changes by the minute and the "natural" order of results would change as well. I guess with a blocking operator like ORDER BY in the equation, the use of LIMIT does not really save much time (other than for final result serialisation and transfer, which seems pretty quick).
Markus
Markus,
LIMIT isn't the key element in my example since all it does is set cursor size. It's the use of OFFSET to move the cursor through positions in the solution that's key here.
Fundamentally, this is about using HTTP GET requests to page through the data if a single query solution is either too large or its preparation exceeds underlying DBMS timeout settings.
Ultimately, developers have to understand these time-tested techniques for working with data.
Kingsley
Hi!
We basically have two choices: either we offer a limited interface that only allows for a narrow range of queries to be run at all. Or we offer a very general interface that can run arbitrary queries, but we impose limits on time and memory consumption. I would actually prefer the first option, because it's more predictable, and doesn't get people's hopes up too far. What do you think?
That would require implementing a pretty smart SPARQL parser... I don't think it's worth the investment of time. I'd rather put caps on runtime and maybe also on parallel queries per IP, to ensure fair access. We may also have a way to run longer queries - in fact, we'll need it anyway if we want to automate lists - but that is longer term; we'll need to figure out the infrastructure for that and how we allocate access.
Hoi, This is the kind of (technical) feedback that makes sense as it is centred on need. It acknowledges that more needs to be done as we are not ready for what we expect of ourselves in the first place.
In this day and age of big data, we are a very public place where a lot of initiatives gravitate to. If the WMF wants to retain its relevance, it has to face its challenges. Maybe the WDQS can steal a page from the architecture of what Magnus built. It is very much replicable and multiple instances have been running. This is not to say that it becomes more and more relevant to have the Wikidata toolkit available from Labs with as many instances as needed. Thanks, GerardM
On 12.02.2016 00:04, Stas Malyshev wrote:
That would require implementing a pretty smart SPARQL parser... I don't think it's worth the investment of time. I'd rather put caps on runtime and maybe also on parallel queries per IP, to ensure fair access. We may also have a way to run longer queries - in fact, we'll need it anyway if we want to automate lists - but that is longer term; we'll need to figure out the infrastructure for that and how we allocate access.
+1
Restricting queries syntactically to be "simpler" is what we did in Semantic MediaWiki (because MySQL did not support time/memory limits per query). It is a workaround, but it will not prevent long-running queries unless you make the syntactic restrictions really severe (and thereby forbid many simple queries, too). I would not do it if there is support for time/memory limits instead.
In the end, even the SPARQL engines are not able to predict reliably how complicated a query is going to be -- it's an important part of their work (for optimising query execution), but it is also very difficult.
Markus
12.02.2016, 10:43, Markus Krötzsch wrote:
Restricting queries syntactically to be "simpler" is what we did in Semantic MediaWiki (because MySQL did not support time/memory limits per query). It is a workaround, but it will not prevent long-running queries unless you make the syntactic restrictions really severe (and thereby forbid many simple queries, too). I would not do it if there is support for time/memory limits instead.
Would providing a Linked Data Fragments server [1] help here? It seems to be designed exactly for situations like this, where you want to provide a SPARQL query service over a large amount of linked data but are worried about server performance, particularly for complex, long-running queries. Linked Data Fragments pushes some of the heavy processing to the client side, which parses and executes the SPARQL queries.
Dynamically updating the data might be an issue here, but some of the server implementations support running on top of a SPARQL endpoint [2]. I think that from the perspective of the server this means that a heavy, long-running SPARQL query is broken up already on the client side into several small, simple SPARQL queries that are relatively easy to serve.
-Osma
[1] http://linkeddatafragments.org/
[2] https://github.com/LinkedDataFragments/Server.js#configure-the-data-sources
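To make the "many small requests" idea concrete, a Triple Pattern Fragments client issues plain HTTP GETs for one triple pattern at a time and joins the results locally. A rough sketch of a single such request - the host name is invented, and the subject/predicate/object parameter names are an assumption based on the reference Server.js implementation (real clients discover them from the fragment's hypermedia controls):

# Sketch: ask a hypothetical TPF server for all triples whose predicate is
# wdt:P227 (GND identifier); the response is one page of matching triples
# plus paging and count metadata.
curl -s -H 'Accept: text/turtle' \
  'http://ldf.example.org/wikidata?predicate=http%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2FP227'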
On 12.02.2016 10:01, Osma Suominen wrote:
There already is such a service for Wikidata (Cristian Consonni set it up a while ago). You could try whether the query works there. I think that such queries would be rather challenging for a server of this type, since they require you to iterate over almost all of the data client-side. Note that both "instance of human" and "has a GND identifier" are not very selective properties. In this sense, the queries may not be "relatively easy to serve" in this particular case.
Markus
It's great how this discussion evolves - thanks to everybody!
Technically, I completely agree that in practice it may prove impossible to predict the load a query will produce. Relational databases have invested years and years in query optimization (e.g., Oracle's cost-based optimizer, which relies on extended statistics gathered at runtime), and I can't see that similar investments are possible for triple stores.
What I could imagine for public endpoints is the SPARQL engine monitoring and prioritizing queries: the longer a query has already run, or the more resources it has already used, the lower the priority it is re-scheduled at (up to some final limit). But this is just a theoretical consideration; I'm not aware of any system that implements anything like this - and it could be implemented only in the engine itself.
For ZBW's SPARQL endpoints, I've implemented a much simpler three-level strategy, which does not involve the engine at all:
1. Endpoints which drive production-level services (e.g. autosuggest or retrieval enhancement functions). These endpoints run on separate machines and offer completely encapsulated services via a public API (http://zbw.eu/beta/econ-ws), without any direct SPARQL access.
2. Public "beta" endpoints (http://zbw.eu/beta/sparql). These offer unrestricted SPARQL access, but without any guarantees about performance or availability - though of course I do my best to keep them up and running. They run on their own virtual machine and should not hurt any other parts of the infrastructure when they get overloaded or out of control.
3. Public "experimental" endpoints. These include in particular an endpoint for the GND dataset with 130m triples. It was mainly created for internal use because (to the best of my knowledge) no other public GND endpoint exists. The endpoint is not linked from the GND pages of DNB, and I've advertised it very low-key on a few mailing lists. For these experimental endpoints, we reserve the right to shut them down for the public if they get flooded with more requests than they can handle.
It may be of interest that up to now, on none of these public endpoints have we come across issues with attacks or evil-minded queries (which were a matter of concern when I started this in 2009), nor with longer-lasting massive access. Of course, that is different for Wikidata, where the data is of interest to _much_ more people. But if at all affordable, I'd like to encourage offering some kind of experimental access with really wide limits in an "unstable" setting, in addition to the reliable services. For most people who just want to check out something, it's not an option to download the whole dataset and set up an infrastructure for it. For us, this was an issue even with the much smaller GND set.
The Linked Data Fragments approach Osma mentioned is very interesting (particularly the bit about setting it up on top of a regularly updated existing endpoint), and could provide another alternative, but I have not yet experimented with it.
Have a fine weekend - Joachim
Hi!
The Linked Data Fragments approach Osma mentioned is very interesting (particularly the bit about setting it up on top of a regularly updated existing endpoint), and could provide another alternative, but I have not yet experimented with it.
There is apparently this: https://github.com/CristianCantoro/wikidataldf - though I am not sure what its status is; I just found it.
In general, yes, I think checking out LDF may be a good idea. I'll put it on my todo list.
Hi,
you may want to check out the Linked Data Fragment server in Blazegraph: https://github.com/blazegraph/BlazegraphBasedTPFServer
Cheers, Peter
And here is another comment on this interesting topic :-)
I just realised how close the service is to answering the query. It turns out that you can in fact get the whole set of (currently >324,000) result items together with their GND identifiers as a download *within the timeout* (I tried several times without any errors). This is a 63M JSON result file with >640K individual values, and it downloads in no time on my home network. The query I use is simply this:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId where {
  ?item wdt:P227 ?gndId ;  # get gnd ID
        wdt:P31 wd:Q5 .    # instance of human
} ORDER BY ASC(?gndId) LIMIT 10
(don't run this in vain: even with the limit, the ORDER clause requires the service to compute all results every time someone runs this. Also be careful when removing the limit; your browser may hang on an HTML page that large; better use the SPARQL endpoint directly to download the complete result file.)
It seems that the timeout is only hit when adding more information (labels and wiki URLs) to the result.
So it seems that we are not actually very far away from being able to answer the original query even within the timeout. Certainly not as far away as I first thought. It might not be necessary at all to switch to a different approach (though it would be interesting to know how long LDF takes to answer the above -- our current service takes less than 10sec).
Cheers,
Markus
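Building on that observation, one workaround under the current timeout would be to download the plain item/GND list first and then resolve the Wikipedia sitelinks in small batches with a VALUES clause. A sketch of one such batch request - the two example items are arbitrary and the batch size is an illustrative assumption, not a tested recipe:

# Sketch: resolve de/en Wikipedia sitelinks for a small batch of items
# taken from the item/GND result; in practice one would loop over the
# full list in chunks of a few hundred items.
curl -sG https://query.wikidata.org/sparql \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode query='
PREFIX schema: <http://schema.org/>
SELECT ?item ?sitelink WHERE {
  VALUES ?item { <http://www.wikidata.org/entity/Q39963>
                 <http://www.wikidata.org/entity/Q42> }  # example items only
  ?sitelink schema:about ?item ;
            schema:inLanguage ?language .
  FILTER(CONTAINS(STR(?sitelink), "wikipedia"))
  FILTER(?language IN ("en", "de"))
}'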
P.S. For those who are interested, here is the direct link to the complete result (remove the line break [1]):
https: //query.wikidata.org/sparql?query=PREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0Aselect+%3Fitem+%3FgndId+where+{+%3Fitem+wdt%3AP227+%3FgndId+%3B+wdt%3AP31++wd%3AQ5+.+}+ORDER+BY+ASC%28%3FgndId%29&format=json
Markus
[1] Is the service protected against internet crawlers that find such links in the online logs of this email list? It would be a pity if we had to answer this query tens of thousands of times for many years to come, just to please some spiders who have no use for the result.
Hi!
[1] Is the service protected against internet crawlers that find such links in the online logs of this email list? It would be a pity if we had to answer this query tens of thousands of times for many years to come, just to please some spiders who have no use for the result.
That's a very good point. We currently do not have a robots.txt file on the service. We should have one; I'll fix it ASAP.
GUI links do not run the query until clicked, so they are safe from bots anyway. But direct links to the SPARQL endpoint do run the query (it's the API, after all :) so a robots.txt is needed there.
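For reference, a minimal robots.txt along those lines might look like the following - a hypothetical illustration only, the actual rules are up to the operators:

# Keep well-behaved crawlers away from the query-executing endpoints
# while leaving the GUI pages crawlable (hypothetical example).
User-agent: *
Disallow: /sparql
Disallow: /bigdata/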
For a page size of 10 (set by LIMIT), you can move through the solutions of Markus' query in offsets of 10:
First call:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId where {
  ?item wdt:P227 ?gndId ;  # get gnd ID
        wdt:P31 wd:Q5 .    # instance of human
} ORDER BY ASC(?gndId) OFFSET 10 LIMIT 10

Next call:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId where {
  ?item wdt:P227 ?gndId ;  # get gnd ID
        wdt:P31 wd:Q5 .    # instance of human
} ORDER BY ASC(?gndId) OFFSET 20 LIMIT 10

Subsequent calls:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId where {
  ?item wdt:P227 ?gndId ;  # get gnd ID
        wdt:P31 wd:Q5 .    # instance of human
} ORDER BY ASC(?gndId) OFFSET {last-offset-plus-10} LIMIT 10
Remember, you simply change the OFFSET value in the SPARQL HTTP URL.
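A command-line sketch of driving that pattern automatically - the page size, output file and empty-page stop condition are arbitrary choices, CSV output is assumed to be available, and, as Markus notes, every page still pays for the ORDER BY:

#!/usr/bin/env bash
# Sketch: fetch the ordered result page by page until an empty page
# signals the end.
ENDPOINT=https://query.wikidata.org/sparql
PAGE=10000
OFFSET=0
while : ; do
  QUERY="PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?gndId WHERE {
  ?item wdt:P227 ?gndId ;
        wdt:P31 wd:Q5 .
} ORDER BY ASC(?gndId) OFFSET $OFFSET LIMIT $PAGE"
  ROWS=$(curl -sG "$ENDPOINT" -H 'Accept: text/csv' \
           --data-urlencode "query=$QUERY" | tail -n +2)   # drop CSV header
  [ -z "$ROWS" ] && break                                  # empty page: done
  printf '%s\n' "$ROWS" >> gnd-items.csv
  OFFSET=$((OFFSET + PAGE))
done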
To clarify: I just added the LIMIT to prevent unwary readers from killing their browser on a 100MB HTML result page. The server does not need it at all and can give you all results at once. Online applications may still want to scroll results, I agree, but for the OP it would be more useful to just download one file here.
Markus
Scrolling or paging through query solutions is a technique that benefits both clients and servers. Understanding the concept has to be part of the narrative for working with SPARQL query solutions.
This is about flexibility through using the full functionality of SPARQL; most developers and users simply execute queries without factoring in these techniques or the impact of their queries on other users of the system.
Hi Markus,
Great that you checked that out. I can confirm that the simplified query worked for me, too. It took 15.6s and returned roughly the same number of results (323,789).
When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an endpoint for "economics-related" persons, it matched 36,050 persons (supposedly the "most important" 8 percent of our set).
What I normally would do to get the corresponding Wikipedia site URLs is a query against the Wikidata endpoint, which references the relevant Wikidata URIs via a "service" clause:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>
#
construct {
  ?gnd schema:about ?sitelink .
}
where {
  service <http://zbw.eu/beta/sparql/econ_pers/query> {
    ?gnd skos:prefLabel [] ;
         skos:exactMatch ?wd .
    filter(contains(str(?wd), 'wikidata'))
  }
  ?sitelink schema:about ?wd ;
            schema:inLanguage ?language .
  filter (contains(str(?sitelink), 'wikipedia'))
  filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
}
This, however, results in a Java error.
If "service" clauses are supposed to work on the Wikidata endpoint, I'd happily provide additional details in Phabricator.
For now, I'll get the data via your Java example code :)
Cheers, Joachim
-----Ursprüngliche Nachricht----- Von: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] Im Auftrag von Markus Kroetzsch Gesendet: Samstag, 13. Februar 2016 22:56 An: Discussion list for the Wikidata project. Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated
And here is another comment on this interesting topic :-)
I just realised how close the service is to answering the query. It turns out that you can in fact get the whole set (currently >324000 result items) together with their GND identifiers as a download *within the timeout* (I tried several times without any errors). This is a 63M JSON result file with >640K individual values, and it downloads in no time on my home network. The query I use is simply this:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId where {
  ?item wdt:P227 ?gndId ;  # get gnd ID
        wdt:P31 wd:Q5 .    # instance of human
}
ORDER BY ASC(?gndId) LIMIT 10
(don't run this in vain: even with the limit, the ORDER clause requires the service to compute all results every time someone runs this. Also be careful when removing the limit; your browser may hang on an HTML page that large; better use the SPARQL endpoint directly to download the complete result file.)
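To download the complete result file directly, a curl call along these lines should work (a sketch: the endpoint URL is the one from the first mail, while the Accept header and the output file name are assumptions):

#!/usr/bin/env bash
# Fetch the full result set in one go instead of rendering it in the browser.
QUERY='PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select ?item ?gndId where {
  ?item wdt:P227 ?gndId ;
        wdt:P31 wd:Q5 .
} ORDER BY ASC(?gndId)'

curl --silent -G 'https://query.wikidata.org/bigdata/namespace/wdq/sparql' \
     --data-urlencode "query=$QUERY" \
     -H 'Accept: application/sparql-results+json' \
     -o gnd-mappings.json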
It seems that the timeout is only hit when adding more information (labels and wiki URLs) to the result.
So it seems that we are not actually very far away from being able to answer the original query even within the timeout. Certainly not as far away as I first thought. It might not be necessary at all to switch to a different approach (though it would be interesting to know how long LDF takes to answer the above -- our current service takes less than 10sec).
Cheers,
Markus
On 13.02.2016 11:40, Peter Haase wrote:
Hi,
you may want to check out the Linked Data Fragment server in Blazegraph: https://github.com/blazegraph/BlazegraphBasedTPFServer
Cheers, Peter
On 13.02.2016, at 01:33, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
The Linked data fragments approach Osma mentioned is very interesting (particularly the bit about setting it up on top of a regularly updated existing endpoint), and could provide another alternative, but I have not yet experimented with it.
There is apparently this: https://github.com/CristianCantoro/wikidataldf though I am not sure what its status is - I just found it.
In general, yes, I think checking out LDF may be a good idea. I'll put it on my todo list.
-- Stas Malyshev smalyshev@wikimedia.org
-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/
Hi Joachim,
I think SERVICE queries should be working, but maybe Stas knows more about this. Even if they are disabled, this should result in some error message rather than in a NullPointerException. Looks like a bug.
Markus
Thanks Markus, I've created https://phabricator.wikimedia.org/T127070 with the details.
Hi!
you may want to check out the Linked Data Fragment server in Blazegraph: https://github.com/blazegraph/BlazegraphBasedTPFServer
Thanks, I will check it out!
Hi Joachim,
Here is a short program that solves your problem:
https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/exampl...
It is in Java, so you need that (and Maven) to run it, but that's the only technical challenge ;-). You can run the program in various ways as described in the README:
https://github.com/Wikidata/Wikidata-Toolkit-Examples
The program I wrote puts everything into a CSV file, but you can of course also write RDF triples if you prefer this, or any other format you wish. The code should be easy to modify.
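As an illustration of such a change done outside the program, a CSV file with the GND ID in the first column and the Wikidata URI in the second (a hypothetical layout, to be adapted to the actual output) could be rewritten into N-Triples like this:

#!/usr/bin/env bash
# Hypothetical post-processing: turn a two-column CSV (GND ID, Wikidata URI)
# into N-Triples. Assumes the fields contain no commas or quotes; the file
# name and column positions must be adapted to what the program really writes.
tail -n +2 results.csv | awk -F',' '{
  printf "<http://d-nb.info/gnd/%s> <http://www.w3.org/2004/02/skos/core#exactMatch> <%s> .\n", $1, $2
}' > gnd-mappings.nt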
On a first run, the tool will download the current Wikidata dump, which takes a while (it's about 6G), but after this you can find and serialise all results in less than half an hour (for a processing rate of around 10K items/second). A regular laptop is enough to run it.
Cheers,
Markus
Hi Markus,
thank you very much, your code will be extremely helpful for solving my current need. And though I am not a Java programmer, I may even be able to adjust it to similar queries.
On the other hand, it's a few steps away from the promises of Linked Data and SPARQL endpoints. I greatly value the wikidata endpoint for having the current data: if I add some bit in the user interface, I can query for it immediately afterwards, and I can do this in a uniform way via standard SPARQL queries. I can imagine how hard that was to achieve.
And I completely agree that it's impossible to build a SPARQL endpoint which reliably serves arbitrarily complex queries for multiple users in finite time. (This is the reason why all our public endpoints at http://zbw.eu/beta/sparql/ are labeled beta.) And you can easily get to a point where some ill-behaved query is run over and over again by some stupid program, and you have to be quite restrictive to keep your service up.
So an "unstable" endpoint with wider limits, as you suggested in your later mail, could be a great solution for this. In both instances, it would be nice if the policy and the actual limits could be documented, so users would know what to expect (and how to act appropriate as good citizens).
Thanks again for the code, and for taking up the discussion.
Cheers, Joachim
Hi Stas,
Thanks for your answer. You asked how long the query runs: 8.21 sec (having processed 6443 triples) in an example invocation. If roughly linear, that could mean 800-1500 sec for the whole set. However, I would expect a clearly shorter runtime: I routinely use queries of similar complexity and result sizes on ZBW's public endpoints. One arbitrarily selected query which extracts data from GND runs for less than two minutes to produce 1.2m triples.
Given the size of Wikidata, I wouldn't consider such a use abusive. Of course, if you have lots of competing queries and resources are limited, it is completely legitimate to implement some policy which formulates limits and enforces them technically (throttle down long-running queries, or limit the number of produced triples, or the execution time, or whatever seems reasonable and can be implemented).
Anyway, in this case (truncation in the middle of a statement), it looks much more like some technical bug (or an obscure timeout somewhere along the way). The execution time and the result size vary widely:
 5.44s   empty result
 8.60s   2090 triples
 5.44s   empty result
22.70s   27352 triples
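Such timings and sizes can be collected, for instance, with a small curl loop (a sketch; it assumes $ENCODED_QUERY holds the URL-encoded query from the first mail):

#!/usr/bin/env bash
# Run the same request a few times and record total time and downloaded size
# via curl's --write-out variables.
for i in 1 2 3 4 5; do
  curl --silent -H 'Accept: text/plain' \
       -w "run $i: %{time_total}s, %{size_download} bytes\n" \
       -o "/tmp/mappings-$i.nt" \
       "https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=$ENCODED_QUERY"
done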
Can you reproduce this kind of result with the given query, or with other supposedly longer-running queries?
Thanks again for looking into this.
Cheers, Joachim
PS. I plan to set up my own Wikidata SPARQL endpoint to do more complex things, but that depends on a new machine which will be available in a few months. For now, I'd just like to know which of "our" persons (economists and the like) have Wikipedia pages.
PPS. From my side, I would have much preferred to build a query which asks for exactly the GND IDs I'm interested in (about 430,000 out of millions of GNDs). This would have led to a much smaller result - but I cannot squeeze that query into a GET request ...
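One way around the GET length limit, sketched below, would be to send the query by POST, which the SPARQL 1.1 protocol allows and Blazegraph-based endpoints generally accept; the gnd-ids.txt file, the batch size, and the Accept header are assumptions for illustration, and a 430,000-ID list would still need to be split into several such batches:

#!/usr/bin/env bash
# Sketch: POST a query with a large VALUES block instead of squeezing it into
# a GET URL. gnd-ids.txt is assumed to contain one GND ID per line.
VALUES=$(head -n 10000 gnd-ids.txt | sed 's/.*/"&"/' | tr '\n' ' ')

QUERY="PREFIX wdt: <http://www.wikidata.org/prop/direct/>
select ?item ?gndId where {
  VALUES ?gndId { $VALUES }
  ?item wdt:P227 ?gndId .
}"

curl --silent 'https://query.wikidata.org/bigdata/namespace/wdq/sparql' \
     --data-urlencode "query=$QUERY" \
     -H 'Accept: text/csv' \
     -o matched-gnd-ids.csv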
Hi Joachim,
I think the problem is not to answer your query in 5min or so (Wikidata Toolkit on my laptop takes 27min without a database, by simply parsing the whole data file, so any database that already has the data should be much faster). The bigger issue is that you would have to configure the site to run for 5min before timeout. This would mean that other queries that never terminate (because they are really hard) can also run for at least this time. It seems that this could easily cause the service to break down.
Maybe one could have an "unstable" service on a separate machine that does the same as WDQS but with a much more liberal timeout and less availability (if it's overloaded a lot, it will just be down more often, but you would know when you use it that this is the deal).
Cheers,
Markus