From Stas' answer to https://phabricator.wikimedia.org/T127070 I learned the Wikidata Query Service does not "allow external federated queries ... for security reasons (it's basically open proxy)."
Now, obviously, endpoints referenced in a federated query via a service clause have to be open - so any attacker could send his queries directly instead of squeezing them through some other endpoint. The only scenario I can think of is that an attacker's IP is already blocked by the attacked site. If (instead of the much more common ways of faking an IP) the attacker chose to do it via federated queries through WDQS, this _could_ result in WDQS being blocked by that endpoint.
This is a quite unlikely scenario - in the seven years I have been on SPARQL mailing lists, I cannot remember this kind of attack ever having been reported - but of course it is legitimate to secure production environments against any conceivable attack vector.
However, I think it should be possible to query Wikidata with this kind of query. Federated SPARQL queries are a basic building block for Linked Open Data, and blocking them breaks many uses Wikidata could provide for the linked data cloud. This need not involve the highly protected production environment, but could be solved by an additional unstable/experimental endpoint under another address.
As an additional illustrating argument: there is an immense difference between referencing something in a service clause and getting a result in a few seconds, versus having to use the Wikidata Toolkit. To get the initial query for this thread answered by the example program Markus kindly provided at https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/exampl... (and which worked perfectly - thanks again!), it took me
- more than five hours to download the dataset (in my work environment, wired to the DFN network)
- 20 min to execute the query
- considerable time to fiddle with the Java code if I had to adapt the query (+ another 20 min to execute it again)
For many parts of the world, or even for users in Germany with a slow DSL connection, the first point alone would prohibit any use. And even with a good internet connection, a new or occasional user would quite probably turn away when offered this procedure instead of getting a "normal" LOD-conformant query answered in a few seconds.
Again, I very much value your work and your determination to set up a service with very high availability and performance. Please make the great Wikidata LOD available in less demanding settings, too. It should be possible for users to run more advanced SPARQL queries for LOD uses in an environment where you cannot guarantee a high level of reliability.
Cheers, Joachim
-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Neubert, Joachim
Sent: Tuesday, 16 February 2016 15:48
To: 'Discussion list for the Wikidata project.'
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated
Thanks Markus, I've created https://phabricator.wikimedia.org/T127070 with the details.
-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Markus Krötzsch
Sent: Tuesday, 16 February 2016 14:57
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi Joachim,
I think SERVICE queries should be working, but maybe Stas knows more about this. Even if they are disabled, this should result in some error message rather than a NullPointerException. Looks like a bug.
Markus
On 16.02.2016 13:56, Neubert, Joachim wrote:
Hi Markus,
Great that you checked that out. I can confirm that the simplified query worked for me, too. It took 15.6 s and returned roughly the same number of results (323789).
When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an endpoint for "economics-related" persons, it matched 36050 persons (supposedly the "most important" 8 percent of our set).
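Roughly, the matching boils down to something like this (a sketch, assuming the downloaded (item, gndId) pairs were loaded into the endpoint as wdt:P227 triples - not the exact script I ran):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# sketch: count our persons whose GND identifier appears in the
# Wikidata result set; GND URIs embed the identifier,
# e.g. http://d-nb.info/gnd/118540238
SELECT (COUNT(DISTINCT ?gnd) AS ?matched)
WHERE {
  ?item wdt:P227 ?gndId .
  ?gnd skos:prefLabel [] .
  FILTER(STRENDS(STR(?gnd), CONCAT('/', ?gndId)))
}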
What I would normally do to get the corresponding Wikipedia site URLs is a query against the Wikidata endpoint which references the relevant Wikidata URIs via a "service" clause:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>

construct {
  ?gnd schema:about ?sitelink .
}
where {
  service <http://zbw.eu/beta/sparql/econ_pers/query> {
    ?gnd skos:prefLabel [] ;
         skos:exactMatch ?wd .
    filter(contains(str(?wd), 'wikidata'))
  }
  ?sitelink schema:about ?wd ;
            schema:inLanguage ?language .
  filter(contains(str(?sitelink), 'wikipedia'))
  # note: ?wdLabel is not bound anywhere in this query as posted
  filter(lang(?wdLabel) = ?language && ?language in ('en', 'de'))
}
This, however, results in a Java error.
If "service" clauses are supposed to work on the Wikidata endpoint, I'd happily provide additional details in Phabricator.
For now, I'll get the data via your Java example code :)
Cheers, Joachim
-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Markus Kroetzsch
Sent: Saturday, 13 February 2016 22:56
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated
And here is another comment on this interesting topic :-)
I just realised how close the service is to answering the query. It turns out that you can in fact get the whole result set (currently >324,000 items) together with their GND identifiers as a download *within the timeout* (I tried several times without any errors). This is a 63 MB JSON result file with >640K individual values, and it downloads in no time on my home network. The query I use is simply this:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ;   # get GND ID
        wdt:P31 wd:Q5 .     # instance of human
}
ORDER BY ASC(?gndId)
LIMIT 10
(Don't run this in vain: even with the limit, the ORDER clause requires the service to compute all results every time someone runs it. Also be careful when removing the limit; your browser may hang on an HTML page that large - better use the SPARQL endpoint directly to download the complete result file.)
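For a quick sample, a variant without the ORDER BY should be much cheaper, since the LIMIT then lets the service stop after the first matches (an untested sketch):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# untested sketch: without ORDER BY, the service can stop after the
# first 10 matches instead of computing the full result set
select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ;
        wdt:P31 wd:Q5 .
}
LIMIT 10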
It seems that the timeout is only hit when adding more information (labels and wiki URLs) to the result.
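For reference, the fuller query I have in mind looks roughly like this (a sketch - the exact label handling in the original query may differ):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>

# sketch of the fuller query that currently runs into the timeout:
# the same items as above, plus English labels and Wikipedia site links
select ?item ?itemLabel ?gndId ?sitelink
where {
  ?item wdt:P227 ?gndId ;
        wdt:P31 wd:Q5 ;
        rdfs:label ?itemLabel .
  ?sitelink schema:about ?item .
  filter(lang(?itemLabel) = 'en')
  filter(contains(str(?sitelink), 'wikipedia'))
}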
So it seems that we are not actually very far away from being able to answer the original query within the timeout - certainly not as far away as I first thought. It might not be necessary at all to switch to a different approach (though it would be interesting to know how long LDF would take to answer the above - our current service takes less than 10 seconds).
Cheers,
Markus
On 13.02.2016 11:40, Peter Haase wrote:
Hi,
you may want to check out the Linked Data Fragment server in Blazegraph: https://github.com/blazegraph/BlazegraphBasedTPFServer
Cheers, Peter
On 13.02.2016, at 01:33, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
> The Linked Data Fragments approach Osma mentioned is very interesting (particularly the bit about setting it up on top of a regularly updated existing endpoint), and could provide another alternative, but I have not yet experimented with it.
There is apparently this: https://github.com/CristianCantoro/wikidataldf though I'm not sure what its status is - I just found it.
In general, yes, I think checking out LDF may be a good idea. I'll put it on my todo list.
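As I understand it, a TPF server only ever answers single triple patterns, returned in pages, with joins evaluated client-side - so the load per request stays small. In SPARQL terms, each request corresponds to something like this (my rough sketch of the idea, not an actual TPF call):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# a TPF server serves paged answers to one triple pattern at a time;
# clients combine such fragments to evaluate full SPARQL queries
select ?item ?gndId
where { ?item wdt:P227 ?gndId }
LIMIT 100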
--
Stas Malyshev
smalyshev@wikimedia.org
--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
Hi Stas,
Thanks for your explanation! I perhaps have to do some tests on my own systems ...
Cheers, Joachim
-----Original Message-----
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Stas Malyshev
Sent: Thursday, 18 February 2016 19:12
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated
Hi!
> Now, obviously, endpoints referenced in a federated query via a service clause have to be open - so any attacker could send his queries directly instead of squeezing them through some other endpoint. The only scenario I can think of is that an attacker's IP is already blocked by the attacked site. If (instead of the much more common ways of faking an IP) the attacker chose to do it via federated queries through WDQS, this _could_ result in WDQS being blocked by that endpoint.
This is not what we are concerned with. What we are concerned with is that federation essentially requires you to run an open proxy - i.e., to allow anybody to send requests to any URL. This is not acceptable to us, because it means somebody could abuse it both to try to access our internal infrastructure and to launch attacks against other sites using our site as a platform.
We could, if there is enough demand, allow access to specific whitelisted endpoints, but so far we haven't found any way to allow access to arbitrary SPARQL endpoints without essentially allowing anybody to launch arbitrary network connections from our servers.
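To illustrate the point (the address below is made up): nothing in SPARQL itself restricts what a SERVICE clause may point at, so an unrestricted federating endpoint will attempt a connection to whatever URL is named there:

# made-up address, for illustration only: an unrestricted federating
# endpoint would open a connection to any URL named in SERVICE,
# including hosts on an internal network
SELECT * WHERE {
  SERVICE <http://10.0.0.5:8080/internal-service> { ?s ?p ?o }
}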
> [...] provide for the linked data cloud. This need not involve the highly protected production environment, but could be solved by an additional unstable/experimental endpoint under another address.
The problem is that we cannot run a production-quality endpoint in a non-production environment. We could set up an endpoint on Labs, but that endpoint would be underpowered and we would not be able to guarantee any quality of service there. To serve the amount of Wikidata data and updates, the machines need certain hardware capabilities, which Labs machines currently do not have.
Additionally, I'm not sure running an open proxy even there would be a good idea. Unfortunately, in today's internet environment there is no lack of players who would want to abuse such a thing for nefarious purposes.
We will keep looking for a solution to this, but so far we haven't found one.
Thanks,
--
Stas Malyshev smalyshev@wikimedia.org