Hi!
I was asked about getting access to query logs for Wikidata Query Service, for research purposes. So I'd like to start the discussion on it, specifically:
1. Can we do it at all - technically, legally, privacy-wise? (note we're talking about SPARQL query text only, no other information to be provided)
2. Are there any considerations why we may want *not* to do it even if we could?
3. How hard would it be to make such an export, and do we have any existing infrastructure that should be used for this?
All ideas/comments about providing (or not providing :) access to this data are welcome.
My immediate reaction is that queries might contain PII (Personally Identifiable Information), and thus would not be shareable. I'm open to other thoughts, of course.
Kevin Smith Agile Coach, Wikimedia Foundation
On Thu, Jan 14, 2016 at 12:37 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
I was asked about getting access to query logs for Wikidata Query Service, for research purposes. So I'd like to start the discussion on it, specifically:
- Can we do it at all - technically, legally, privacy-wise? (note we're talking about SPARQL query text only, no other information to be provided)
- Are there any considerations why we may want *not* to do it even if we could?
- How hard would it be to make such an export, and do we have any existing infrastructure that should be used for this?
All ideas/comments about providing (or not providing :) access to this data are welcome. -- Stas Malyshev smalyshev@wikimedia.org
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
Hi!
My immediate reaction is that queries might contain PII (Personally Identifiable Information), and thus would not be shareable. I'm open to other thoughts, of course.
I thought about it but I don't see how. Wikidata does not contain any PII and the only thing you can query for is what is in Wikidata. We're talking about SPARQL only, not the full log with IPs, browser fingerprints, etc. of course.
If I SPARQL search my address, what then? ;)
On 14 January 2016 at 12:51, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
My immediate reaction is that queries might contain PII (Personally Identifiable Information), and thus would not be shareable. I'm open to other thoughts, of course.
I thought about it but I don't see how. Wikidata does not contain any PII and the only thing you can query for is what is in Wikidata. We're talking about SPARQL only, not the full log with IPs, browser fingerprints, etc. of course.
-- Stas Malyshev smalyshev@wikimedia.org
Hi!
If I SPARQL search my address, what then? ;)
You get a very ugly Java exception :) Also, in the logs it won't be *your* address, it would be *an* address. The fact this address exists is already public. The fact that *you* live there is not, but SPARQL logs would not disclose that.
I see Kevin scooped me... but that's okay.
Even though you would expect SPARQL query logs to contain SPARQL queries, I wouldn't be shocked if there was other stuff in there.
Unlike generic search queries, you can validate that SPARQL queries are well formed and only share the well formed ones. I quickly found an online SPARQL validator; there's probably a repo somewhere on GitHub with one we could use. Just a thought.
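The filtering idea above could be sketched roughly as follows. This is only an illustration, not an existing export tool: the function names `rdflib_accepts` and `keep_well_formed` are hypothetical, and the choice of rdflib's SPARQL parser as the validator is an assumption (any other validator with the same shape could be passed in instead). Note that being well formed only filters out junk; a valid query can still carry arbitrary text in a string literal, so this step alone would not settle the PII question.

```python
from typing import Callable, Iterable, List


def rdflib_accepts(query: str) -> bool:
    """Return True if rdflib's SPARQL parser accepts the query text.

    Assumption: rdflib is installed. It is imported lazily so that the
    filter below can also be used with a different validator.
    """
    from rdflib.plugins.sparql.parser import parseQuery

    try:
        parseQuery(query)
    except Exception:
        # Any parse failure means the entry is not well-formed SPARQL.
        return False
    return True


def keep_well_formed(
    queries: Iterable[str],
    is_well_formed: Callable[[str], bool],
) -> List[str]:
    """Keep only the entries the validator accepts, in their original order."""
    return [q for q in queries if is_well_formed(q)]
```

With rdflib available, something like `keep_well_formed(log_lines, rdflib_accepts)` would drop entries that do not parse (free text typed into the query form, for example) and keep the rest unchanged.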
How easy is it to encode/include PII in a valid SPARQL query? Hmmm.
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Thu, Jan 14, 2016 at 12:52 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
How easy is it to encode/include PII in a valid SPARQL query? Hmmm.
Well, theoretically you can just type your home address into the query form GUI and send it, but I don't think anybody does that :) At least not often.
But that wouldn't be valid SPARQL, so a validator could be used to filter out a lot of junk.
You should also refer to the Discovery data access guidelines Oliver has been working on. (On the Office wiki, so they are WMF internal at the moment.)
Hi!
You should also refer to the Discovery data access guidelines Oliver has been working on. (On the Office wiki, so they are WMF internal at the moment.)
Ah, that's a good suggestion - and maybe we could publish some of that? I'm not sure if mentioning hostnames etc. should be public (though they are probably on wikitech anyway) but at least guidelines seem to be something we could publish?
CC'ing our legal contact Stephen to help you think about this.
Please pull in Oliver as well as he's thought broadly about this for Discovery.
--tomasz
For the benefit of those on this list, there was a follow-up IRC conversation, in which the outcome was, briefly:
SPARQL queries are user queries, and thus would be subject to the same rules as search queries, which means that they are not released except under NDA and with good rationale.
I'm not a lawyer, so that is just my very high-level summary of the conversation.
Kevin Smith Agile Coach, Wikimedia Foundation
On Thu, Jan 14, 2016 at 1:47 PM, Tomasz Finc tfinc@wikimedia.org wrote:
CC'ing our legal contact Stephen to help you think about this.
Please pull in Oliver as well as he's thought broadly about this for Discovery.
--tomasz
The task that tracks our work to make our data access guidelines public is T123673 https://phabricator.wikimedia.org/T123673.
To be clear, this task is for making the guidelines themselves public, *not* the data. But, reasoning about making the data public is easier once the guidelines are themselves public. :-)
Dan
On 14 January 2016 at 13:52, Kevin Smith ksmith@wikimedia.org wrote:
For the benefit of those on this list, there was a follow-up IRC conversation, in which the outcome was, briefly:
SPARQL queries are user queries, and thus would be subject to the same rules as search queries, which means that they are not released except under NDA and with good rationale.
I'm not a lawyer, so that is just my very high-level summary of the conversation.
Kevin Smith Agile Coach, Wikimedia Foundation