Hi,
Is the reader's click log data (containing user ID/IP, article title, and timestamp) available for Wikipedia?
with regards
Ditty
Afraid not. First, we do not have some of those datapoints; we do not currently have unique user IDs. And, second, it would be a tremendous ethical violation for us to release that data that we /do/ have (IP addresses, for example).
On 28 December 2014 at 21:00, Ditty Mathew dittyvkm@gmail.com wrote:
Hi,
Is the reader's click log data (containing user ID/IP, article title, and timestamp) available for Wikipedia?
with regards
Ditty
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
The exact user information is not needed. The anonymized data is enough. What exactly we need is the navigation path of Wikipedia readers.
with regards
Ditty
On Sun, Dec 28, 2014 at 9:46 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Afraid not. First, we do not have some of those datapoints; we do not currently have unique user IDs. And, second, it would be a tremendous ethical violation for us to release that data that we /do/ have (IP addresses, for example).
-- Oliver Keyes Research Analyst Wikimedia Foundation
I'm not exactly sure how one provides an anonymised dataset that contains IP addresses. But:
We don't have those navigation paths and so can't provide them. Sure, we could provide the {referer, URL} tuples associated with specific IP addresses, and replace the IP with some kind of randomly-generated value (or just a salted hash) but this falls apart very quickly with the modern structure of the internet and the scale Wikimedia properties operate on: you can have a lot of distinct people at one IP address, particularly through cellular networks, and so multiple sessions and trails can get inaccurately grouped together. More importantly, the HTTPS protocol involves either sanitising or completely stripping referers, rendering those chains impossible to reconstruct.
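To make the shared-IP problem concrete, here is a minimal sketch of the "replace the IP with a salted hash" idea described above. It is an illustration only, not Wikimedia's pipeline: the function names, the log layout, and the sample requests are all made up, and a keyed hash (HMAC-SHA256) stands in for the salted hash.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict

# Hypothetical secret salt. It would have to stay private (and ideally be
# discarded afterwards), otherwise the small IPv4 space could be brute-forced.
SALT = secrets.token_bytes(32)


def pseudonymise(ip, salt=SALT):
    """Replace an IP address with a truncated keyed hash (HMAC-SHA256)."""
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def group_hops(requests, salt=SALT):
    """Group (referer, url) hops by pseudonymised IP.

    `requests` is an iterable of (ip, referer, url) tuples in time order.
    """
    hops = defaultdict(list)
    for ip, referer, url in requests:
        hops[pseudonymise(ip, salt)].append((referer, url))
    return dict(hops)


# Made-up requests: two different readers behind one shared (e.g. cellular) IP.
log = [
    ("203.0.113.7", None, "/wiki/Cat"),              # reader A
    ("203.0.113.7", None, "/wiki/Paris"),            # reader B
    ("203.0.113.7", "/wiki/Cat", "/wiki/Lion"),      # reader A
    ("203.0.113.7", "/wiki/Paris", "/wiki/France"),  # reader B
    ("198.51.100.2", None, "/wiki/Dog"),
]
trails = group_hops(log)
```

Both readers behind 203.0.113.7 end up under a single pseudonym with four interleaved hops, so nothing in the output says that there were two separate trails.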
I believe Leila Zia and Bob West (who will hopefully see this message. I know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary. But generally-speaking: we do not generate this data as a matter of course, we would not be comfortable releasing it (unless exceedingly sanitised), and as the person who deals with our request logs on a day-to-day basis I can think of a half-dozen ways in which it would produce false results (ways we have no real way of checking the probability of occurring).
On 28 December 2014 at 22:53, Ditty Mathew dittyvkm@gmail.com wrote:
The exact user information is not needed. The anonymized data is enough. What exactly we need is the navigation path of Wikipedia readers.
with regards
Ditty
On Dec 28, 2014 11:35 PM, "Oliver Keyes" okeyes@wikimedia.org wrote:
More importantly, the HTTPS protocol involves either sanitising or completely stripping referers, rendering those chains impossible to reconstruct.
Could you elaborate? (we're talking about hops from one page to another within the same domain name?)
More generally: what is the status of Hadoop? Could we potentially have third-party users get access, even if they can't sign an NDA, by writing their own MapReduce jobs to support their research? Depending on the job, maybe it would need legal (LCA) review before releasing results, or maybe some could be reviewed by others (approved by LCA).
We could give researchers (all labs users?) access to a truly sanitized dataset with the right format for use when designing jobs. Or maybe not sanitized but filtered to requests for just a few users that volunteered to release their data for X days.
-Jeremy
On 28 December 2014 at 23:56, Jeremy Baron jeremy@tuxmachine.com wrote:
On Dec 28, 2014 11:35 PM, "Oliver Keyes" okeyes@wikimedia.org wrote:
More importantly, the HTTPS protocol involves either sanitising or completely stripping referers, rendering those chains impossible to reconstruct.
Could you elaborate? (we're talking about hops from one page to another within the same domain name?)
See, this is why I shouldn't reply to emails when I'm already in bed; I end up being wrong ;p. You're right - this shouldn't be a substantial issue with this specific request (in-domain chains) although it is something we're worried about for other types of chaining.
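As a rough illustration of why in-domain chains are not affected: under the long-standing default browser behaviour, the Referer header is dropped only when navigating from an HTTPS page to a plain-HTTP one, so hops between two HTTPS pages keep it. The helper name below is made up for illustration, and the sketch ignores stricter referrer policies that a site or browser may set.

```python
from urllib.parse import urlsplit


def referer_sent_by_default(from_url, to_url):
    """Return True if a browser following the classic default rule would
    send a Referer header for a navigation from `from_url` to `to_url`.

    Only the HTTPS -> HTTP downgrade case suppresses the header here.
    """
    src = urlsplit(from_url).scheme.lower()
    dst = urlsplit(to_url).scheme.lower()
    return not (src == "https" and dst == "http")


in_domain = referer_sent_by_default(
    "https://en.wikipedia.org/wiki/Cat",
    "https://en.wikipedia.org/wiki/Lion",
)
downgrade = referer_sent_by_default(
    "https://en.wikipedia.org/wiki/Cat",
    "http://example.org/",
)
```

Here `in_domain` is True and `downgrade` is False, which matches the correction above: the header loss matters for chains that leave the site over plain HTTP, not for hops within it.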
More generally: what is the status of Hadoop? Could we potentially have third-party users get access, even if they can't sign an NDA, by writing their own MapReduce jobs to support their research? Depending on the job, maybe it would need legal (LCA) review before releasing results, or maybe some could be reviewed by others (approved by LCA).
We could give researchers (all labs users?) access to a truly sanitized dataset with the right format for use when designing jobs. Or maybe not sanitized but filtered to requests for just a few users that volunteered to release their data for X days.
Hive/Hadoop is going pretty well, although we're running into some capacity issues with day-to-day work that make me leery about opening it up widely for random research requests. Toby can speak more usefully on what the long-term plan is about making it available or including a cluster in labs containing sanitised data.
Generally-speaking, though, I'd feel pretty uncomfortable with "give us your job and we will run it", even with developer/researcher oversight and LCA approval, because that's a lot of work to be put into each task. If someone's work justifies gaining access to unsanitised, in-cluster data (read: if there's clearly something in it that benefits the community/movement/foundation enough to justify the developer input), it's almost always going to be a lot simpler to just give them cluster access and an NDA. At the moment that's not possible because we have a lot of work and not enough spare capacity to justify routinely jumping on these requests (by which I mean, if you doubled the team size, I think we'd have some spare capacity somewhere, probably), and because even when we do have that capacity, we don't yet have a firm routine and process for handling NDAs (this is also something that's being worked on).
Even if/when such capacity exists, like I said, I think an NDA model is going to be much preferable, and that's going to require a good understanding of what the job is, who's executing it, what their expertise is, what the intended purpose is, and what the benefits are, which haven't been provided in this case: "can you give me X" should never be good enough.
Giving users access to sanitised data would be awesome (although like I said, Toby is the person to comment on this kind of thing) but not viable at the moment because we're still working on the sanitisation process. Data from specific users who have opted in is problematic because, again, we don't know who these "user" creatures are: there is no unique identifier we could use to tag requests.
On Sun, Dec 28, 2014 at 8:35 PM, Oliver Keyes okeyes@wikimedia.org wrote:
I believe Leila Zia and Bob West (who will hopefully see this message. I know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary.
At the moment, we're looking at pageview traces; however, the work is at a very early stage and the data collection itself is still subject to change. You can read about the current state of the project here: https://meta.wikimedia.org/w/index.php?title=Research:Improving_link_coverage
Ditty, do you have a project page for your research you can share?
Leila
Hi Leila,
As of now we don't have a project page. Is the server access log mentioned on the project page available?
with regards
Ditty
On Mon, Dec 29, 2014 at 7:29 PM, Leila Zia leila@wikimedia.org wrote:
On Sun, Dec 28, 2014 at 8:35 PM, Oliver Keyes okeyes@wikimedia.org wrote:
I believe Leila Zia and Bob West (who will hopefully see this message. I know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary.
At the moment, we're looking at pageview traces; however, the work is at a very early stage and the data collection itself is still subject to change. You can read about the current state of the project here: https://meta.wikimedia.org/w/index.php?title=Research:Improving_link_coverage
Ditty, do you have a project page for your research you can share?
Leila
On Mon, Dec 29, 2014 at 7:03 AM, Ditty Mathew dittyvkm@gmail.com wrote:
Hi Leila,
As of now we don't have a project page.
Please share it with us once you have it, preferably on research meta so we can leave comments and brainstorm there.
Is the server access log mentioned on the project page available?
Not publicly. The answers Oliver gave you apply to the logs.
Best, Leila