Looking for reader's click log data for Wikipedia

List overview All Threads
Download

newer

older

Fwd: $55 million raised in 2014

Re: [Wiki-research-l] Looking for...

Ditty Mathew

29 Dec 2014 29 Dec '14

3 a.m.

Hi, Is the reader's click log data(should contain user id/ip, article title, timestamp) is available for Wikipedia. with regards Ditty

Attachments:

attachment.htm (text/html — 228 bytes)

Show replies by date

Oliver Keyes

29 Dec 29 Dec

3:46 a.m.

New subject: Looking for reader's click log data for Wikipedia

...

Hi, Is the reader's click log data(should contain user id/ip, article title, timestamp) is available for Wikipedia. with regards Ditty _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

-- Oliver Keyes Research Analyst Wikimedia Foundation

Ditty Mathew

4:53 a.m.

New subject: Looking for reader's click log data for Wikipedia

...

-- Oliver Keyes Research Analyst Wikimedia Foundation _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Oliver Keyes

5:35 a.m.

New subject: Looking for reader's click log data for Wikipedia

I'm not exactly sure how one provides an anonymised dataset that contains IP addresses. But: We don't have those navigation paths and so can't provide them. Sure, we could provide the {referer, URL} tuples associated with specific IP addresses, and replace the IP with some kind of randomly-generated value (or just a salted hash) but this falls apart very quickly with the modern structure of the internet and the scale Wikimedia properties operate on: you can have a lot of distinct people at one IP address, particularly through cellular networks, and so multiple sessions and trails can get inaccurately grouped together. More importantly, the HTTPS protocol involves either sanitising or completely stripping referers, rendering those chains impossible to reconstruct. I believe Leila Zia and Bob West (who will hopefully see this message. I know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary. But generally-speaking: we do not generate this data as a matter of course, we would not be comfortable releasing it (unless exceedingly sanitised), and as the person who deals with our request logs on a day-to-day basis I can think of a half-dozen ways in which it would produce false results (ways we have no real way of checking the probability of occurring). On 28 December 2014 at 22:53, Ditty Mathew <dittyvkm(a)gmail.com> wrote:

...

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

-- Oliver Keyes Research Analyst Wikimedia Foundation

Jeremy Baron

5:56 a.m.

New subject: Looking for reader's click log data for Wikipedia

On Dec 28, 2014 11:35 PM, "Oliver Keyes" <okeyes(a)wikimedia.org> wrote:

...

More importantly, the HTTPS protocol involves either sanitising or

completely stripping referers, rendering those chains impossible to reconstruct. Could you elaborate? (we're talking about hops from one page to another within the same domain name?) More generally: what is the status of hadoop? could we potentially have 3rd-party users get access even if they can't do an NDA by writing their own mapreduce jobs to support their research? Depending on the job maybe it would need legal (LCA) review before releasing results or maybe some could be reviewed by others (approved by LCA). We could give researchers (all labs users?) access to a truly sanitized dataset with the right format for use when designing jobs. Or maybe not sanitized but filtered to requests for just a few users that volunteered to release their data for X days. -Jeremy

Oliver Keyes

6:08 a.m.

New subject: Looking for reader's click log data for Wikipedia

On 28 December 2014 at 23:56, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:

...

On Dec 28, 2014 11:35 PM, "Oliver Keyes" <okeyes(a)wikimedia.org> wrote:

More importantly, the HTTPS protocol involves either sanitising or

completely stripping referers, rendering those chains impossible to reconstruct. Could you elaborate? (we're talking about hops from one page to another within the same domain name?)

See, this is why I shouldn't reply to emails when I'm already in bed; I end up being wrong ;p. You're right - this shouldn't be a substantial issue with this specific request (in-domain chains) although it is something we're worried about for other types of chaining.

...

More generally: what is the status of hadoop? could we potentially have 3rd-party users get access even if they can't do an NDA by writing their own mapreduce jobs to support their research? Depending on the job maybe it would need legal (LCA) review before releasing results or maybe some could be reviewed by others (approved by LCA).

We could give researchers (all labs users?) access to a truly sanitized

...

dataset with the right format for use when designing jobs. Or maybe not sanitized but filtered to requests for just a few users that volunteered to release their data for X days.

Hive/Hadoop is going pretty well, although we're running into some capacity issues with day-to-day work that make me leery about opening it up widely for random research requests. Toby can speak more usefully on what the long-term plan is about making it available or including a cluster in labs containing sanitised data. Generally-speaking, though, I'd feel pretty uncomfortable with "give us your job and we will run it", even with developer/researcher oversight and LCA approval, because that's a lot of work to be put into each task. If someone's work justifies gaining access to unsanitised, in-cluster data (read: if there's clearly something in it that benefits the community/movement/foundation enough to justify the developer input), it's almost always going to be a lot simpler to just give them cluster access and an NDA. At the moment that's not possible because we have a lot of work and not enough spare capacity to justify routinely jumping on these requests (by which I mean, if you doubled the team size, I think we'd have some spare capacity somewhere, probably), and because even when we do have that capacity, we don't yet have a firm routine and process for handling NDAs (this is also something that's being worked on). Even if/when such capacity exists, like I said, I think an NDA model is going to be much preferable, and that's going to require a good understanding of what the job is, who's executing it, what their expertise is, what the intended purpose is, and what the benefits are, which haven't been provided in this case: "can you give me X" should never be good enough. Giving users access to sanitised data would be awesome (although like I said, Toby is the person to comment on this kind of thing) but not viable at the moment because we're still working on the sanitisation process. Data from specific users who have opted in is problematic because, again, we don't know who these "user" creatures are: there is no unique identifier we could use to tag requests.

...

-Jeremy _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

-- Oliver Keyes Research Analyst Wikimedia Foundation

Leila Zia

2:59 p.m.

New subject: Looking for reader's click log data for Wikipedia

On Sun, Dec 28, 2014 at 8:35 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote: I believe Leila Zia and Bob West (who will hopefully see this message. I

...

know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary.

At the moment, we're looking at pageview traces, however, the work is at very early stages and the data collection itself is still subject to change. You can read about the current state of the project here <https://meta.wikimedia.org/w/index.php?title=Research:Improving_link_coverage> . Ditty, do you have a project page for your research you can share? Leila

...

On 28 December 2014 at 22:53, Ditty Mathew <dittyvkm(a)gmail.com> wrote:

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Ditty Mathew

4:03 p.m.

New subject: Looking for reader's click log data for Wikipedia

Hi Leila, As of now we don't have project page. Is the server access log mentioned in the project page is available?? with regards Ditty On Mon, Dec 29, 2014 at 7:29 PM, Leila Zia <leila(a)wikimedia.org> wrote:

...

On Sun, Dec 28, 2014 at 8:35 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote: I believe Leila Zia and Bob West (who will hopefully see this message. I

know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary.

On 28 December 2014 at 22:53, Ditty Mathew <dittyvkm(a)gmail.com> wrote:

Afraid not. First, we do not have some of those datapoints; we do not currently have unique user IDs. And, second, it would be a tremendous ethical violation for us to release that data that we /do/ have (IP addresses, for example). On 28 December 2014 at 21:00, Ditty Mathew <dittyvkm(a)gmail.com> wrote: > Hi, > > Is the reader's click log data(should contain user id/ip, article > title, timestamp) is available for Wikipedia. > > with regards > > Ditty > > _______________________________________________ > Wiki-research-l mailing list > Wiki-research-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > -- Oliver Keyes Research Analyst Wikimedia Foundation _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Leila Zia

4:18 p.m.

New subject: Looking for reader's click log data for Wikipedia

On Mon, Dec 29, 2014 at 7:03 AM, Ditty Mathew <dittyvkm(a)gmail.com> wrote:

...

Hi Leila, As of now we don't have project page.

Please share it with us once you have it, preferably on research meta so we can leave comments and brainstorm there.

...

Is the server access log mentioned in the project page is available??

Not publicly. The answers Oliver gave you apply to the logs. Best, Leila

...

with regards Ditty On Mon, Dec 29, 2014 at 7:29 PM, Leila Zia <leila(a)wikimedia.org> wrote:

On Sun, Dec 28, 2014 at 8:35 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote: I believe Leila Zia and Bob West (who will hopefully see this message. I

know Leila is on this list!) are currently working on a project that looks at search paths, and they may have additional commentary.

On 28 December 2014 at 22:53, Ditty Mathew <dittyvkm(a)gmail.com> wrote:

The exact user information is not needed. The anonymized data is enough. What exactly we need is the navigation path of Wikipedia readers. with regards Ditty On Sun, Dec 28, 2014 at 9:46 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote: > Afraid not. First, we do not have some of those datapoints; we do not > currently have unique user IDs. And, second, it would be a tremendous > ethical violation for us to release that data that we /do/ have (IP > addresses, for example). > > On 28 December 2014 at 21:00, Ditty Mathew <dittyvkm(a)gmail.com> wrote: > >> Hi, >> >> Is the reader's click log data(should contain user id/ip, article >> title, timestamp) is available for Wikipedia. >> >> with regards >> >> Ditty >> >> _______________________________________________ >> Wiki-research-l mailing list >> Wiki-research-l(a)lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l >> >> > > > -- > Oliver Keyes > Research Analyst > Wikimedia Foundation > > _______________________________________________ > Wiki-research-l mailing list > Wiki-research-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l > > _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

3406

days inactive

3406

days old

wiki-research-l@lists.wikimedia.org

Manage subscription

8 comments

4 participants

tags (0)

participants (4)

Ditty Mathew
Jeremy Baron
Leila Zia
Oliver Keyes