On 28 December 2014 at 23:56, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
On Dec 28, 2014 11:35 PM, "Oliver Keyes" <okeyes(a)wikimedia.org> wrote:
More importantly, the HTTPS protocol involves either sanitising or
completely stripping referers, rendering those chains impossible to
reconstruct.
Could you elaborate? (We're talking about hops from one page to another
within the same domain name?)
See, this is why I shouldn't reply to emails when I'm already in bed; I
end up being wrong ;p. You're right - this shouldn't be a substantial
issue with this specific request (in-domain chains), although it is
something we're worried about for other types of chaining.
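For what it's worth, the in-domain case can be sketched like this: as long as the Referer header survives (same-domain hops), chains are just a join of each request against its referer. This is a hypothetical illustration in Python, not the actual log schema - the field names here are made up:

```python
# Hypothetical sketch of referer-based chain reconstruction.
# "url"/"referer" field names are illustrative, not the real log format.
from urllib.parse import urlparse

def build_chains(requests):
    """Link each request to its referer when both are on the same domain."""
    by_url = {r["url"]: r for r in requests}
    chains = []
    for r in requests:
        ref = r.get("referer")
        if not ref:
            continue
        # Cross-domain hops are where HTTPS referer stripping bites;
        # in-domain hops keep the header, so the join still works.
        if urlparse(ref).netloc != urlparse(r["url"]).netloc:
            continue
        if ref in by_url:
            chains.append((ref, r["url"]))
    return chains

log = [
    {"url": "https://en.wikipedia.org/wiki/A", "referer": None},
    {"url": "https://en.wikipedia.org/wiki/B",
     "referer": "https://en.wikipedia.org/wiki/A"},
    {"url": "https://en.wikipedia.org/wiki/C",
     "referer": "https://en.wikipedia.org/wiki/B"},
]
# Recovers the A -> B -> C chain as (referer, url) pairs.
print(build_chains(log))
```

The cross-domain guard is exactly the case we can't rely on: once the referer is sanitised or stripped, that join has nothing to match against.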
More generally: what is the status of Hadoop? Could we potentially have
third-party users get access, even if they can't sign an NDA, by writing
their own MapReduce jobs to support their research? Depending on the job,
maybe the results would need legal (LCA) review before release, or maybe
some could be reviewed by others (approved by LCA).
We could give researchers (all labs users?) access to a truly sanitized
dataset in the right format for use when designing jobs. Or maybe not
sanitized, but filtered to requests from just a few users who volunteered
to release their data for X days.
Hive/Hadoop is going pretty well, although we're running into some capacity
issues with day-to-day work that make me leery about opening it up widely
for random research requests. Toby can speak more usefully on the long-term
plan for making it available, or for including a cluster in labs containing
sanitised data.
Generally speaking, though, I'd feel pretty uncomfortable with "give us
your job and we will run it", even with developer/researcher oversight and
LCA approval, because that's a lot of work to be put into each task. If
someone's work justifies gaining access to unsanitised, in-cluster data
(read: if there's clearly something in it that benefits the
community/movement/foundation enough to justify the developer input), it's
almost always going to be a lot simpler to just give them cluster access
and an NDA. At the moment that's not possible because we have a lot of work
and not enough spare capacity to justify routinely jumping on these
requests (by which I mean, if you doubled the team size, I think we'd have
some spare capacity somewhere, probably), and because even when we do have
that capacity, we don't yet have a firm routine and process for handling
NDAs (this is also something that's being worked on).
Even if/when such capacity exists, like I said, I think an NDA model is
going to be much preferable, and that's going to require a good
understanding of what the job is, who's executing it, what their expertise
is, what the intended purpose is, and what the benefits are, which haven't
been provided in this case: "can you give me X" should never be good enough.
Giving users access to sanitised data would be awesome (although, like I
said, Toby is the person to comment on this kind of thing), but it's not
viable at the moment because we're still working on the sanitisation
process. Data from specific users who have opted in is problematic because,
again, we don't know who these "user" creatures are: there is no unique
identifier we could use to tag requests.
-Jeremy
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Oliver Keyes
Research Analyst
Wikimedia Foundation