I think we are now all getting on the same wavelength.

The one piece of this puzzle that I am still missing is why this traffic research for the Signpost seems to have come as a surprise to Toby, and why he thought it would benefit from Legal's input. If the queries were being logged, I would have expected Toby to be aware of them from the logs, since I would think that he and others regularly check the logs to make sure that all accesses look normal. Toby, can you comment on that, and also clarify which part of this you think would benefit from Legal's input?

Thanks,

Pine

This is an Encyclopedia
One gateway to the wide garden of knowledge, where lies
The deep rock of our past, in which we must delve
The well of our future,
The clear water we must leave untainted for those who come after us,
The fertile earth, in which truth may grow in bright places, tended by many hands,
And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.
—Catherine Munro

On Mon, Oct 20, 2014 at 10:53 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:

Makes sense. Yeah, I had an "assuming everyone knows what you know" moment; I appreciate the automated query logging may not be a known thing (for the reasons Jeremy sets out, it's currently accessible only via an internal proxy, which makes it a wee bit difficult for people to know that it exists ;p). Sorry about that.

We could probably do it via Hadoop (it'd be a lot easier to automate!) if we come up with some useful heuristics for what automated activity looks like. I'm hoping that the spider/bot/automation identification as part of the pageviews definition will give us some of that.

On 20 October 2014 13:50, Jeremy Baron <jeremy@tuxmachine.com> wrote:
On Oct 20, 2014 1:36 PM, "Oliver Keyes" <okeyes@wikimedia.org> wrote:
> I guess mostly I'm just confused as to what you'd add on top of "SSH keys, automated logging and transparent documentation".

I *think* Pine was asking for automatic query logging similar to what you've just said is already happening.

Eventually maybe we'll get these types of queries mostly running on Hadoop + M/R (vs. processing a local file on disk). We could publish public logs of M/R jobs and, for some of them, allow public download of the output (but this particular query would not allow public downloading of the output because of IP/UA string/etc.).

-Jeremy

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
