I think we are now all getting on the same wavelength.
The one piece of this puzzle I am still missing is why this traffic research for the Signpost seems to have come as a surprise to Toby, and why he thought it would benefit from Legal's input. If the queries were being logged, I would have expected Toby to be aware of them from the logs, and I would expect that he and others regularly check the logs to make sure that all accesses look normal. Toby, can you comment on that, and also clarify which part of this you think would benefit from Legal's input?
Thanks,
Pine
*This is an Encyclopedia* https://www.wikipedia.org/
*One gateway to the wide garden of knowledge, where lies The deep rock of our past, in which we must delve The well of our future, The clear water we must leave untainted for those who come after us, The fertile earth, in which truth may grow in bright places, tended by many hands, And the broad fall of sunshine, warming our first steps toward knowing how much we do not know.*
*—Catherine Munro*
On Mon, Oct 20, 2014 at 10:53 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Makes sense. Yeah, I had an "assuming everyone knows what you know" moment; I appreciate the automated query logging may not be a known thing (for the reasons Jeremy sets out, it's currently accessible only via an internal proxy, which makes it a wee bit difficult for people to know that it exists ;p). Sorry about that.
We could probably do it via Hadoop (it'd be a lot easier to automate!) if we come up with some useful heuristics for what automated activity looks like. I'm hoping that the spider/bot/automation identification as part of the pageviews definition will give us some of that.
On 20 October 2014 13:50, Jeremy Baron jeremy@tuxmachine.com wrote:
On Oct 20, 2014 1:36 PM, "Oliver Keyes" okeyes@wikimedia.org wrote:
I guess mostly I'm just confused as to what you'd add on top of "SSH keys, automated logging and transparent documentation".
I *think* Pine was asking for automatic query logging similar to what you've just said is already happening.
Eventually maybe we'll get these types of queries mostly running on Hadoop + M/R (vs. processing a local file on disk). We could publish public logs of M/R jobs and, for some of them, allow public download of the output (but this particular query would not allow public download of its output, because of IP/UA string/etc.).
-Jeremy
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- Oliver Keyes Research Analyst Wikimedia Foundation