On Thu, Jan 8, 2015 at 3:02 AM, Gergo Tisza <gtisza@wikimedia.org> wrote:
On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
We get 120,000 requests a second. We're not storing them all for six
months. But we do have sampled logs going back that far.

That would be great! Are those in Hadoop?

They're on stat1002 in /a/squid/archive/sampled/

And the webrequest format is documented at https://wikitech.wikimedia.org/wiki/Cache_log_format
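If it helps, here's a minimal sketch (Python) for pulling the request URL out of those sampled files. The field index and the guess that the archived files are gzipped are assumptions on my part, so double-check against the format page above before relying on it:

    #!/usr/bin/env python
    # Rough sketch: yield the request URL from each line of a sampled cache log.
    # ASSUMPTION: fields are whitespace-separated and the URL sits at index 8;
    # verify against https://wikitech.wikimedia.org/wiki/Cache_log_format
    import gzip
    import sys

    URL_FIELD = 8  # assumption -- adjust to match the actual log layout

    def urls(path):
        # Archived files may be gzipped (assumption); fall back to plain text.
        opener = gzip.open if path.endswith('.gz') else open
        with opener(path, 'rt') as f:
            for line in f:
                fields = line.rstrip('\n').split()
                if len(fields) > URL_FIELD:
                    yield fields[URL_FIELD]

    if __name__ == '__main__':
        for url in urls(sys.argv[1]):
            print(url)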

Note that the namespaces only show up as part of the page title in the raw URL, so it's still going to be a bit painful to parse them out.  But folks around here have done stuff like that; maybe someone can chime in with some handy scripts?
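As a rough starting point, something like the untested sketch below pulls the namespace prefix out of a /wiki/ or index.php?title= URL. The NAMESPACES set here is a made-up, incomplete stand-in: the real names vary per wiki and have localized aliases, so you'd want to fetch them from each project's siteinfo API rather than hard-coding them.

    # Rough sketch of extracting a namespace from a raw request URL.
    try:
        from urllib.parse import urlsplit, parse_qs, unquote  # Python 3
    except ImportError:
        from urlparse import urlsplit, parse_qs                # Python 2
        from urllib import unquote

    # ASSUMPTION: hypothetical, incomplete English Wikipedia namespace list.
    NAMESPACES = {'Talk', 'User', 'User_talk', 'Wikipedia', 'Wikipedia_talk',
                  'File', 'File_talk', 'Template', 'Template_talk',
                  'Category', 'Category_talk', 'Help', 'Portal', 'Special'}

    def namespace_of(url):
        """Return the namespace of a /wiki/ or index.php?title= URL, else 'Main'."""
        parts = urlsplit(url)
        title = None
        if parts.path.startswith('/wiki/'):
            title = unquote(parts.path[len('/wiki/'):])
        elif parts.path.endswith('index.php'):
            title = parse_qs(parts.query).get('title', [None])[0]
        if not title or ':' not in title:
            return 'Main'
        prefix = title.split(':', 1)[0].replace(' ', '_')
        # Titles can legitimately contain colons, so only count known prefixes.
        return prefix if prefix in NAMESPACES else 'Main'

    print(namespace_of('https://en.wikipedia.org/wiki/Talk:Banana'))  # Talk
    print(namespace_of('https://en.wikipedia.org/wiki/Banana'))       # Main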