On Thu, Jan 8, 2015 at 3:02 AM, Gergo Tisza gtisza@wikimedia.org wrote:
On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes okeyes@wikimedia.org wrote:
We get 120,000 requests a second. We're not storing them all for six months. But we do have sampled logs going back that far.
That would be great! Are those in Hadoop?
They're on stat1002 in /a/squid/archive/sampled/
And the webrequest format is documented at: https://wikitech.wikimedia.org/wiki/Cache_log_format
Note that the namespaces only show up in the page title within the raw URL, so it's still going to be a bit painful to parse them out. But folks around here have done stuff like that; maybe someone can chime in with some handy scripts?
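
Something along these lines might work as a starting point for pulling namespaces out of the titles. This is an untested sketch: the URL field index, the namespace list, and the URL shapes it handles are assumptions, so check them against the Cache_log_format page and the wiki you're looking at.

#!/usr/bin/env python3
"""Sketch: extract namespaces from page titles in sampled request log URLs.

Assumptions (verify against the Cache_log_format page and your wiki):
  * log lines are whitespace-separated and the requested URL sits at URL_FIELD
  * only /wiki/Title and index.php?title=Title style URLs are handled
  * NAMESPACES is the English Wikipedia default set; other wikis differ
"""
import sys
import urllib.parse

URL_FIELD = 8  # 0-based index of the URL field -- check against the format doc

# A few canonical (English) namespaces; extend per wiki as needed.
NAMESPACES = {
    "Talk", "User", "User_talk", "Wikipedia", "Wikipedia_talk",
    "File", "File_talk", "MediaWiki", "Template", "Template_talk",
    "Help", "Category", "Category_talk", "Special", "Portal",
}

def title_from_url(url):
    """Return the raw page title from a MediaWiki URL, or None."""
    parsed = urllib.parse.urlparse(url)
    if parsed.path.startswith("/wiki/"):
        return urllib.parse.unquote(parsed.path[len("/wiki/"):])
    if parsed.path.endswith("index.php"):
        qs = urllib.parse.parse_qs(parsed.query)
        if "title" in qs:
            return urllib.parse.unquote(qs["title"][0])
    return None

def namespace_of(title):
    """Split off the namespace prefix; default to the main (article) namespace."""
    prefix, sep, _rest = title.partition(":")
    if sep and prefix in NAMESPACES:
        return prefix
    return "Main"

if __name__ == "__main__":
    for line in sys.stdin:
        fields = line.split()
        if len(fields) <= URL_FIELD:
            continue
        title = title_from_url(fields[URL_FIELD])
        if title:
            print(namespace_of(title), title, sep="\t")

Then something like

  zcat /a/squid/archive/sampled/<some sampled file> | python3 parse_namespaces.py | cut -f1 | sort | uniq -c

would give a rough per-namespace request count (file names in that directory left as-is since I don't have them in front of me).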