On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
We get 120,000 requests a second. We're not storing them all for six
months. But we do have sampled logs going back that far.

That would be great! Are those in Hadoop?

On Wed, Jan 7, 2015 at 11:36 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Not particularly, I don't think - except to remember that namespace
names are localised, so you're going to have a whale of a time
matching them (unless you just look for file endings, I guess).

In the case of NavigationTiming the nsid is recorded, so that wasn't a problem; but it was only added around May, so there is no namespace information at all for the period before that.

Localized File namespace names don't sound so bad - I can look up all the translations on Translatewiki and construct a regexp or a similar condition. There could be fun exceptions, like namespace translations which have changed recently, but I would be fine with assuming that the error caused by those is not significant.
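
To make the regexp idea concrete, here is a rough sketch of what I have in mind. The name list is only a small hand-picked sample standing in for the full set of translations, the helper names are made up for illustration, and I'm assuming that the logged URLs use the /wiki/Namespace:Title form and that namespace prefixes can be matched case-insensitively.

```python
import re
from urllib.parse import unquote

# Illustrative sample only: the real list would be built from the
# translations of the File namespace name (plus legacy aliases like "Image").
FILE_NS_NAMES = ["File", "Image", "Datei", "Fichier", "Archivo", "Файл"]


def build_file_ns_pattern(names):
    """Build one regexp that matches any of the given namespace names."""
    # Longest names first, so a short name never shadows a longer one.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Assumption: URLs look like /wiki/<Namespace>:<Title>.
    return re.compile(r"/wiki/(?:" + alternatives + r"):", re.IGNORECASE)


FILE_NS_RE = build_file_ns_pattern(FILE_NS_NAMES)


def is_file_page(url):
    """Return True if the URL appears to point at a File namespace page."""
    # Non-Latin namespace names are usually percent-encoded in logs.
    return FILE_NS_RE.search(unquote(url)) is not None
```

Names that were renamed at some point would just be extra entries in the list, as long as I can find the old values.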