On 8 January 2015 at 03:02, Gergo Tisza gtisza@wikimedia.org wrote:
On Wed, Jan 7, 2015 at 6:25 PM, Oliver Keyes okeyes@wikimedia.org wrote:
We get 120,000 requests a second. We're not storing them all for six months. But we do have sampled logs going back that far.
That would be great! Are those in Hadoop?
On Wed, Jan 7, 2015 at 11:36 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Not particularly, I don't think - except to remember that namespace names are localised, so you're going to have a whale of a time matching them (unless you just look for file endings, I guess).
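A minimal sketch of that file-endings idea in Python, for illustration; the extension list is an assumption, not an exhaustive inventory of upload types:

    import re

    # Match request titles by media extension instead of by the
    # localised namespace name. Extension list is illustrative only.
    MEDIA_EXT_RE = re.compile(
        r'\.(jpe?g|png|gif|svg|tiff?|webm|og[gva]|pdf|djvu)$',
        re.IGNORECASE,
    )

    def looks_like_file_page(title):
        """True if the title ends in a known media extension."""
        return MEDIA_EXT_RE.search(title) is not None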
In the case of NavigationTiming the nsid is recorded, so that wasn't a problem; but it was only added around May, so for the period before that there is no namespace information at all.
The localized file namespace doesn't sound so bad - I can look up all the translations on Translatewiki and construct a regexp or a similar condition. There could be fun exceptions, like namespace translations that have changed recently, but I'd be fine with assuming the error caused by that is not significant.
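A minimal sketch of what that condition could look like, with a hand-picked handful of translations standing in for the full Translatewiki list:

    import re

    # Hypothetical, abbreviated stand-in for the full list of localised
    # names and aliases of the File: namespace from Translatewiki.
    FILE_NAMESPACE_NAMES = ['File', 'Image', 'Datei', 'Fichier', 'Archivo', 'Файл']

    FILE_NS_RE = re.compile(
        r'^(?:%s):' % '|'.join(re.escape(n) for n in FILE_NAMESPACE_NAMES),
        re.IGNORECASE,
    )

    def in_file_namespace(title):
        return FILE_NS_RE.match(title) is not None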
Well, yes; a 750-option regex run over 6 million rows for a day of data. A whale of a time ;p. You can also just use the API's namespaceNames and namespaceAliases code.
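A sketch of that API route, assuming the siteinfo meta query (siprop=namespaces|namespacealiases) is what's meant; requests and the de.wikipedia endpoint are illustrative choices, and the query has to be repeated per wiki:

    import re
    import requests

    API = 'https://de.wikipedia.org/w/api.php'
    FILE_NS = 6  # namespace id of File: pages

    resp = requests.get(API, params={
        'action': 'query',
        'meta': 'siteinfo',
        'siprop': 'namespaces|namespacealiases',
        'format': 'json',
    }).json()

    # Localised name ('Datei' on dewiki), canonical name ('File'),
    # plus any per-wiki aliases ('Bild', ...).
    ns_info = resp['query']['namespaces'][str(FILE_NS)]
    names = [ns_info['*'], ns_info['canonical']]
    names += [a['*'] for a in resp['query']['namespacealiases']
              if a['id'] == FILE_NS]

    FILE_NS_RE = re.compile(
        r'^(?:%s):' % '|'.join(re.escape(n) for n in names),
        re.IGNORECASE,
    )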