Hey all,
As part of the quality assurance work on the new pageviews definition I've hand-coded 20,000 rows from the webrequests table that the new definition identifies as pageviews - 10,000 from mobile, and 10,000 from desktop, spread out and pseudo-randomly sampled over multiple days and hours.
TL;DR: it looks really promising, but we need to expand how we're storing pageIDs, and to filter out edit attempts on the desktop side. And we still have a lot of hand-coding to do.
On Mobile, the definition is doing exactly what we expect it to do, and including precisely the classes of pages we want. There is, seriously, a 100% success rate there. The only limiting factor is around turning "pageviews" (views of our HTML content) into the sort of pageviews that can be aggregated on a per-page basis - in other words, grabbing the pageID and namespace. A recent patch to MediaWiki by Ori, Otto and others means that the pageID and namespace are now automatically passed through to the varnish, which makes this a LOT easier.
Buuuut...they're not being passed through for app requests, which is a big blind spot if we assume apps behave differently. They're also not being passed through for, e.g., index.php?action=render style requests.
On Desktop, the definition is doing /almost/ what we want it to do. The big problem is that, because of a change in the MIME type that edit requests report, it's including edit attempts: whoops. We should be able to filter these out with a trivial regex change...I think. I'd need a better idea of whether URL parameters tend to be localised.
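For the sake of discussion, here's a rough sketch of the kind of regex filter I mean. It is not the actual definition code: the pattern, the function names and the choice of action=edit and action=submit as the markers of an edit attempt are all assumptions on my part, and it also assumes the action parameter is not localised, which is exactly the open question above.

```python
import re

# Hypothetical pattern (an assumption, not the real pageview definition):
# treat action=edit and action=submit in the query string as edit attempts.
# The leading [?&] stops it matching parameters that merely end in "action",
# and the trailing group stops it matching values such as "editor".
EDIT_ACTION = re.compile(r"[?&]action=(?:edit|submit)(?:&|#|$)", re.IGNORECASE)


def is_edit_attempt(url: str) -> bool:
    """Return True if the URL looks like an edit attempt under the assumptions above."""
    return EDIT_ACTION.search(url) is not None


def drop_edit_attempts(urls):
    """Keep only the URLs that do not look like edit attempts."""
    return [u for u in urls if not is_edit_attempt(u)]
```

If parameter names or values do turn out to be localised, a fixed pattern like this would miss them, so treat it as a starting point for the hand-coding checks rather than a fix.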
So, promising, needs edits filtered!
The next step is a further round of hand-coding, this time targeted at requests the new definition /excludes/.