Hey all,
As part of the quality assurance work on the new pageviews definition
I've hand-coded 20,000 rows from the webrequests table that the new
definition identifies as pageviews - 10,000 from mobile, and 10,000
from desktop, spread out and pseudo-randomly sampled over multiple
days and hours.
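
For anyone wanting to do something similar, here is a minimal sketch of one way to spread a pseudo-random sample over days and hours. It is illustrative only, not the code used for the sample above; the `day` and `hour` keys and the `sample_spread` name are hypothetical.

```python
import random
from collections import defaultdict


def sample_spread(rows, total, seed=0):
    """Draw up to `total` rows, split evenly across (day, hour) buckets.

    `rows` is an iterable of dicts with hypothetical "day" and "hour" keys.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group rows by the (day, hour) they were requested in.
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["day"], row["hour"])].append(row)
    if not buckets:
        return []

    # Take the same number of rows from every bucket (at least one each).
    per_bucket = max(1, total // len(buckets))
    sample = []
    for key in sorted(buckets):
        group = buckets[key]
        sample.extend(rng.sample(group, min(per_bucket, len(group))))

    # If there are more buckets than `total`, this trims the excess.
    return sample[:total]
```

Running it once per access method (mobile, desktop) would give two separately sampled sets of the same size.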
TL;DR: it looks really promising, but we need some expansion of how
we're storing pageIDs, and to filter out edit attempts on the desktop
side. And we still have a lot of hand-coding to do.
On Mobile, the definition is doing exactly what we expect it to do,
and including precisely the classes of pages we want. There is,
seriously, a 100% success rate there. The only limiting factor is
turning "pageviews" (views of our HTML content) into the sort
of pageviews that can be aggregated on a per-page basis - in other
words, grabbing the pageID and namespace. A recent patch to MediaWiki
by Ori, Otto and others means that the pageID and namespace are now
automatically passed through to Varnish, which makes this a LOT
easier.
Buuuut...they're not being passed through for app requests, which is a
big blind spot if we assume apps behave differently. They're also not
being passed through for, e.g., index.php?action=render style
requests.
On Desktop, the definition is doing /almost/ what we want it to do.
The big problem is that, because the MIME type edit requests report
has changed, it's including edit attempts: whoops. We should be able
to filter this with a trivial regex change...I think. I'd need a
better idea of whether URL parameters tend to be localised.
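
As a rough sketch of what that filter could look like: the pattern below assumes edit attempts show up with `action=edit` or `action=submit` in the query string, which is an assumption and not something the hand-coding has confirmed. It also assumes the parameter name and values are not localised, which is exactly the open question. `EDIT_ATTEMPT` and `is_edit_attempt` are made-up names for illustration.

```python
import re

# Assumption: edit attempts carry action=edit or action=submit as a
# query parameter, and neither the name nor the value is localised.
EDIT_ATTEMPT = re.compile(r"[?&]action=(?:edit|submit)(?:&|#|$)", re.IGNORECASE)


def is_edit_attempt(url):
    """Return True if the URL looks like an edit attempt under the assumption above."""
    return EDIT_ATTEMPT.search(url) is not None


# Example: keep only the requests that are not edit attempts.
requests = [
    "/wiki/Foo",
    "/w/index.php?title=Foo&action=edit",
    "/w/index.php?title=Foo&action=render",
]
kept = [u for u in requests if not is_edit_attempt(u)]
```

Anchoring on `?` or `&` before `action` keeps the pattern from matching other parameters that merely end in "action".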
So, promising, needs edits filtered!
The next step is a further round of hand-coding, this time targeted at
requests the new definition /excludes/.
--
Oliver Keyes
Research Analyst
Wikimedia Foundation