On Tue, Nov 25, 2014 at 12:21 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
We're writing up and redefining the pageview definition. Amongst other things, it uses MIME type filtering and folder-level filtering to exclude non-pageviews.

Just to clarify the context, we are talking about counting page views using some server-side log (such as an Apache / Varnish request log), right? In that case there is absolutely no way to tell requests for MediaViewer URLs apart from requests for normal URLs: the fragment part of the URL is never sent to the server, so the browser basically just requests the normal page and then executes a bunch of JavaScript.
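To make the fragment point concrete, here is a minimal sketch in Python (standard library only). The helper name request_target is made up for illustration; it just rebuilds the path-plus-query part that a browser puts on the HTTP request line, which is all a server-side log gets to see.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query, if any) that goes on the HTTP request line.

    The fragment is deliberately left out: browsers keep it on the client.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


plain = "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
viewer = plain + "#mediaviewer/File:Marcello_Malpighi_large.jpg"

# Both URLs produce the same request target, so the log lines look identical.
print(request_target(plain))
print(request_target(viewer))
print(request_target(plain) == request_target(viewer))
```

Both calls give the same string, which is why a log-based filter has nothing to match on.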

This is intentional, as sending a request that is different from the request for the normal page would split the Varnish cache and result in poor performance.

The result of this is that it's currently going to be counted as a pageview, even though it's...well, not.

The aim of those URLs is to present an image in the context of the page; the user can access the page simply by closing the viewer. So it is not necessarily that different from a pageview. If you want detailed information about a request (did the user just want the intro section or the full article? Did they read any text at all?), you need client-side logging anyway, and in that case it is trivial to filter MediaViewer pageviews based on whether or not the user went back to the text.
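As a rough illustration of what that client-side filter could look like, here is a sketch in Python. The event schema (ClientPageLoad and its two boolean fields) is entirely hypothetical and not an existing logging schema; it only assumes the client can report whether the page was loaded with a MediaViewer hash and whether the user later closed the viewer.

```python
from dataclasses import dataclass


@dataclass
class ClientPageLoad:
    page: str
    opened_in_viewer: bool   # hypothetical: URL carried a #mediaviewer/ fragment on load
    returned_to_text: bool   # hypothetical: user closed the viewer and saw the article


def counts_as_pageview(event: ClientPageLoad) -> bool:
    """Count normal loads, and viewer loads only if the user went back to the text."""
    return (not event.opened_in_viewer) or event.returned_to_text


events = [
    ClientPageLoad("Marçello_Malpigi", opened_in_viewer=False, returned_to_text=False),
    ClientPageLoad("Marçello_Malpigi", opened_in_viewer=True, returned_to_text=True),
    ClientPageLoad("Marçello_Malpigi", opened_in_viewer=True, returned_to_text=False),
]
pageviews = [e for e in events if counts_as_pageview(e)]
print(len(pageviews))
```

Whether a viewer load that returns to the text should count is a definition question for the analytics side; the sketch only shows that the filter itself is simple once the data exists.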

Is there any way you lot could avoid the false anchor strategy and pick a URL scheme that won't trigger this? If not, we can just write an exception - but I'd rather that we not have to do that every time anyone decides to make software.

We could add an exception to Varnish to ignore a certain query parameter and treat, say, https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi?mediaviewer#mediaviewer/File:Marcello_Malpighi_large.jpg as a request to wiki/Mar%C3%A7ello_Malpigi and not wiki/Mar%C3%A7ello_Malpigi?mediaviewer and serve it from the same cache (or use something like ESI to the same effect). It would still mess up browser caching and any proxies on the client end, though.
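The normalization idea, sketched in Python rather than actual Varnish configuration: the function cache_key and the IGNORED_PARAMS set are illustrative names, and a real setup would do the equivalent rewrite in the cache layer itself. The point is only that the marker parameter is dropped before the cache lookup while the raw URL, marker included, still reaches the request log.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical marker parameter that the cache should ignore.
IGNORED_PARAMS = {"mediaviewer"}


def cache_key(url: str) -> str:
    """Build a cache key from host, path and query, minus ignored parameters."""
    parts = urlsplit(url)
    # keep_blank_values=True so a bare "?mediaviewer" is parsed as ("mediaviewer", "").
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS
    ]
    key = parts.netloc + parts.path
    if kept:
        key += "?" + urlencode(kept)
    return key


page = "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
viewer = page + "?mediaviewer#mediaviewer/File:Marcello_Malpighi_large.jpg"

# Same cache object for both, even though the logged request lines differ.
print(cache_key(page) == cache_key(viewer))
```

Other query parameters are kept, so URLs like ?action=history would still get their own cache entries.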

Alternatively we could do something nasty like link to https://az.wikipedia.org/wiki/Special:MediaViewer/Mar%C3%A7ello_Malpigi#mediaviewer/File:Marcello_Malpighi_large.jpg and then have that special page do some sort of redirect, but that sounds rather horrible and still has some performance hit due to the extra request.

The nice and clean solution would be to have a kind of "landing page" which does not include the HTML of the wiki page at all, just maybe the set of images found on it, and load the page under it via AJAX when the user closes the lightbox. That would have advantages apart from stats (mostly performance, both server- and client-side), but it would be a major undertaking and not very well aligned with the current MediaWiki architecture, I think.

From the other end of the problem, we could log MediaViewer pageviews over a separate channel so they can be subtracted from the pageview totals if needed (there are about a million URLs with MediaViewer hashes loaded per day, so that would not be much extra traffic), but I imagine maintaining such a dirty hack over complex pageview queries is not something you would wish upon yourselves.
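For what it is worth, the subtraction itself would be something like the sketch below. The function name and the per-page dictionaries are assumptions for illustration, and the numbers are invented; the real pageview queries are certainly more involved than this.

```python
def adjusted_pageviews(server_counts: dict, viewer_counts: dict) -> dict:
    """Subtract separately logged MediaViewer loads from server-side totals.

    Clamps at zero in case the two channels disagree (lost or duplicated events).
    """
    return {
        page: max(total - viewer_counts.get(page, 0), 0)
        for page, total in server_counts.items()
    }


# Invented example numbers, not real traffic data.
server_counts = {"Marçello_Malpigi": 120, "Bakı": 300}
viewer_counts = {"Marçello_Malpigi": 20}

print(adjusted_pageviews(server_counts, viewer_counts))
```

The clamp is there because two independent logging channels will rarely agree exactly, which is part of why this would be unpleasant to maintain.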

So no, I don't really see any way around this, except in the very long term by shifting logging to the client. I also don't see how you could write an exception: as far as the server is aware, there is absolutely no difference between http://example.com and http://example.com#foo - that's kind of the point of not splitting the browser cache. I suppose the shift to client-side logging has to happen eventually anyway, if we want to learn details like time spent on the page or heatmaps or whether the visitor scrolled to the bottom of the page.