(Hat-tip to Nemo for noticing this)
We're writing up and redefining the pageview definition. Amongst other things, it uses MIME type filtering and folder-level filtering to exclude non-pageviews.
Problem: the multimedia viewer false anchor (an example is https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi#mediaviewer/File:Marcell...) is both text/html (one of the recognised and appreciated MIME types) and under /wiki/ (one of the recognised and appreciated folders).
The result of this is that it's currently going to be counted as a pageview, even though it's...well, not.
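To make the problem concrete, here is a minimal Python sketch of a filter of this shape. The function name and the two allow-lists are assumptions for illustration only; the actual pageview definition is not quoted in this thread. The example uses the full MediaViewer URL that appears later in the discussion.

```python
from urllib.parse import urlsplit

# Hypothetical allow-lists; the real definition is not shown in this thread.
ALLOWED_MIME_TYPES = {"text/html"}
ALLOWED_PATH_PREFIXES = ("/wiki/",)


def looks_like_pageview(url: str, mime_type: str) -> bool:
    """Return True if a logged request passes both the MIME and folder filters."""
    path = urlsplit(url).path  # the fragment is not part of the path
    return mime_type in ALLOWED_MIME_TYPES and path.startswith(ALLOWED_PATH_PREFIXES)


article = "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
viewer = article + "#mediaviewer/File:Marcello_Malpighi_large.jpg"

# Both the plain article URL and the MediaViewer URL pass the same two checks.
print(looks_like_pageview(article, "text/html"))
print(looks_like_pageview(viewer, "text/html"))
```

Under these assumptions nothing in either check looks at the part after the `#`, which is why the MediaViewer URL is counted.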
Is there any way you lot could avoid the false anchor strategy and pick a URL scheme that won't trigger this? If not, we can just write an exception - but I'd rather that we not have to do that every time anyone decides to make software.
Thanks,
Is it not a page view, or is it a page view of the image rather than the article?
If the media viewer is dismissed and the page is then read, is it still not a page view?
-- brion
On Tue, Nov 25, 2014 at 12:21 PM, Oliver Keyes okeyes@wikimedia.org wrote:
(Hat-tip to Nemo for noticing this)
We're writing up and redefining the pageview definition. Amongst other things, it uses MIME type filtering and folder-level filtering to exclude non-pageviews.
Problem: the multimedia viewer false anchor (an example is https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi#mediaviewer/File:Marcell...) is both text/html (one of the recognised and appreciated MIME types) and under /wiki/ (one of the recognised and appreciated folders).
The result of this is that it's currently going to be counted as a pageview, even though it's...well, not.
Is there any way you lot could avoid the false anchor strategy and pick a URL scheme that won't trigger this? If not, we can just write an exception - but I'd rather that we not have to do that every time anyone decides to make software.
Thanks,
-- Oliver Keyes Research Analyst Wikimedia Foundation
Multimedia mailing list Multimedia@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/multimedia
I'm pretty sure this should be counted as a page view. As Brion notes this is the equivalent of a page view of the /wiki/File:Marcello_Malpighi_large.jpg page. If this doesn't count as a page view then mobile page views shouldn't count either as they are transformations of the canonical page as well.
Bryan
On Tue, Nov 25, 2014 at 1:38 PM, Brion Vibber bvibber@wikimedia.org wrote:
Is it not a page view, or is it a page view of the image rather than the article?
If the media viewer is dismissed and the page is then read, is it still not a page view?
-- brion
On Tue, Nov 25, 2014 at 12:52 PM, Bryan Davis bd808@wikimedia.org wrote:
I'm pretty sure this should be counted as a page view. As Brion notes this is the equivalent of a page view of the /wiki/File:Marcello_Malpighi_large.jpg page. If this doesn't count as a page view then mobile page views shouldn't count either as they are transformations of the canonical page as well.
AFAIK, the hash is generally not sent to the server so it (and thus the image name) won't be seen by log-based data crunching.
As such I think the issue Oliver is raising is that it would likely be counted as a page view for the article page, even though we don't know whether or not the article will actually be seen or read. However there's no way offhand to know that that page view won't result in a read of the article as well once the media viewer is dismissed. And in general, simply seeing that a data transfer was made tells us nothing about how much attention a human paid to the contents...
-- brion
Actually, I'd argue it's not equivalent at all, for two reasons:
1. It doesn't present all of the same data. In fact, it presents very little data, compared to a pageview of the "File" page;
2. The argument behind MMV is, as I understand it, that people are focusing on the images. It is designed so that people do so, on the basis that people clicking on images probably want those images. As such, it'd be inaccurate to weight it as equivalent to say https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi#mediaviewer/File:Marcello_Malpighi_large.jpg in textual value - we believe (correct me if I'm wrong) that someone clicking for an image wants a media file, not a wall of text.
And, of course, even if we do include it, it's Yet Another URL Scheme to take into account when extracting "page" from "URL". I don't think mobile pageviews are a valid equivalency, because our design pattern there does not assume the user wants a non-text outcome from the request.
On 25 November 2014 at 16:05, Brion Vibber bvibber@wikimedia.org wrote:
AFAIK, the hash is generally not sent to the server so it (and thus the image name) won't be seen by log-based data crunching.
As such I think the issue Oliver is raising is that it would likely be counted as a page view for the article page, even though we don't know whether or not the article will actually be seen or read. However there's no way offhand to know that that page view won't result in a read of the article as well once the media viewer is dismissed. And in general, simply seeing that a data transfer was made tells us nothing about how much attention a human paid to the contents...
-- brion
I don't see changing the URL schema for the sake of PV measurement as on the table at this point. URLs to MMV invocations are shared, so we'd need to keep those links working anyway.
I do agree that we ideally want those to be countable separately (as part of an overall understanding of image views), but I don't think it's unreasonable or broken to count them as PVs for now until we can do so.
Erik
Fair point! In the meantime I'll add an exception for these requests, and factor them into the eventual image views thing.
On 25 November 2014 at 17:15, Erik Moeller erik@wikimedia.org wrote:
I don't see changing the URL schema for the sake of PV measurement as on the table at this point. URLs to MMV invocations are shared, so we'd need to keep those links working anyway.
I do agree that we ideally want those to be countable separately (as part of an overall understanding of image views), but I don't think it's unreasonable or broken to count them as PVs for now until we can do so.
Erik
-- Erik Möller VP of Product & Strategy, Wikimedia Foundation
On Tue, Nov 25, 2014 at 1:59 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Actually, I'd argue it's not equivalent at all, for two reasons:
1. It doesn't present all of the same data. In fact, it presents very little data, compared to a pageview of the "File" page;
2. The argument behind MMV is, as I understand it, that people are focusing on the images. It is designed so that people do so, on the basis that people clicking on images probably want those images. As such, it'd be inaccurate to weight it as equivalent to say https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi#mediaviewer/File:Marcello_Malpighi_large.jpg in textual value - we believe (correct me if I'm wrong) that someone clicking for an image wants a media file, not a wall of text.
MediaViewer hash loads and File page requests have little to do with each other. A File page request happens when 1) someone clicks on a thumbnail, or 2) someone shares the URL of a file page and someone else follows that URL. In the case of MediaViewer, only the second case (following a shared URL) results in a text/html request to the server. The first case (a thumbnail click, which is about 30x more frequent) only results in a bunch of AJAX calls and an image request (actually more than one, due to preloading). Those AJAX calls could easily be made unique, if that is of any interest.
So basically when you click on an image, MediaViewer uses AJAX requests to load some of the information from the file page, then creates an <img> tag so the browser loads a large image thumbnail. When you visit a URL ending in #mediaviewer/..., that just tells the MV code to simulate an image click as soon as the page has loaded.
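As a rough illustration of the hash handling described above, the following Python sketch extracts the file title from a #mediaviewer/ fragment. The real MediaViewer code runs in the browser and is not shown in this thread; the function name and prefix constant here are hypothetical.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit

# Fragment prefix as it appears in the URLs quoted in this thread.
VIEWER_PREFIX = "mediaviewer/"


def file_from_hash(url: str) -> Optional[str]:
    """Return the file title named in a #mediaviewer/... fragment, else None."""
    fragment = urlsplit(url).fragment
    if not fragment.startswith(VIEWER_PREFIX):
        return None  # ordinary page load, no viewer to open
    return unquote(fragment[len(VIEWER_PREFIX):])


url = ("https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
       "#mediaviewer/File:Marcello_Malpighi_large.jpg")
print(file_from_hash(url))
```

The point of the sketch is that this decision is made entirely from the fragment, on the client, after the normal page has already been requested.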
The more client-side and mobile apps we develop, the less value server logs of page hits provide in terms of knowing what people are doing (was it ever possible to truly tell bots apart from humans? to compensate for caching proxies run by organizations?). I think that it's inevitable that any meaningful tracking will have to be done client-side. Looking for ways to adapt our URL schemes for the sake of server logs seems like rearranging the deck chairs on the Titanic to me. We should be trying to put as little work into it as possible. Our stats efforts should rather be focused on more fine-grained client-side and mobile tracking, which is what we need to truly answer questions, even on our old "static" pages like the articles themselves. Just as I've been working on tracking how long images are viewed for (at the Amsterdam hackathon, in preparation for Erik Zachte's RFC on image views), we should be doing the same sort of measurements on articles.
On Wed, Nov 26, 2014 at 12:51 AM, Gergo Tisza gtisza@wikimedia.org wrote:
MediaViewer hash loads and File page requests have little to do with each other. A File page request happens when 1) someone clicks on a thumbnail, or 2) someone shares the URL of a file page and someone else follows that URL. In the case of MediaViewer, only the second case (following a shared URL) results in a text/html request to the server. The first case (a thumbnail click, which is about 30x more frequent) only results in a bunch of AJAX calls and an image request (actually more than one, due to preloading). Those AJAX calls could easily be made unique, if that is of any interest.
So basically when you click on an image, MediaViewer uses AJAX requests to load some of the information from the file page, then creates an <img> tag so the browser loads a large image thumbnail. When you visit a URL ending in #mediaviewer/..., that just tells the MV code to simulate an image click as soon as the page has loaded.
So, in sequence:
Gergo: Either the false anchors are sent to the server or some conniving elf has been inserting thousands of fake requests into our logs ;). I'm seeing a lot of requests with #mediaviewer/ URLs, some internal and some with referers from outside the WMF (implying someone following a link). The proposed ways forward are useful, but as Erik M says, reorganising active products for the sake of avoiding a pageviews filter is probably not worth it unless it's a truly trivial change, so let's just stick with the status quo for now and I'll build in a filter.
Gilles: see above, re Erik's comments.
Thanks to everyone for their commentary and help; I'll build a filter into the definition this morning :)
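A filter of the kind mentioned here could be sketched in Python as follows. This is only an assumption about its shape: it presumes each log record exposes the requested URL as a string, with the fragment retained in the cases where a client actually sent it. The function names are hypothetical.

```python
from typing import Iterable, List

# Marker taken from the URLs discussed in this thread.
MEDIAVIEWER_MARKER = "#mediaviewer/"


def is_mediaviewer_request(logged_url: str) -> bool:
    """True if the logged URL carries a MediaViewer false anchor."""
    return MEDIAVIEWER_MARKER in logged_url


def filter_pageviews(logged_urls: Iterable[str]) -> List[str]:
    """Drop MediaViewer requests, keep everything else in order."""
    return [u for u in logged_urls if not is_mediaviewer_request(u)]


sample = [
    "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi",
    "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
    "#mediaviewer/File:Marcello_Malpighi_large.jpg",
]
print(filter_pageviews(sample))
```

Such a filter can only catch requests where the fragment reached the logs at all.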
On 26 November 2014 at 05:07, Gilles Dubuc gilles@wikimedia.org wrote:
Server logs of page hits provide less and less value in terms of knowing what people are doing (was it ever possible to truly tell bots apart from humans? to compensate for caching proxies run by organizations?), the more client-side and mobile apps we develop. I think that it's inevitable that any meaningful tracking will have to be done client-side. Looking for ways to adapt our URL schemes for the sake of server logs seems like rearranging the deck chairs on the titanic to me. We should be trying to put as little work into it as possible. Our stats efforts should be rather focused on more fine-grained client-side and mobile tracking, which is what we need to truly answer questions, even on our old "static" pages like the articles themselves. The same way that I've been working on tracking how long images are being viewed for at the Amsterdam hackathon in preparation for Erik Zachte's RFC on image views, we should be doing the same sort of measurements on articles.
On Wed, Nov 26, 2014 at 08:11:27AM -0500, Oliver Keyes wrote:
Gergo: Either the false anchors are sent to the server or some conniving elf has been inserting thousands of fake requests into our logs ;). I'm seeing a lot of requests with #mediaviewer/ URLs, some internal and some with referers from outside the WMF (implying someone following a link). The proposed ways forward are useful, but as Erik M says, reorganising active products for the sake of avoiding a pageviews filter is probably not worth it unless it's a truly trivial change, so let's just stick with the status quo for now and I'll build in a filter.
Given we're already in the realm of the impossible - Oliver, do any of the internal requests look like the Referer is *the same page* as the one the #mediaviewer link is for? If so, something is very wrong.
If not, something is only moderately wrong.
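The check being asked about could be sketched like this in Python, assuming each log record provides the requested URL and the Referer header as strings (the function name is hypothetical):

```python
from urllib.parse import urldefrag


def referer_is_same_page(requested_url: str, referer: str) -> bool:
    """True if the Referer points at the same page as the request, ignoring fragments."""
    return urldefrag(requested_url).url == urldefrag(referer).url


page = "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
viewer = page + "#mediaviewer/File:Marcello_Malpighi_large.jpg"

print(referer_is_same_page(viewer, page))                     # the "very wrong" case
print(referer_is_same_page(viewer, "https://example.org/"))   # an external link in
```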
Not that I saw? But I only grabbed a small sample - I can grab a bigger one and send it over internally if you'd like. Many of them look like inbound external links or filtering mechanisms (judging by the referer), but not all.
On 26 November 2014 at 08:33, Mark Holmquist mtraceur@member.fsf.org wrote:
Given we're already in the realm of the impossible - Oliver, do any of the internal requests look like the Referer is *the same page* as the one the #mediaviewer link is for? If so, something is very wrong.
If not, something is only moderately wrong.
-- Mark Holmquist Software Engineer, Multimedia Wikimedia Foundation mtraceur@member.fsf.org https://wikimediafoundation.org/wiki/User:MHolmquist
On Wed, Nov 26, 2014 at 12:13:25PM -0500, Oliver Keyes wrote:
Not that I saw? But I only grabbed a small sample - I can grab a bigger one and send it over internally if you'd like. Many of them look like external links in or filtering mechanisms (based on the referer), but not all.
Nah, it sounds like only moderately wrong, so I won't worry about it.
Thanks,
On Wed, Nov 26, 2014 at 5:11 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Gergo: Either the false anchors are sent to the server or some conniving elf has been inserting thousands of fake requests into our logs ;).
Probably not an elf but some sort of misbehaving client.
The base URI identifies a primary resource (the web page, in this case), the fragment id identifies a secondary one which might or might not be part of the primary. Retrieving the primary resource is the server's job; retrieving the secondary one based on the primary is the user agent's. So before the URI is dereferenced (retrieved), the user agent removes the fragment. This is officially described in RFC 3986 section 3.5 https://tools.ietf.org/html/rfc3986#section-3.5; the W3C recommendation on web architecture has a more human-readable version http://www.w3.org/TR/webarch/#media-type-fragid ("Interpretation of the fragment identifier is performed solely by the agent that dereferences a URI; the fragment identifier is not passed to other systems during the process of retrieval.").
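A small Python sketch of what a conforming user agent would put on the request line, using only the standard library. The helper name is hypothetical; the behaviour it models (dropping the fragment before retrieval) is the one described in RFC 3986 section 3.5.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query, if any) sent to the server; the fragment is dropped."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


page = "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
viewer = page + "#mediaviewer/File:Marcello_Malpighi_large.jpg"

# A conforming client sends the same request target for both URLs.
print(request_target(page))
print(request_target(viewer))
```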
There will be some user agents which do not conform to the RFC (mostly search engines, I would expect, but maybe some unusual browsers as well), but I'm pretty sure major browsers do. So you should be aware that you are only filtering out a fraction of the MediaViewer requests.
We log an event (with sampling) when MediaViewer is loaded via a URL hash, and over all wikis it happens about 1M/day - see http://multimedia-metrics.wmflabs.org/graphs/mmv_actions_global , "hash load" at the bottom (replace "global" with DB name for a filtered view, raw data is in Schema:MediaViewer https://meta.wikimedia.org/wiki/Schema:MediaViewer); you can use that to approximate how often such links are really followed.
On Tue, Nov 25, 2014 at 12:21 PM, Oliver Keyes okeyes@wikimedia.org wrote:
We're writing up and redefining the pageview definition. Amongst other things, it uses MIME type filtering and folder-level filtering to exclude non-pageviews.
Just to clarify the context, we are talking about counting page views using some server side log (such as an Apache / Varnish request log), right? In that case there is absolutely no way to tell apart requests to MediaViewer and normal URLs - the fragment part of the URL is never sent to the server, so the browser basically just requests the normal page and then executes a bunch of JavaScript.
This is intentional, as sending a request that is different from the request for the normal page would split the Varnish cache and result in poor performance.
The result of this is that it's currently going to be counted as a pageview, even though it's...well, not.
The aim of those URLs is to present an image in the context of the page; the user can access the page simply by closing the viewer. So it is not necessarily that different from a pageview. If you want detailed information about a request (did the user just want the intro section or the full article? Did they read any text at all?), you need client-side logging anyway, and in that case it is trivial to filter MediaViewer pageviews based on whether or not the user went back to the text.
Is there any way you lot could avoid the false anchor strategy and pick a URL scheme that won't trigger this? If not, we can just write an exception - but I'd rather that we not have to do that every time anyone decides to make software.
We could add an exception to Varnish to ignore a certain query parameter and treat, say, https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi?mediaviewer#mediaviewer/File:Marcello_Malpighi_large.jpg as a request to wiki/Mar%C3%A7ello_Malpigi and not wiki/Mar%C3%A7ello_Malpigi?mediaviewer and serve it from the same cache (or use something like ESI to the same effect). It would still mess up browser caching and any proxies on the client end, though.
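The normalisation idea can be illustrated in Python. This is not Varnish configuration (which would be written in VCL and is not shown in this thread); it is only a sketch of the cache-key logic, with a hypothetical function name and a hypothetical set of ignored parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical marker parameter that the cache would be told to ignore.
IGNORED_PARAMS = {"mediaviewer"}


def cache_key(url: str) -> str:
    """Build a cache key from path and query, skipping ignored parameters."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS
    ]
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


plain = "https://az.wikipedia.org/wiki/Mar%C3%A7ello_Malpigi"
marked = plain + "?mediaviewer#mediaviewer/File:Marcello_Malpighi_large.jpg"

# Both URLs map to the same key, so they would share one cache entry,
# while the server log would still see the ?mediaviewer marker.
print(cache_key(plain))
print(cache_key(marked))
```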
Alternatively we could do something nasty like link to https://az.wikipedia.org/wiki/Special:MediaViewer/Mar%C3%A7ello_Malpigi#medi... and then have that special page do some sort of redirect, but that sounds rather horrible and still has some performance hit due to the extra request.
The nice and clean solution would be to have a kind of "landing page" which does not include the HTML of the wiki page at all, just maybe the set of images found on it, and load the page under it via AJAX when the user closes the lightbox. That would have advantages apart from stats (mostly performance, both server- and client-side), but it would be a major undertaking and not very well aligned with the current MediaWiki architecture, I think.
From the other end of the problem, we could log MediaViewer pageviews over a separate channel so they can be subtracted from the pageview totals if needed (there are about a million URLs with MediaViewer hashes loaded per day, so that would not be a huge extra traffic), but I imagine maintaining such a dirty hack over complex pageview queries is not something you would wish upon yourselves.
So no, I don't really see any way around this (and also don't see how you could write an exception - as far as the server is aware, there is absolutely no difference between http://example.com and http://example.com#foo - that's kind of the point of not splitting browser cache), except in the very long term by shifting logging to the client. I suppose that has to happen eventually anyway, if we want to learn details like time spent on the page or heatmaps or whether the visitor scrolled to the bottom of the page.