On Wed, Nov 26, 2014 at 5:11 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:

Gergo: Either the false anchors are sent to the server or some conniving elf has been inserting thousands of fake requests into our logs ;).

Probably not an elf but some sort of misbehaving client.

The base URI identifies a primary resource (the web page, in this case); the fragment identifier identifies a secondary one, which might or might not be part of the primary. Retrieving the primary resource is the server's job; retrieving the secondary one based on the primary is the user agent's. So before the URI is dereferenced (retrieved), the user agent removes the fragment. This is officially described in RFC 3986 section 3.5; the W3C recommendation on web architecture has a more human-readable version ("Interpretation of the fragment identifier is performed solely by the agent that dereferences a URI; the fragment identifier is not passed to other systems during the process of retrieval.").
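
To make the fragment removal concrete, here is a rough Python sketch of what a conforming user agent does before sending a request. The function name and the example URL are made up for illustration; only urllib.parse.urldefrag is a real standard-library call.

```python
from urllib.parse import urldefrag

def split_for_request(uri):
    """Split a URI into the part sent to the server and the part kept client-side.

    Hypothetical helper: a conforming user agent sends only the first
    element and interprets the second one itself after retrieval.
    """
    base, fragment = urldefrag(uri)
    return base, fragment

# Made-up example URL; the fragment never appears in the HTTP request.
to_server, kept_by_client = split_for_request(
    "https://en.wikipedia.org/wiki/Foo#mediaviewer/File:Bar.jpg"
)
print(to_server)       # what the server is asked for
print(kept_by_client)  # what only the user agent sees
```

So the server log of a well-behaved client would contain just the page URL, with no trace of the fragment.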

There will be some user agents which do not conform to the RFC (mostly search engines, I would expect, but maybe some unusual browsers as well), but I'm pretty sure major browsers do. So you should be aware that you are only filtering out a fraction of the MediaViewer requests.
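
For illustration, a filter on the request logs can only ever see the leaked fragments, something like the sketch below. I am assuming here that the logged request URI is available as a plain string and that the fragment starts with "mediaviewer/"; both are assumptions for the example, not a description of your actual pipeline.

```python
from urllib.parse import urlsplit

# Assumed fragment prefix, used only for this example.
MMV_PREFIX = "mediaviewer/"

def leaked_mediaviewer_fragment(request_uri, prefix=MMV_PREFIX):
    """True if the logged URI still carries a MediaViewer-style fragment,
    i.e. the client failed to strip it before sending the request."""
    return urlsplit(request_uri).fragment.startswith(prefix)

def count_leaked(request_uris, prefix=MMV_PREFIX):
    """Count logged URIs with a leaked fragment (hypothetical helper)."""
    return sum(1 for uri in request_uris if leaked_mediaviewer_fragment(uri, prefix))
```

Requests from conforming browsers never match this check, because for them the fragment was removed before the request was sent.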

We log an event (with sampling) when MediaViewer is loaded via a URL hash, and over all wikis it happens about 1M times per day - see http://multimedia-metrics.wmflabs.org/graphs/mmv_actions_global , "hash load" at the bottom (replace "global" with the DB name for a filtered view; raw data is in Schema:MediaViewer). You can use that to approximate how often such links are really followed.