On Wed, Nov 26, 2014 at 5:11 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> Gergo: Either the false anchors are sent to the server or some conniving
> elf has been inserting thousands of fake requests into our logs ;).
Probably not an elf but some sort of misbehaving client.
The base URI identifies a primary resource (the web page, in this case),
the fragment id identifies a secondary one which might or might not be part
of the primary. Retrieving the primary resource is the server's job;
retrieving the secondary one based on the primary is the user agent's. So
before the URI is dereferenced (retrieved), the user agent removes the
fragment. This is officially described in RFC 3986 section 3.5
<https://tools.ietf.org/html/rfc3986#section-3.5>; the W3C recommendation
on web architecture has a more human-readable version
<http://www.w3.org/TR/webarch/#media-type-fragid> ("Interpretation of the
fragment identifier is performed solely by the agent that dereferences a
URI; the fragment identifier is not passed to other systems during the
process of retrieval.").
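To make that concrete, here is a minimal sketch of what a conforming user
agent does before sending the request. It uses Python's standard
urllib.parse.urldefrag; the helper name split_for_request and the example
URL (including its fragment) are made up for illustration and are not
taken from MediaViewer's code.

```python
from urllib.parse import urldefrag


def split_for_request(uri):
    """Split a URI into the part sent to the server and the fragment.

    Hypothetical helper: a conforming user agent sends only the first
    element over the wire and keeps the second one to itself.
    """
    base, fragment = urldefrag(uri)
    return base, fragment


# Illustrative URL only; the fragment format is an assumption.
example = "https://example.org/wiki/Some_page#mediaviewer/File:Example.jpg"
to_send, kept_locally = split_for_request(example)
print("sent to server:", to_send)
print("kept by client:", kept_locally)
```

A request for the first value is all the server should ever see, so a
fragment showing up in the server logs means the client skipped this step.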
There will be some user agents which do not conform to the RFC (mostly
search engines, I would expect, but maybe some unusual browsers as well),
but I'm pretty sure major browsers do. So you should be aware that you are
only filtering out a fraction of the MediaViewer requests.
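As a rough sketch of why such a filter only sees part of the traffic: the
snippet below picks out logged request URIs that still carry a fragment.
The sample URIs and the helper name logged_fragment are invented for
illustration; the real log format is not shown in this thread.

```python
from urllib.parse import urlsplit


def logged_fragment(request_uri):
    """Return the fragment a client sent to the server, or '' if none."""
    return urlsplit(request_uri).fragment


# Hypothetical request URIs as they might appear in a request log.
sample = [
    "/wiki/Foo",
    "/wiki/Foo#mediaviewer/File:Bar.jpg",
    "/wiki/Baz?action=raw",
]

# Only non-conforming clients end up in this list; a conforming browser
# following the same link would be logged as plain "/wiki/Foo".
with_fragment = [uri for uri in sample if logged_fragment(uri)]
print(with_fragment)
```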
We log an event (with sampling) when MediaViewer is loaded via a URL hash,
and over all wikis it happens about 1M/day - see
http://multimedia-metrics.wmflabs.org/graphs/mmv_actions_global , "hash
load" at the bottom (replace "global" with the DB name for a filtered view;
the raw data is in Schema:MediaViewer
<https://meta.wikimedia.org/wiki/Schema:MediaViewer>); you can use that to
approximate how often such links are really followed.