On 2 Mar 2015, at 00:35, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
Thanks Timo for taking the time to write this.
You're welcome. Thanks for this research. I'm excited about the results.
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the
startup module. As such these are loading JavaScript modules directly, regardless of
browser. There are at least two of these that I know of:
I think our raw Hive data probably does not include the traffic from tools or
wikipedia.org (need to confirm). But even if it did, the traffic from tools on bits is
not significant compared to the traffic from Wikipedia, so it does not affect the overall
results, as we are throwing away the long tail. Note that a couple of days' worth of
traffic might be more than 1 billion requests for JavaScript on bits.
Unless the bits.wikimedia.org traffic statistics filter things out via the Referer header,
I don't see how they could not include traffic triggered by Tool Labs and www-portals like
www.wikipedia.org. They make script requests to bits.wikimedia.org.
But yeah, Tool Labs traffic will be tiny in comparison. I honestly have no clue how
popular our www-portals are. I'd be interested in seeing some stats on that.
Actually, there are probably about a dozen more exceptions I can think of. I don't believe
it is feasible to filter everything out.
Statistically, I do not think you need to. Given the volume of traffic on Wikipedia versus
the other sources, you just cannot report results with a precision of, say, 0.001%. Even
very small wikis, whose traffic is insignificant compared to English Wikipedia, are also
being thrown away.
Point taken. Thanks :)
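
For anyone following along, here is a rough sketch of the long-tail argument above. It is
not the actual analysis: the function name, the threshold, and all request counts are
hypothetical and invented purely for illustration. The idea is that once low-volume
sources are dropped, the share of traffic that was dropped puts a floor on the precision
you can sensibly report.

```python
def drop_long_tail(counts, min_share):
    """Keep only sources whose share of total requests is at least min_share.

    Returns the kept sources and the fraction of all traffic that was dropped.
    """
    total = sum(counts.values())
    kept = {src: n for src, n in counts.items() if n / total >= min_share}
    dropped_share = 1 - sum(kept.values()) / total
    return kept, dropped_share


# Hypothetical request counts, made up for this example (not real traffic data).
requests = {
    "en.wikipedia.org": 900_000_000,
    "es.wikipedia.org": 99_000_000,
    "small-wiki": 900_000,
    "tool-labs": 100_000,
}

# Assumed cutoff: ignore any source with less than 1% of total requests.
kept, dropped_share = drop_long_tail(requests, min_share=0.01)

print(sorted(kept))
print(f"dropped share of traffic: {dropped_share:.4%}")
```

With these made-up numbers the dropped sources add up to roughly 0.1% of requests, so
reporting results to a finer precision than that would not be meaningful, whether or not
the tool and portal traffic had been filtered out first.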