On 2 Mar 2015, at 00:35, Nuria Ruiz <nuria@wikimedia.org> wrote:

Thanks, Timo, for taking the time to write this.



You're welcome. Thanks for this research. I'm excited about the results.



>There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the startup module. As such, these load JavaScript modules directly, regardless of browser. There are at least two of these that I know of:
I think our raw Hive data probably does not include the traffic from tools or wikipedia.org (need to confirm). But even if it did, tools traffic on bits is not significant compared to traffic from Wikipedia, so it does not affect the overall results, since we are throwing away the long tail. Note that a couple of days' worth of traffic might be more than 1 billion requests for JavaScript on bits.


Unless the bits.wikimedia.org traffic statistics filter things out via the Referer header, I don't see how they could not include traffic triggered by Tool Labs and www-portals like www.wikipedia.org. Both make script requests to bits.wikimedia.org.

But yeah, Tool Labs traffic will be tiny in comparison. I honestly have no clue how popular our www-portals are. I'd be interested in seeing some stats on that.



>Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasible to filter everything out.
Statistically, I do not think you need to. Given the volume of traffic on Wikipedia versus the other sources, you just cannot report results with a precision of, say, 0.001%. Even very small wikis, whose traffic is insignificant compared to English Wikipedia, are also being thrown away.

Point taken. Thanks :)