I just linked your results from https://phabricator.wikimedia.org/T58575, but I really think they should be more widely known. Do you mind writing a mail to wikitech@ or engineering@ about this finding?
Gabriel
On Sun, Mar 1, 2015 at 6:24 PM, Nuria Ruiz nuria@wikimedia.org wrote:
> Note that a couple of days' worth of traffic might be more than 1 billion
> requests for javascript on bits.
Sorry, correction: a couple of days' worth of "javascript bits" requests comes to 100 million requests, not 1,000 million.
On Sun, Mar 1, 2015 at 4:35 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Thanks Timo for taking the time to write this.
> The following requests are not part of our primary javascript payload
> and should be excluded when interpreting bits.wikimedia.org requests for
> purposes of javascript "support":
Correct. I think I excluded all of those. Note that the methodology lists "bits javascript traffic", not overall "bits traffic": https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript#Me...
I will double check the startup module just to be safe.
> There are also non-MediaWiki environments (ab)using bits.wikimedia.org and
> bypassing the startup module. As such these are loading javascript modules
> directly, regardless of browser. There are at least two of these that I know of:
I think our raw Hive data probably does not include the traffic from tools or wikipedia.org (I need to confirm). But even if it did, tool traffic on bits is not significant compared to the traffic from Wikipedia, so it does not affect the overall results, as we are throwing away the long tail. Note that a couple of days' worth of traffic might be more than 1 billion requests for javascript on bits.
> Actually, there are probably about a dozen more exceptions I can think
> of. I don't believe it is feasibly possible to filter everything out.
Statistically, I do not think you need to. Given the volume of Wikipedia traffic versus the other sources, you just cannot report results with a precision of, say, 0.001%. Even very small wikis, whose traffic is insignificant compared to English Wikipedia, are also being thrown away. That is to say, if everyone on the Basque Wikipedia started using "browser X" without javascript support, it would not be counted, as it represents too small a percentage of overall traffic.
Results provided are an aggregation over all Wikipedias' raw bits javascript traffic versus the Wikipedias' overall pageviews. Because we are throwing away the long tail, results come from the most trafficked wikis (the disparity in pageviews among our wikis is huge). If you want per-wiki results, you need to analyze the data in a completely different fashion.
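To illustrate what "throwing away the long tail" means in practice, here is a minimal sketch. The function name, the shape of the input (a mapping from some key, such as a wiki or user agent, to a request count) and the threshold value are all assumptions made for illustration, not the actual query used for the report.

```python
def drop_long_tail(counts, min_share):
    """Keep only entries whose share of total traffic is at least min_share.

    `counts` maps a key (e.g. a wiki or a user agent) to a request count.
    `min_share` is a hypothetical cut-off, e.g. 0.01 for 1% of all traffic.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n for key, n in counts.items() if n / total >= min_share}
```

With a cut-off like this, a small wiki whose traffic falls below the threshold disappears from the aggregate no matter how its users' browsers behave, which is why per-wiki conclusions need a different analysis.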
On Sat, Feb 28, 2015 at 4:48 PM, Timo Tijhof ttijhof@wikimedia.org wrote:
Hi,
Here's a few thoughts about what may influence the data you're gathering.
The decision of whether a browser has sufficient support for our Grade A runtime happens client-side based on a combination of feature tests and (unfortunately) user-agent sniffing.
For this reason, our bootstrap script is written using only the most basic syntax and prototype methods (as any other methods would cause a run-time exception). For those familiar, this is somewhat similar to PHP version detection in MediaWiki. The file has to parse and run to a certain point in very old environments.
The following requests are not part of our primary javascript payload and should be excluded when interpreting bits.wikimedia.org requests for purposes of javascript "support":
- stylesheets (e.g. ".css" requests as well as load.php?...&only=styles requests)
- images (e.g. ".png", ".svg" etc. as well as load.php?...&image=.. requests)
- favicons and apple-touch icons (e.g. bits.wikimedia.org/favicon/.., bits.wikimedia.org/apple-touch/..)
- fonts (e.g. bits.wikimedia.org/static-../../fonts/..)
- events (e.g. bits.wikimedia.org/event.gif, bits.wikimedia.org/statsv)
- startup module (bits.wikimedia.org/../load.php?..modules=startup)
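The exclusions above could be sketched as a single predicate over request URLs. This is only an illustration: the function name and the list of image suffixes are assumptions, the URL is expected to include a scheme (e.g. "https://"), and treating only load.php requests as the javascript payload is a simplification, not the filter actually used on the log data.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed set of image extensions; the list above only names ".png" and ".svg".
IMAGE_SUFFIXES = (".png", ".svg", ".gif", ".jpg", ".jpeg", ".ico")


def is_primary_js_request(url):
    """Return True if the URL looks like a primary javascript payload request."""
    parts = urlsplit(url)
    path = parts.path.lower()
    query = parse_qs(parts.query)

    # Stylesheets and images served as static files.
    if path.endswith(".css") or path.endswith(IMAGE_SUFFIXES):
        return False
    # Favicons, apple-touch icons and fonts.
    if path.startswith(("/favicon/", "/apple-touch/")) or "/fonts/" in path:
        return False
    # Event logging endpoints.
    if path in ("/event.gif", "/statsv"):
        return False
    # Assumption: everything else of interest goes through load.php.
    if not path.endswith("/load.php"):
        return False
    # load.php requests for styles or images.
    if query.get("only") == ["styles"] or "image" in query:
        return False
    # The startup module itself is served to every browser.
    if query.get("modules") == ["startup"]:
        return False
    return True
```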
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the startup module. As such these are loading javascript modules directly, regardless of browser. There are at least two of these that I know of:
- Tool Labs tools. Developers there may use bits.wikimedia.org to serve modules like jQuery UI. They may circumvent the startup module and unconditionally load those (which will cause errors in older browsers, but they don't care or are unaware of how this works).
- Portals such as www.wikipedia.org and others.
For the data to be as reliable as feasibly possible, one would want to filter out these "forged" requests not produced by MediaWiki. The best way to filter out requests that bypassed the startup module is to drop requests with no version= query parameter, as well as requests with an outdated version parameter (since developers can copy an old url and hardcode it in their app).
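A minimal sketch of that version check, assuming the set of currently valid version strings is known from elsewhere (how to obtain it is not covered here; the function and parameter names are made up for illustration):

```python
from urllib.parse import urlsplit, parse_qs


def went_through_startup(url, current_versions):
    """Return True if the request carries a version= parameter that is current.

    `current_versions` is a hypothetical set of version strings considered
    up to date for the measurement window.
    """
    query = parse_qs(urlsplit(url).query)
    versions = query.get("version", [])
    # No version= at all: the request bypassed the startup module.
    if len(versions) != 1:
        return False
    # A version that is not current: probably a copied, hardcoded url.
    return versions[0] in current_versions
```

Requests for which this returns False would be dropped before counting user agents.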
Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasibly possible to filter everything out. Perhaps focus your next data-gathering window on a specific payload url - instead of trying to catch all javascript payloads with exclusions for wrong ones.
For example, right now in MediaWiki 1.25wmf18 the jquery/mediawiki base payload has version 20150225T221331Z and is requested by the startup module from url (grabbed from the Network tab in Chrome Dev Tools):
https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en...
Using only a specific url like that to gather user agents that support javascript will produce considerably fewer false positives.
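As a rough sketch of that approach: count user agents only for requests that hit one specific payload path with one specific version. The record shape (pairs of url and user agent), the function name, and matching on host, path and version alone are assumptions made for illustration. A real query would likely also pin the modules parameter.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def js_capable_user_agents(records, host_path, version):
    """Count user agents that requested one specific payload url.

    `records` is an iterable of (url, user_agent) pairs (hypothetical shape).
    `host_path` is the host plus path to match, without scheme or query.
    `version` is the exact version= value the startup module would request.
    """
    counts = Counter()
    for url, user_agent in records:
        parts = urlsplit(url)
        if parts.netloc + parts.path != host_path:
            continue
        if parse_qs(parts.query).get("version") != [version]:
            continue
        counts[user_agent] += 1
    return counts
```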
If you want to incorporate multiple wikis, it'll be a little more work to get all the right urls, but handpicking a dozen wikis will probably be good enough.
This also has the advantage of not being biased by a device's cache size, because, unlike all other modules, the base module is not cached in LocalStorage. It will still benefit from HTTP 304 caching, however. It would help to have your window start simultaneously with the deployment of a new wmf branch to en.wikipedia.org (and other wikis you include in the experiment) so there's a fresh start with caching.
</braindump>
— Timo
On 18 Feb 2015, at 18:07, Nuria Ruiz nuria@wikimedia.org wrote:
> Do you think it's worth getting the UA distribution for CSS requests
> & correlate it with the distribution for page / JS loading?
Yes, we can do that. I would need to gather a new dataset for it, so I've made a new task for it (https://phabricator.wikimedia.org/T89847) and am marking this one as complete: https://phabricator.wikimedia.org/T88560
I would also like to do some research on IE6/IE7, as we should see those in the no-JS list (according to our code: https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup.js), but we only see some of their user agents there. There are definitely IE6/IE7 browsers to which we are serving javascript; I just have to look in detail at what we are serving there. I will report on this. It looks like the startup.js file is being served to all browsers regardless, so I might need to do some more fine-grained queries.
Just consider the 3% as your approximate upper bound for overall traffic, big bots removed. If you just count mobile traffic, numbers in percentage are, of course, a lot higher.
Thanks,
Nuria
On 17 Feb 2015, at 03:38, Nuria Ruiz nuria@wikimedia.org wrote:
Gabriel:
I have run through the data and have a rough estimate of how many of our pageviews are requested from browsers without strong javascript support. It is a preliminary rough estimate, but I think it is pretty useful.
TL;DR: According to our new pageview definition (https://meta.wikimedia.org/wiki/Research:Page_view), about 10% of pageviews come from clients without much javascript support. But, BIG CAVEAT, this includes bot requests. If you remove the easy-to-spot big bots, the percentage is <3%.
Details here (still some homework to do regarding IE6 and IE7) https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
Thanks,
Nuria