Here are a few thoughts about what may influence the data you're gathering.
The decision of whether a browser has sufficient support for our Grade A runtime happens client-side based on a combination of feature tests and (unfortunately) user-agent sniffing.
For this reason, our bootstrap script is written using only the most basic syntax and prototype methods (as any other methods would cause a run-time exception). For those familiar, this is somewhat similar to PHP version detection in MediaWiki. The file has to parse and run to a certain point in very old environments.
* stylesheets (e.g. ".css" requests as well as load.php?...&only=styles requests)
* images (e.g. ".png", ".svg" etc. as well as load.php?...&image=.. requests)
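As a rough sketch (not how MediaWiki or the analytics pipeline actually does it), the two request kinds listed above can be told apart from the URL alone. The function name, the return labels and the extension list beyond ".png" and ".svg" are my own assumptions:

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: only ".png" and ".svg" are named in the thread; the rest are illustrative.
IMAGE_EXTENSIONS = (".png", ".svg", ".gif", ".jpg", ".jpeg")


def classify_request(url: str) -> str:
    """Return "stylesheet", "image" or "other" for a request URL (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path.lower()
    query = parse_qs(parts.query, keep_blank_values=True)

    if path.endswith("load.php"):
        # load.php?...&only=styles serves stylesheets
        if query.get("only") == ["styles"]:
            return "stylesheet"
        # load.php?...&image=.. serves images
        if "image" in query:
            return "image"
        return "other"

    if path.endswith(".css"):
        return "stylesheet"
    if path.endswith(IMAGE_EXTENSIONS):
        return "image"
    return "other"
```

Anything that comes back as "other" would still need the version-based checks discussed further down before it says anything about JavaScript support.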
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the startup module. As such, these load JavaScript modules directly, regardless of browser. There are at least two of these that I know of:
1) Tool Labs tools. Developers there may use bits.wikimedia.org to serve modules like jQuery UI. They may circumvent the startup module and load those unconditionally (which will cause errors in older browsers, but they either don't care or are unaware of how this works).
For the data to be as reliable as possible, one would want to filter out these "forged" requests not produced by MediaWiki. The best way to filter out requests that bypassed the startup module is to drop requests with no version= query parameter, as well as requests with an outdated version parameter (since they can copy an old URL and hardcode it in their app).
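A minimal sketch of that filter, assuming you already have the request URLs and a set of version strings you consider current for the window. The function name, the `current_versions` argument and the example.org URLs are made up for illustration; only the version= parameter itself comes from the description above:

```python
from urllib.parse import urlsplit, parse_qs


def came_via_startup(url: str, current_versions: set[str]) -> bool:
    """Keep a request only if it has exactly one version= value and that value is current.

    Requests without version= (startup module bypassed) and requests with an
    outdated version (old URL hardcoded somewhere) are both rejected.
    """
    query = parse_qs(urlsplit(url).query)
    versions = query.get("version", [])
    return len(versions) == 1 and versions[0] in current_versions


# Hypothetical usage; the version string is the one mentioned in this thread.
current = {"20150225T221331Z"}
urls = [
    "https://example.org/load.php?modules=example.module&version=20150225T221331Z",
    "https://example.org/load.php?modules=example.module",
    "https://example.org/load.php?modules=example.module&version=20140101T000000Z",
]
kept = [u for u in urls if came_via_startup(u, current)]
```

This only removes the two cases named above; it does nothing about the other exceptions, so treat what is left as "less noisy" rather than clean.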
Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasible to filter everything out. Perhaps focus your next data-gathering window on a specific payload URL instead of trying to catch all JavaScript payloads with exclusions for the wrong ones.
For example, right now in MediaWiki 1.25wmf18 the jquery/mediawiki base payload has version 20150225T221331Z and is requested by the startup module from url (grabbed from the Network tab in Chrome Dev Tools):
Using only a specific URL like that to gather user agents that support JavaScript will yield considerably fewer false positives.
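To illustrate the idea, here is a small sketch that tallies user agents only for requests matching one exact payload URL. It assumes the log is available as (url, user_agent) pairs, which is my assumption about the data shape, and the function name and example.org URL are placeholders, not the real payload URL:

```python
from collections import Counter
from typing import Iterable, Tuple


def user_agents_for_payload(
    requests: Iterable[Tuple[str, str]], payload_url: str
) -> Counter:
    """Count user agents over (url, user_agent) pairs whose URL equals payload_url exactly."""
    return Counter(ua for url, ua in requests if url == payload_url)


# Hypothetical log sample and placeholder URL, for illustration only.
payload = "https://example.org/load.php?modules=example.base&version=20150225T221331Z"
log = [
    (payload, "BrowserA/1.0"),
    (payload, "BrowserA/1.0"),
    (payload, "BrowserB/2.0"),
    ("https://example.org/load.php?modules=example.other", "ToolC/0.1"),
]
counts = user_agents_for_payload(log, payload)
```

Matching on the full URL rather than on a pattern is deliberate: anything that did not get the URL from the startup module is unlikely to reproduce it exactly.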
If you want to incorporate multiple wikis, it'll be a little more work to get all the right urls, but handpicking a dozen wikis will probably be good enough.
This also has the advantage of not being biased by device cache size because, unlike all other modules, the base module is not cached in LocalStorage. It will still benefit from HTTP 304 caching, however. It would help to have your window start simultaneously with the deployment of a new wmf branch to en.wikipedia.org (and other wikis you include in the experiment) so there's a fresh start with caching.
</braindump>
— Timo
>
Do you think it's worth getting the UA distribution for CSS requests & correlate it with the distribution for page / JS loading?
I would also like to do some research regarding IE6/IE7, as we should see those in the no-JS list (according to our code: https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup.js) but we only see some of their user agents there. There are definitely IE6/IE7 browsers to which we are serving JavaScript; I just have to look in detail at what exactly we are serving there. Will report on this. It looks like this startup.js file is being served to all browsers regardless, so I might need to do some more fine-grained queries.
Just consider the 3% as your approximate upper bound for overall traffic, with big bots removed. If you count only mobile traffic, the percentages are, of course, a lot higher.
Thanks,
Nuria
Gabriel:
I have run through the data and have a rough estimate of how many of our pageviews are requested from browsers w/o strong JavaScript support. It is a preliminary rough estimate but I think it is pretty useful.
TL;DR
According to our new pageview definition (https://meta.wikimedia.org/wiki/Research:Page_view), about 10% of pageviews come from clients w/o much JavaScript support. But, BIG CAVEAT, this includes bot requests. If you remove the easy-to-spot big bots, the percentage is <3%.
Details here (still some homework to do regarding IE6 and IE7)
Thanks,
Nuria