On Mon, Feb 16, 2015 at 6:38 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Gabriel:
I have run through the data and have a rough estimate of how many of our pageviews are requested from browsers without strong javascript support. It is a preliminary, rough estimate, but I think it is pretty useful.
TL;DR: According to our new pageview definition (https://meta.wikimedia.org/wiki/Research:Page_view), about 10% of pageviews come from clients without much javascript support. But (BIG CAVEAT) this includes bot requests. If you remove the easy-to-spot big bots, the percentage is <3%.
Details here (still some homework to do regarding IE6 and IE7) https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
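As a rough illustration of how removing bots can move the figure from about 10% to under 3%, here is a minimal sketch. The counts are made up for the arithmetic only, and the sketch assumes that none of the identified bots request javascript, which the report does not state.

```python
def no_js_share(total_pageviews, no_js_pageviews, bot_pageviews=0):
    """Share of pageviews from clients without javascript support.

    Assumption of this sketch: identified bots never request javascript,
    so they are removed from both the numerator and the denominator.
    """
    remaining_total = total_pageviews - bot_pageviews
    remaining_no_js = no_js_pageviews - bot_pageviews
    if remaining_total <= 0:
        raise ValueError("no non-bot pageviews left")
    return remaining_no_js / remaining_total

# Hypothetical counts, chosen only to illustrate the arithmetic.
total = 1_000_000
no_js = 100_000
bots = 75_000

with_bots = no_js_share(total, no_js)
without_bots = no_js_share(total, no_js, bots)
print(f"{with_bots:.1%} including bots, {without_bots:.1%} excluding them")
```

With these invented counts the share drops from 10% to a little under 3%, because bots make up most of the no-javascript requests.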
Thanks,
Nuria
On Mon, Feb 16, 2015 at 7:19 PM, Gabriel Wicke gwicke@wikimedia.org wrote:
Thank you, Nuria!
That number is higher than I expected given that the general web was apparently closer to 1.3% in 2010 http://stackoverflow.com/questions/9478737/browser-statistics-on-javascript-disabled. Do you think there are ways to fine-tune this, perhaps by excluding clients that also didn't download images?
Finally, is there a way to gauge the difference in JS support between anonymous & authenticated users from this data?
Gabriel
On Mon, Feb 16, 2015 at 9:23 PM, Nuria Ruiz nuria@wikimedia.org wrote:
> That number is higher than I expected given that the general web was apparently closer to 1.3% in 2010 http://stackoverflow.com/questions/9478737/browser-statistics-on-javascript-disabled.
Hmm, that study looks too old to be relevant. Two things on that:
*1) Numbers from 2010 do not include mobile browsers that are in widespread use nowadays.* For example, Opera Mini: more than 1% of our requests come from this browser alone (and I bet that in 2015 Yahoo is seeing quite a few of those). Note that this 1% is a more precise figure; it is derived directly from the Hadoop logs and requires no guesswork.
So it is not surprising that the share of pageviews without javascript has gone up once you take mobile into account. Opera Mini does not support javascript in the ways you would expect: https://dev.opera.com/articles/opera-mini-and-javascript/
*2) Our data differs from global stats in significant ways.* For example, our IE6 and IE7 traffic is much higher than the global stats reported by http://gs.statcounter.com/ for the month of January. Note that these browser percentages are more precise estimates on our end (unlike the javascript estimate, which requires some cross-checking and guesswork). Also note that the total percentage we report over pageviews includes bots, so once those are excluded our IE6 and IE7 share is even higher than the figures below.
Browser: percentage of total pageviews by our count, global percentage by StatCounter
IE6: 1.01%, 0.09%
IE7: 0.7%, 0.14%
I do not expect our numbers to match StatCounter's 100%, but I think it is an OK guide for cross-checking, especially because they deploy their beacons worldwide: http://gs.statcounter.com/faq#methodology
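To make the size of the gap concrete, this small sketch divides the two IE percentages quoted above by the corresponding StatCounter figures. It only does that arithmetic; the variable names are made up for the example.

```python
# Percentages quoted above: (our share of pageviews, StatCounter global share).
shares = {"IE6": (1.01, 0.09), "IE7": (0.7, 0.14)}

# How many times larger our share is than the global one.
ratios = {browser: ours / theirs for browser, (ours, theirs) in shares.items()}

for browser, ratio in ratios.items():
    print(f"{browser}: {ratio:.1f}x the StatCounter share")
```

By these figures our IE6 share is roughly eleven times the global one and our IE7 share about five times.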
> Finally, is there a way to gauge the difference in JS support between anonymous & authenticated users from this data?
No, I do not think we can do that with this dataset.
On Wed, Feb 18, 2015 at 7:17 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Sorry, I forgot to address this earlier:
> Do you think there are ways to fine-tune this, perhaps by excluding clients that also didn't download images?
We can look at that, but I suspect the results will not differ much. Let me know if you think it is necessary.
On Wed, Feb 18, 2015 at 8:01 AM, Gabriel Wicke gwicke@wikimedia.org wrote:
Nuria,
your explanation for the increase of the no-JS numbers makes a lot of sense to me. This is very valuable information, as it contradicts the assumption that JS support kept going up in the meantime. Thank you!
The main thing I'm slightly worried about with relatively small total numbers is that even one or two bots masquerading as old browser UAs could skew the results for those UAs. This matters especially once we are trying to establish trends based on this early measurement. Using additional behavioral factors like image (or, perhaps better, CSS) loading might help to more precisely weed out non-browser users, which could benefit our UA detection precision in general. Do you think it's worth getting the UA distribution for CSS requests & correlate it with the distribution for page / JS loading?
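One way to picture this correlation is the sketch below. It assumes we have one user-agent string per page request and one per CSS request; the function name and both thresholds are hypothetical. Browsers cache stylesheets, so a ratio well below 1 is normal and only a ratio near zero is treated as suspicious.

```python
from collections import Counter

def suspicious_user_agents(page_requests, css_requests,
                           min_pageviews=100, max_css_ratio=0.05):
    """Return {user_agent: css_per_page_ratio} for UAs that load pages
    but almost never load CSS. Both thresholds are illustrative guesses."""
    pages = Counter(page_requests)
    css = Counter(css_requests)
    flagged = {}
    for ua, n_pages in pages.items():
        if n_pages < min_pageviews:
            continue  # too few requests to say anything about this UA
        ratio = css[ua] / n_pages
        if ratio <= max_css_ratio:
            flagged[ua] = ratio
    return flagged

# Made-up request logs: one UA string per request.
page_log = ["old-browser-ua"] * 200 + ["modern-browser-ua"] * 300
css_log = ["old-browser-ua"] * 2 + ["modern-browser-ua"] * 280

flagged = suspicious_user_agents(page_log, css_log)
print(flagged)
```

In this invented log the old-browser UA requests 200 pages but only 2 stylesheets, so it is flagged as a likely non-browser client.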
Gabriel
On 18 Feb 2015, at 18:07, Nuria Ruiz nuria@wikimedia.org wrote:
> Do you think it's worth getting the UA distribution for CSS requests & correlate it with the distribution for page / JS loading?
Yes, we can do that. I would need to gather a new dataset for it, so I've made a new task for it (https://phabricator.wikimedia.org/T89847) and am marking this one as complete: https://phabricator.wikimedia.org/T88560
I would also like to do some research regarding IE6/IE7: according to our code (https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup.js) we should see those in the no-JS list, but we only see some of their user agents there. There are definitely IE6/IE7 browsers to which we are serving javascript; I just have to look in detail at what exactly we are serving there. I will report on this. It looks like the startup.js file is served to all browsers regardless, so I might need to run some more fine-grained queries.
Just consider the 3% as your approximate upper bound for overall traffic, big bots removed. If you count only mobile traffic, the percentages are, of course, a lot higher.
Thanks,
Nuria
On Sat, Feb 28, 2015 at 4:48 PM, Timo Tijhof ttijhof@wikimedia.org wrote:
Hi,
Here are a few thoughts about what may influence the data you're gathering.
The decision of whether a browser has sufficient support for our Grade A runtime happens client-side based on a combination of feature tests and (unfortunately) user-agent sniffing.
For this reason, our bootstrap script is written using only the most basic syntax and prototype methods (as any other methods would cause a run-time exception). For those familiar, this is somewhat similar to PHP version detection in MediaWiki. The file has to parse and run to a certain point in very old environments.
The following requests are not part of our primary javascript payload and should be excluded when interpreting bits.wikimedia.org requests for purposes of javascript "support":
* stylesheets (e.g. ".css" requests as well as load.php?...&only=styles requests)
* images (e.g. ".png", ".svg" etc. as well as load.php?...&image=.. requests)
* favicons and apple-touch icons (e.g. bits.wikimedia.org/favicon/.., bits.wikimedia.org/apple-touch/..)
* fonts (e.g. bits.wikimedia.org/static-../../fonts/..)
* events (e.g. bits.wikimedia.org/event.gif, bits.wikimedia.org/statsv)
* startup module (bits.wikimedia.org/../load.php?..modules=startup)
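The exclusions listed above could be expressed as a URL filter along these lines. This is a sketch, not the query used for the report: the function name is hypothetical, the image suffixes beyond ".png" and ".svg" are an assumed reading of "etc.", and the example URLs are invented.

```python
from urllib.parse import urlsplit, parse_qs

# ".png" and ".svg" come from the list above; the others are assumptions.
IMAGE_SUFFIXES = (".png", ".svg", ".gif", ".jpg", ".jpeg")

def is_excluded_from_js_count(url):
    """Return True if a bits.wikimedia.org request is in a listed category."""
    parts = urlsplit(url)
    path = parts.path
    query = parse_qs(parts.query, keep_blank_values=True)
    if path.endswith(".css") or query.get("only") == ["styles"]:
        return True  # stylesheets
    if path.endswith(IMAGE_SUFFIXES) or "image" in query:
        return True  # images
    if path.startswith(("/favicon/", "/apple-touch/")):
        return True  # favicons and apple-touch icons
    if "/fonts/" in path:
        return True  # fonts
    if path in ("/event.gif", "/statsv"):
        return True  # events
    if query.get("modules") == ["startup"]:
        return True  # startup module
    return False

# Made-up URLs for illustration only.
examples = [
    "https://bits.wikimedia.org/favicon/example.ico",
    "https://bits.wikimedia.org/statsv?example=1",
    "https://bits.wikimedia.org/example.org/load.php?modules=startup&only=scripts",
    "https://bits.wikimedia.org/example.org/load.php?modules=example.module&only=scripts",
]
for url in examples:
    print(is_excluded_from_js_count(url), url)
```

Only the last example survives the filter, since it is a load.php request for a module other than startup.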
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the startup module. As such these are loading javascript modules directly, regardless of browser. There are at least two of these that I know of:
1) Tool labs tools. Developers there may use bits.wikimedia.org to serve modules like jQuery UI. They may circumvent the startup module and unconditionally load those (which will cause errors in older browsers, but they don't care or are unaware of how this works).
2) Portals such as www.wikipedia.org and others.
For the data to be as reliable as feasibly possible, one would want to filter out these "forged" requests not produced by MediaWiki. The best way to filter out requests that bypassed the startup module is to drop requests with no version= query parameter, as well as requests with an outdated version parameter (since they can copy an old url and hardcode it in their app).
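A minimal sketch of that version check follows. It assumes the caller supplies the set of version values that are currently deployed; the function name and the example URLs are hypothetical, and the one version string is the value mentioned in this message.

```python
from urllib.parse import urlsplit, parse_qs

def passed_through_startup(url, current_versions):
    """True if the request carries exactly one version= value and that
    value is in the caller-supplied set of currently deployed versions."""
    query = parse_qs(urlsplit(url).query)
    versions = query.get("version", [])
    return len(versions) == 1 and versions[0] in current_versions

current = {"20150225T221331Z"}

# Made-up URLs for illustration only.
fresh = "https://bits.wikimedia.org/example.org/load.php?modules=example.module&version=20150225T221331Z"
stale = "https://bits.wikimedia.org/example.org/load.php?modules=example.module&version=20140101T000000Z"
missing = "https://bits.wikimedia.org/example.org/load.php?modules=example.module"

for url in (fresh, stale, missing):
    print(passed_through_startup(url, current), url)
```

Requests with a hardcoded old version or with no version at all are rejected, which is the behaviour described above.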
Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasibly possible to filter everything out. Perhaps focus your next data-gathering window on a specific payload url - instead of trying to catch all javascript payloads with exclusions for wrong ones.
For example, right now in MediaWiki 1.25wmf18 the jquery/mediawiki base payload has version 20150225T221331Z and is requested by the startup module from url (grabbed from the Network tab in Chrome Dev Tools):
https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en...
Using only a specific url like that to gather user agents that support javascript will produce considerably fewer false positives.
If you want to incorporate multiple wikis, it'll be a little more work to get all the right urls, but handpicking a dozen wikis will probably be good enough.
This also has the advantage of not being biased by a device's cache size because, unlike all other modules, the base module is not cached in LocalStorage. It will still benefit from HTTP 304 caching, however. It would help to have your window start simultaneously with the deployment of a new wmf branch to en.wikipedia.org (and other wikis you include in the experiment) so there's a fresh start with caching.
</braindump>
— Timo
On Sun, Mar 1, 2015 at 4:35 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Thanks, Timo, for taking the time to write this.
> The following requests are not part of our primary javascript payload and should be excluded when interpreting bits.wikimedia.org requests for purposes of javascript "support":
Correct. I think I excluded all of those. Note that the methodology lists "bits javascript traffic", not overall "bits traffic": https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript#Me...
I will double-check the startup module just to be safe.
> There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the startup module. As such these are loading javascript modules directly, regardless of browser. There are at least two of these that I know of:
I think our raw Hive data probably does not include the traffic from tools or wikipedia.org (need to confirm). But even if it did, the traffic from tools on bits is not significant compared to that from Wikipedia, so it does not affect the overall results, as we are throwing away the long tail. Note that a couple of days' worth of traffic might be more than 1 billion requests for javascript on bits.
> Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasibly possible to filter everything out.
Statistically, I do not think you need to, given the volume of traffic on Wikipedia versus the other sources; you just cannot report results with a precision of, say, 0.001%. Even very small wikis, whose traffic is insignificant compared to English Wikipedia, are being thrown away. That is to say, if everyone on the Basque Wikipedia started using "browser X" without javascript support, it would not be counted, as it represents too small a percentage of overall traffic. The results provided are an aggregation of all Wikipedias' raw bits javascript traffic versus the Wikipedias' overall pageviews. Because we are throwing away the long tail, results come from the most trafficked wikis (our disparity in pageviews among wikis is huge). If you want per-wiki results, you need to analyze the data in a completely different fashion.
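The effect of throwing away the long tail can be sketched as follows. The cutoff value, the function name, and the counts are all hypothetical; the report does not say which threshold was used.

```python
def drop_long_tail(counts, min_share=0.001):
    """Keep only entries whose share of the total is at least min_share.

    min_share is an illustrative cutoff, not the one used in the report.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n for key, n in counts.items() if n / total >= min_share}

# Made-up request counts per source.
requests_by_source = {
    "large wiki": 900_000,
    "medium wiki": 99_500,
    "small wiki": 500,
}

kept = drop_long_tail(requests_by_source)
print(kept)
```

Here the small wiki contributes 0.05% of requests and is dropped, so whatever its users run cannot influence the aggregate percentages.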
> Note that a couple of days' worth of traffic might be more than 1 billion requests for javascript on bits.
Sorry, correction: a couple of days' worth of "javascript bits" requests comes to 100 million requests, not 1000 million.
On Sun, Mar 1, 2015 at 4:35 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Thanks Timo for taking the time to write this.
The following requests are not part of our primary javascript payload
and should be excluded when >interpreting bits.wikimedia.org requests for purposes of javascript "support": Correct. I think I excluded all those. Note that I listed on methodology "bits javascript traffic" not overall "bits traffic" https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript#Me...
I will double check the startup module just to be safe.
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and
bypassing the startup module. As such these are loading javascript modules directly, regardless of browser. There are at least two of these that I know of: I think our raw hive data probably does not includes the traffic from tools or wikipedia.org (need to confirm). But even if it did, the traffic of tools on bits is not significant compared to the one from wikipedia thus does not affect the overall results as we are throwing away the longtail. Note that couple days worth of traffic might be more than a 1 billion requests for javascript on bits.
> Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasibly possible to filter everything out.
Statistically, I do not think you need to. Given the volume of Wikipedia traffic versus the other sources, you just cannot report results with a precision of, say, 0.001%. Even very small wikis, whose traffic is insignificant compared to English Wikipedia, are also being thrown away. That is to say, if everyone on the Basque Wikipedia started using "browser X" w/o Javascript support, it would not be counted, as it represents too small a percentage of overall traffic. The results provided are an aggregation of raw javascript traffic on bits across all Wikipedias versus overall Wikipedia pageviews. Because we are throwing away the long tail, results come from the most trafficked wikis (the disparity in pageviews among our wikis is huge). If you want per-wiki results, you need to analyze the data in a completely different fashion.
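To illustrate the long-tail point above, here is a minimal Python sketch. It is not the query used for the report: the function name, the `min_share` cutoff and the toy counts are all made up for illustration. It only shows how dropping user agents below a share threshold keeps a small source (like "browser X") from influencing the aggregate.

```python
def no_js_share(pageviews, js_requests, min_share=0.001):
    """Share of pageviews from user agents that never requested javascript.

    pageviews   -- dict of user agent -> pageview count (hypothetical input)
    js_requests -- dict of user agent -> javascript request count on bits
    min_share   -- user agents below this share of all pageviews are dropped
                   (an assumed cutoff, not the one used in the report)
    """
    total = sum(pageviews.values())
    # Throw away the long tail: user agents with a negligible share of traffic.
    kept = {ua: n for ua, n in pageviews.items() if n / total >= min_share}
    kept_total = sum(kept.values())
    # Pageviews from the remaining user agents that made no javascript requests.
    no_js = sum(n for ua, n in kept.items() if js_requests.get(ua, 0) == 0)
    return no_js / kept_total


# Toy numbers, invented for this example only.
toy_pageviews = {"browser A": 9000, "browser B": 990, "browser X": 10}
toy_js_requests = {"browser A": 20000}
print(no_js_share(toy_pageviews, toy_js_requests, min_share=0.01))
```

With a 1% cutoff, "browser X" is dropped before the ratio is computed, which is why per-wiki or per-niche results would need a different analysis.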
On Sat, Feb 28, 2015 at 4:48 PM, Timo Tijhof ttijhof@wikimedia.org wrote:
Hi,
Here's a few thoughts about what may influence the data you're gathering.
The decision of whether a browser has sufficient support for our Grade A runtime happens client-side based on a combination of feature tests and (unfortunately) user-agent sniffing.
For this reason, our bootstrap script is written using only the most basic syntax and prototype methods (as any other methods would cause a run-time exception). For those familiar, this is somewhat similar to PHP version detection in MediaWiki. The file has to parse and run to a certain point in very old environments.
The following requests are not part of our primary javascript payload and should be excluded when interpreting bits.wikimedia.org requests for purposes of javascript "support":
- stylesheets (e.g. ".css" requests as well as load.php?...&only=styles requests)
- images (e.g. ".png", ".svg" etc. as well as load.php?...&image=.. requests)
- favicons and apple-touch icons (e.g. bits.wikimedia.org/favicon/.., bits.wikimedia.org/apple-touch/..)
- fonts (e.g. bits.wikimedia.org/static-../../fonts/..)
- events (e.g. bits.wikimedia.org/event.gif, bits.wikimedia.org/statsv)
- startup module (bits.wikimedia.org/../load.php?..modules=startup)
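The exclusion list above could be expressed roughly as the following Python sketch. The function name, the suffix list and the exact path checks are assumptions made for illustration; real bits.wikimedia.org URLs may need more cases than these.

```python
from urllib.parse import urlsplit, parse_qs

# File suffixes treated as non-javascript assets (an assumed, incomplete list).
STATIC_SUFFIXES = (".css", ".png", ".svg", ".gif", ".jpg", ".ico",
                   ".woff", ".ttf", ".eot")


def is_javascript_payload(url):
    """Return True if a bits request looks like part of the javascript payload."""
    # Accept both full URLs and scheme-less "bits.wikimedia.org/..." strings.
    parts = urlsplit(url if "//" in url else "//" + url)
    path = parts.path.lower()
    query = parse_qs(parts.query, keep_blank_values=True)

    if path.startswith(("/favicon/", "/apple-touch/")):
        return False  # favicons and apple-touch icons
    if "/fonts/" in path:
        return False  # fonts
    if path in ("/event.gif", "/statsv"):
        return False  # event logging endpoints
    if path.endswith(STATIC_SUFFIXES):
        return False  # stylesheets and images served as plain files
    if path.endswith("/load.php"):
        if query.get("only") == ["styles"]:
            return False  # stylesheet requests
        if "image" in query:
            return False  # image requests
        if query.get("modules") == ["startup"]:
            return False  # startup module, served to every browser
        return True
    return path.endswith(".js")
```

Applied to a request log, this keeps only load.php script requests and plain ".js" files, which is the subset the list describes as relevant to javascript "support".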
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the startup module. As such these are loading javascript modules directly, regardless of browser. There are at least two of these that I know of:
- Tool labs tools. Developers there may use bits.wikimedia.org to serve modules like jQuery UI. They may circumvent the startup module and unconditionally load those (which will cause errors in older browsers, but they don't care or are unaware of how this works).
- Portals such as www.wikipedia.org and others.
For the data to be as reliable as feasibly possible, one would want to filter out these "forged" requests not produced by MediaWiki. The best way to filter out requests that bypassed the startup module is to drop requests with no version= query parameter, as well as requests with an outdated version parameter (since they can copy an old url and hardcode it in their app).
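A minimal Python sketch of that version= filter follows. The function name and the idea of passing in a set of currently deployed version strings are assumptions for illustration; how that set would be obtained is not covered here.

```python
from urllib.parse import urlsplit, parse_qs


def came_through_startup(url, current_versions):
    """Return True if the request carries a known, current version= parameter.

    current_versions -- collection of version strings considered current
                        (hypothetical input, e.g. gathered at deploy time)
    """
    query = parse_qs(urlsplit(url).query)
    versions = query.get("version", [])
    # No version= at all suggests the startup module was bypassed; an unknown
    # value suggests a copied, hardcoded url from an older deployment.
    return len(versions) == 1 and versions[0] in current_versions
```

Requests failing this check would be dropped before counting user agents, at the cost of also dropping any legitimate request whose version is missing from `current_versions`.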
Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasibly possible to filter everything out. Perhaps focus your next data-gathering window on a specific payload url - instead of trying to catch all javascript payloads with exclusions for wrong ones.
For example, right now in MediaWiki 1.25wmf18 the jquery/mediawiki base payload has version 20150225T221331Z and is requested by the startup module from url (grabbed from the Network tab in Chrome Dev Tools):
https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en...
Using only a specific url like that to gather user agents that support javascript will have considerably fewer false positives.
If you want to incorporate multiple wikis, it'll be a little more work to get all the right urls, but handpicking a dozen wikis will probably be good enough.
This also has the advantage of not being biased by device cache size because, unlike all other modules, the base module is not cached in LocalStorage. It will still benefit from HTTP 304 caching, however. It would help to have your window start simultaneously with the deployment of a new wmf branch to en.wikipedia.org (and other wikis you include in the experiment) so there's a fresh start with caching.
</braindump>
— Timo
On 18 Feb 2015, at 18:07, Nuria Ruiz nuria@wikimedia.org wrote:
> Do you think it's worth getting the UA distribution for CSS requests & correlate it with the distribution for page / JS loading?
Yes, we can do that. I would need to gather a new dataset for it, so I've made a new task for it (https://phabricator.wikimedia.org/T89847) and am marking this one as complete: https://phabricator.wikimedia.org/T88560
I would also like to do some research regarding IE6/IE7, as we should see those in the no-JS list (according to our code: https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup.js), but we only see some user agents there. There are definitely IE6/IE7 browsers to which we are serving javascript; I just have to look in detail at what we are serving there. Will report on this. It looks like this startup.js file is being served to all browsers regardless, so I might need to do some more fine-grained queries.
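As a starting point for those finer-grained queries, here is a small Python sketch for picking IE6/IE7 user agents out of a list of javascript requests. The regular expression is an assumption made for this example and is not the check used in startup.js; the function names are hypothetical too.

```python
import re
from collections import Counter

# Matches "MSIE 6." or "MSIE 7." anywhere in the user agent string.
OLD_IE = re.compile(r"\bMSIE ([67])\.")


def old_ie_version(user_agent):
    """Return 6 or 7 if the user agent looks like IE6/IE7, else None."""
    # IE8 and later in Compatibility View send "MSIE 7.0" together with a
    # "Trident/" token, so those are not counted as real IE7 here.
    if "Trident/" in user_agent:
        return None
    match = OLD_IE.search(user_agent)
    return int(match.group(1)) if match else None


def count_old_ie(user_agents):
    """Count javascript requests per old IE major version."""
    counts = Counter()
    for ua in user_agents:
        version = old_ie_version(ua)
        if version is not None:
            counts[version] += 1
    return counts
```

Running `count_old_ie` over the user agents of non-startup javascript requests would give a first idea of how many IE6/IE7 clients are being served the full payload.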
Just consider the 3% as your approximate upper bound for overall traffic, big bots removed. If you count only mobile traffic, the percentages are, of course, a lot higher.
Thanks,
Nuria
I just linked your results from https://phabricator.wikimedia.org/T58575, but I really think that they should be more widely known. Do you mind writing a mail to wikitech@ or engineering@ about this finding?
Gabriel
On Sun, Mar 1, 2015 at 6:24 PM, Nuria Ruiz nuria@wikimedia.org wrote:
>> Note that couple days worth of traffic might be more than a 1 billion requests for javascript on bits.
> Sorry, correction. Couple days worth of "javascript bits" requests comes up to 100 million requests not a 1000 million.
On 2 Mar 2015, at 00:35, Nuria Ruiz nuria@wikimedia.org wrote:
> Thanks Timo for taking the time to write this.
You're welcome. Thanks for this research. I'm excited about the results.
>> There are also non-MediaWiki environments (ab)using bits.wikimedia.org and bypassing the startup module. As such these are loading javascript modules directly, regardless of browser. There are at least two of these that I know of:
> I think our raw hive data probably does not include the traffic from tools or wikipedia.org (need to confirm). But even if it did, the traffic of tools on bits is not significant compared to the one from wikipedia thus does not affect the overall results as we are throwing away the long tail. Note that couple days worth of traffic might be more than a 1 billion requests for javascript on bits.
Unless the bits.wikimedia.org traffic statistics filter things out via the Referer header, I don't see how they could not include traffic triggered by Tool Labs and www-portals like www.wikipedia.org. They make script requests to bits.wikimedia.org.
But yeah, Tool Labs traffic will be tiny in comparison. I honestly have no clue how popular our www-portals are. I'd be interested in seeing some stats on that.
>> Actually, there are probably about a dozen more exceptions I can think of. I don't believe it is feasibly possible to filter everything out.
> Statistically I do not think you need to, given the volume of traffic in wikipedia versus the other sources, you just cannot report results with a precision of, say, 0.001%. Even very small wikis - whose traffic is insignificant compared to english wikipedia- are also being thrown away.
Point taken. Thanks :)
https://www.mediawiki.org/wiki/Analytics/Reports/Clients_without_JavaScript#...
> What about IE6/IE7
> According to our code base we should not be serving any Javascript to IE6/IE7 browsers other than the startup code that checks for browser compatibility. [..] However, we do see some user agents on bits for IE6/IE7. For example [IE6] is responsible for 1% of total pageviews. Likely this browser is not identified by our code as IE6 and thus is being served Javascript (this is a bug). We need to do a little bit more research here to see the javascript requests being served.
Let me know if I can be of help interpreting the result from here and investigating where they originate from in our application stack.
Would it be possible for me to query a few rows from this myself? I found documentation on Hive at wikitech[1], but couldn't find which server to use or who has access. I'm happy to ask someone, I don't need access there. Just curious if it was possible.
-- Timo
[1] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries
> Let me know if I can be of help interpreting the result from here and investigating where they originate from in our application stack.
> Would it be possible for me to query a few rows from this myself? I found documentation on Hive at wikitech[1], but couldn't find which server to use or who has access. I'm happy to ask someone, I don't need access there. Just curious if it was possible.
Thanks for offering! Once you have filed the access requests and they have been granted, we can work on this together.
I have added access info to our Hive 101 page: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Cluster_A...
On Tue, Mar 24, 2015 at 7:17 PM, Timo Tijhof ttijhof@wikimedia.org wrote: