Sorry, correction: a couple of days' worth of "javascript bits" requests
comes to 100 million requests, not 1,000 million.
On Sun, Mar 1, 2015 at 4:35 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
Thanks Timo for taking the time to write this.
The following requests are not part of our primary javascript payload and
should be excluded when interpreting bits.wikimedia.org requests for
purposes of javascript "support":
Correct. I think I excluded all those.
Note that in the methodology I listed "bits javascript traffic", not
overall "bits traffic":
https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript#M…
I will double check the startup module just to be safe.
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and
bypassing the startup module. As such these are loading javascript modules
directly, regardless of browser. There are at least two of these that I
know of:
I think our raw Hive data probably does not include the traffic from tools
or wikipedia.org (need to confirm). But even if it did, the traffic of
tools on bits is not significant compared to the traffic from Wikipedia, so
it does not affect the overall results, as we are throwing away the long
tail. Note that a couple of days' worth of traffic might be more than 1
billion requests for javascript on bits.
Actually, there are probably about a dozen more exceptions I can think
of. I don't believe it is feasibly possible to filter everything out.
Statistically I do not think you need to. Given the volume of traffic on
Wikipedia versus the other sources, you just cannot report results with a
precision of, say, 0.001%. Even very small wikis, whose traffic is
insignificant compared to English Wikipedia, are also being thrown away.
That is to say, if everyone on the Basque Wikipedia started using
"browser X" w/o Javascript support, it would not be counted, as it
represents too small a percentage of overall traffic. The results provided
are an aggregation of all Wikipedias' raw javascript traffic on bits versus
the Wikipedias' overall pageviews. Because we are throwing away the long
tail, results come from the most trafficked wikis (our disparity in
pageviews among wikis is huge). If you want to get per-wiki results you
need to analyze the data in a completely different fashion.
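To illustrate the long-tail point above, here is a minimal sketch of what
dropping the long tail does to a small group of clients. It is not the
actual analysis code: the function name, the threshold and the counts are
all made up for illustration.

```python
def drop_long_tail(counts, min_share):
    """Keep only entries whose share of the total is at least min_share.

    counts maps a label (e.g. a user agent) to a request count.
    Both the shape of the input and the threshold are assumptions.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n for label, n in counts.items() if n / total >= min_share}


# Illustrative numbers only, not real traffic data.
example_counts = {"browser A": 900_000, "browser B": 99_990, "browser X": 10}
kept = drop_long_tail(example_counts, min_share=0.001)
```

With these made-up numbers "browser X" accounts for 0.001% of requests and
falls below the threshold, so it disappears from the aggregated result no
matter how dominant it is on one small wiki.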
On Sat, Feb 28, 2015 at 4:48 PM, Timo Tijhof <ttijhof(a)wikimedia.org> wrote:
Hi,
Here are a few thoughts about what may influence the data you're gathering.
The decision of whether a browser has sufficient support for our Grade A
runtime happens client-side based on a combination of feature tests and
(unfortunately) user-agent sniffing.
For this reason, our bootstrap script is written using only the most
basic syntax and prototype methods (as any other methods would cause a
run-time exception). For those familiar, this is somewhat similar to PHP
version detection in MediaWiki. The file has to parse and run to a certain
point in very old environments.
The following requests are not part of our primary javascript payload and
should be excluded when interpreting bits.wikimedia.org requests for
purposes of javascript "support":
* stylesheets (e.g. ".css" requests as well as load.php?...&only=styles
  requests)
* images (e.g. ".png", ".svg" etc. as well as load.php?...&image=..
  requests)
* favicons and apple-touch icons (e.g. bits.wikimedia.org/favicon/..,
  bits.wikimedia.org/apple-touch/..)
* fonts (e.g. bits.wikimedia.org/static-../../fonts/..)
* events (e.g. bits.wikimedia.org/event.gif, bits.wikimedia.org/statsv)
* startup module (bits.wikimedia.org/../load.php?..modules=startup)
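The exclusion list above could be sketched roughly as follows. This is only
an illustration under assumptions: the function name, the exact path
prefixes, and the image suffixes beyond ".png" and ".svg" are guesses, and
the function expects full URLs including the scheme.

```python
from urllib.parse import urlsplit, parse_qs

# Suffixes beyond .png and .svg are assumptions for illustration.
IMAGE_SUFFIXES = (".png", ".svg", ".gif", ".jpg", ".jpeg")


def exclusion_reason(url):
    """Return why a bits request is not javascript payload, or None."""
    parts = urlsplit(url)
    path = parts.path.lower()
    query = parse_qs(parts.query, keep_blank_values=True)

    if path.endswith("/load.php"):
        if query.get("only") == ["styles"]:
            return "stylesheet"
        if "image" in query:
            return "image"
        if query.get("modules") == ["startup"]:
            return "startup"
        return None  # some other load.php request; keep it
    if path.startswith("/favicon/") or path.startswith("/apple-touch/"):
        return "icon"
    if path in ("/event.gif", "/statsv"):
        return "event"
    if path.startswith("/static-") and "/fonts/" in path:
        return "font"
    if path.endswith(".css"):
        return "stylesheet"
    if path.endswith(IMAGE_SUFFIXES):
        return "image"
    return None
```

A request would then be kept for the javascript analysis only when
exclusion_reason(url) returns None.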
There are also non-MediaWiki environments (ab)using bits.wikimedia.org and
bypassing the startup module. As such these are loading javascript modules
directly, regardless of browser. There are at least two of these that I
know of:
1) Tool labs tools. Developers there may use bits.wikimedia.org to serve
modules like jQuery UI. They may circumvent the startup module and
unconditionally load those (which will cause errors in older browsers, but
they don't care or are unaware of how this works).
2) Portals such as www.wikipedia.org and others.
For the data to be as reliable as feasibly possible, one would want to
filter out these "forged" requests not produced by MediaWiki. The best way
to filter out requests that bypassed the startup module is to drop requests
with no version= query parameter, as well as requests with an outdated
version parameter (since they can copy an old url and hardcode it in their
app).
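A minimal sketch of that version filter, under assumptions: the function
name is hypothetical, and the caller has to supply the set of version
strings considered current for the window being analyzed (how to obtain
that set is not covered here).

```python
from urllib.parse import urlsplit, parse_qs


def has_current_version(url, current_versions):
    """True if the url carries exactly one version= value that is current.

    current_versions is a caller-supplied set of version strings; requests
    with no version= or an outdated one are treated as having bypassed
    the startup module.
    """
    versions = parse_qs(urlsplit(url).query).get("version", [])
    return len(versions) == 1 and versions[0] in current_versions
```

For example, with current_versions = {"20150225T221331Z"}, a load.php url
with no version= parameter, or with a hardcoded older value, returns False.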
Actually, there are probably about a dozen more exceptions I can think
of. I don't believe it is feasibly possible to filter everything out.
Perhaps focus your next data-gathering window on a specific payload url -
instead of trying to catch all javascript payloads with exclusions for
wrong ones.
For example, right now in MediaWiki 1.25wmf18 the jquery/mediawiki base
payload has version 20150225T221331Z and is requested by the startup module
from url (grabbed from the Network tab in Chrome Dev Tools):
https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=e…
Using only a specific url like that to gather user agents that support
javascript will yield considerably fewer false positives.
If you want to incorporate multiple wikis, it'll be a little more work to
get all the right urls, but handpicking a dozen wikis will probably be good
enough.
This also has the advantage of not being biased by device cache size,
because, unlike all other modules, the base module is not cached in
LocalStorage. It will still benefit from HTTP 304 caching, however. It
would help to have your window start simultaneously with the deployment of
a new wmf branch to en.wikipedia.org (and other wikis you include in the
experiment) so there's a fresh start with caching.
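The single-payload approach described above could look roughly like this.
It is a sketch only: the record shape (url, user agent) and the function
name are assumptions, and the payload urls have to be supplied by hand
since the example url above is truncated.

```python
from collections import Counter


def user_agents_for_payloads(records, payload_urls):
    """Count user agents among requests for the chosen base payload urls.

    records is an iterable of (url, user_agent) pairs (assumed shape);
    payload_urls is a hand-picked set of exact urls, one per wiki.
    """
    counts = Counter()
    for url, user_agent in records:
        if url in payload_urls:
            counts[user_agent] += 1
    return counts
```

Matching on exact urls keeps requests from tools and portals out of the
count unless they happen to copy the current url verbatim.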
</braindump>
— Timo
On 18 Feb 2015, at 18:07, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
Do you think it's worth getting the UA distribution for CSS requests &
correlate it with the distribution for page / JS loading?
Yes, we can do that. I would need to gather a new dataset for it so I've
made a new task for it (https://phabricator.wikimedia.org/T89847),
marking this one as complete:
https://phabricator.wikimedia.org/T88560
I would also like to do some research regarding IE6/IE7, as we should see
those (according to our code:
https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup.js)
in the no-JS list, but we only see some of their user agents there. There
are definitely IE6/IE7 browsers to which we are serving javascript; I just
have to look in detail at what exactly we are serving them. Will report on
this. It looks like this startup.js file is being served to all browsers
regardless, so I might need to do some more fine-grained queries.
Just consider the 3% as your approximate upper bound for overall traffic,
big bots removed. If you just count mobile traffic, numbers in percentage
are, of course, a lot higher.
Thanks,
Nuria
On 17 Feb 2015, at 03:38, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
Gabriel:
I have run through the data and have a rough estimate of how many of our
pageviews are requested from browsers w/o strong javascript support. It is
a preliminary rough estimate but I think it is pretty useful.
TL;DR
According to our new pageview definition
(https://meta.wikimedia.org/wiki/Research:Page_view), about 10% of
pageviews come from clients w/o much javascript support. But (BIG CAVEAT)
this includes bot requests. If you remove the easy-to-spot big bots, the
percentage is <3%.
Details here (still some homework to do regarding IE6 and IE7)
https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
Thanks,
Nuria