Hi all,
since Domas' magic stuff(tm) currently gathers article traffic data very efficiently, could this system be expanded to produce a list of the user agents used to browse Wikipedia, sorted by their counts? I think this would be a very cool way to get accurate browser usage statistics, not only for Wikipedia.
Thanks, Marco
On Sun, Nov 23, 2008 at 5:29 PM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Hi all,
since Domas' magic stuff(tm) currently gathers article traffic data very efficiently, could this system be expanded to produce a list of the user agents used to browse Wikipedia, sorted by their counts? I think this would be a very cool way to get accurate browser usage statistics, not only for Wikipedia.
Can you suggest a good user-agent scrubber? Many user-agent strings have various degrees of private/semi-private data stuffed into them.
I've looked at publishing user-agent stats for the Wikimedia sites before, but realized that I don't have enough knowledge to safely canonicalize them without throwing out a ton of information. (I.e. I could break them down into IE vs Firefox vs Opera vs Safari; but if you want to know about less common user agents, I'm not quite sure what information can be safely released.)
On Sun, Nov 23, 2008 at 8:13 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
Can you suggest a good user-agent scrubber? Many user-agent strings have various degrees of private/semi-private data stuffed into them.
I've looked at publishing user-agent stats for the Wikimedia sites before, but realized that I don't have enough knowledge to safely canonicalize them without throwing out a ton of information. (I.e. I could break them down into IE vs Firefox vs Opera vs Safari; but if you want to know about less common user agents, I'm not quite sure what information can be safely released.)
PHP has a built-in function that does this:
http://www.php.net/manual/en/function.get-browser.php
It's apparently configurable using a .ini file. There are probably some fairly good ones available.
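For what it's worth, usage looks roughly like this (a minimal sketch; it assumes the browscap directive in php.ini points to a browscap.ini file, since get_browser() won't work without one):

  <?php
  // Look up the visitor's browser in the configured browscap.ini.
  $ua = $_SERVER['HTTP_USER_AGENT'];
  $info = get_browser($ua, true); // true => return an array instead of an object
  echo $info['browser'], ' ', $info['version'], ' on ', $info['platform'], "\n";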
http://browsers.garykeith.com/index.asp
The Browscap project is behind that browser detection.
However, I'd avoid using the internal PHP function. Using it means that it is up to the system administrator to keep the browscap.ini file up to date, and on shared hosting you're dependent on the hosting provider, most of whom barely ever update that file.
There is an alternate way to use the file; you can search the internet for it. It's actually just a really simple PHP function that replaces the internal one. Though I notice that there is a project on Google Code that seems to extend it with a few more features: http://code.google.com/p/phpbrowscap/
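The replacement function amounts to something like this (my own untested sketch; note that real browscap matching has to pick the most specific pattern and follow the Parent entries, which this naive version skips):

  <?php
  // Naive stand-in for get_browser(): read browscap.ini ourselves and match
  // the UA against the wildcard patterns used as section names.
  function my_get_browser($ua, $inifile = 'browscap.ini') {
      $sections = parse_ini_file($inifile, true);
      foreach ($sections as $pattern => $props) {
          // browscap patterns use * and ? wildcards; turn them into a regex
          $re = '/^' . str_replace(array('\*', '\?'), array('.*', '.'),
                                   preg_quote($pattern, '/')) . '$/i';
          if (preg_match($re, $ua)) {
              return $props; // first match wins in this simplistic version
          }
      }
      return false;
  }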
~Daniel Friesen (Dantman, Nadir-Seen-Fire)
~Profile/Portfolio: http://nadir-seen-fire.com
-The Nadir-Point Group (http://nadir-point.com)
--It's Wiki-Tools subgroup (http://wiki-tools.com)
--The ElectronicMe project (http://electronic-me.org)
-Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
--Animepedia (http://anime.wikia.com)
--Narutopedia (http://naruto.wikia.com)
Aryeh Gregor wrote:
On Sun, Nov 23, 2008 at 8:13 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
Can you suggest a good user-agent scrubber? Many user-agent strings have various degrees of private/semi-private data stuffed into them.
I've looked at publishing user-agent stats for the Wikimedia sites before, but realized that I don't have enough knowledge to safely canonicalize them without throwing out a ton of information. (I.e. I could break them down into IE vs Firefox vs Opera vs Safari; but if you want to know about less common user agents, I'm not quite sure what information can be safely released.)
PHP has a built-in function that does this:
http://www.php.net/manual/en/function.get-browser.php
It's apparently configurable using a .ini file. There are probably some fairly good ones available.
Helloes,
I think this would be a very cool way to get accurate browser usage statistics, not only for Wikipedia.
I was hesitating to work on that simply because it is not as interesting as 24/7 statistics of pageviews, though it may be interesting to see day-long snapshots with a per-project + per-country split every quarter or so. For that we'd need:
a) good UA header parser (in C...)
b) per-country (geoip) aggregator filter
c) managed short-term snapshots released every Xth period of time
a) is something I'd be really lazy to write myself
b) is on my top-list for other reasons (per-project/per-country counts may be interesting) - so I will eventually implement it (promised it to Stu ;-)
c) is a bit... complicated to work on, and _maybe_ could be done manually. Of course, it is not much data, so we can leave it running non-stop, with a daily aggregation schedule.
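To illustrate (b), the aggregation itself is just a counting pass over the log stream. A hypothetical prototype, in PHP rather than C, and assuming the PECL geoip extension is installed (not the real filter, obviously):

  <?php
  // Read "ip user-agent..." lines from stdin, count hits per country+UA.
  $counts = array();
  while (($line = fgets(STDIN)) !== false) {
      $parts = explode(' ', rtrim($line, "\n"), 2);
      if (count($parts) < 2) continue;
      list($ip, $ua) = $parts;
      $country = geoip_country_code_by_name($ip);
      if ($country === false) $country = '??';
      $key = $country . "\t" . $ua;
      $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
  }
  arsort($counts); // largest counts first
  foreach ($counts as $key => $n) echo $n, "\t", $key, "\n";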
On Tuesday 25 November 2008 17:19:13 Domas Mituzas wrote:
I think this would be a very cool way to get accurate browser usage statistics, not only for Wikipedia.
I was hesitating to work on that simply because it is not as interesting as 24/7 statistics of pageviews, though it may be interesting to see day-long snapshots with a per-project + per-country split every quarter or
Perhaps it is not as interesting, but it is extremely useful! Having accurate browser usage statistics would mean that we could know precisely which browser quirks we should support in MediaWiki and in Wikipedia's JavaScript.
On Tue, Nov 25, 2008 at 11:39 AM, Nikola Smolenski smolensk@eunet.yu wrote: [snip]
Having accurate browser usage statistics would mean that we could know precisely which browser quirks we should support in MediaWiki and in Wikipedia's JavaScript.
I expect the goal should be "work reasonably well for 99.999(pick your length)% of users". While the average counts might be interesting (e.g. that Firefox is closer to 30% than the often-claimed penetration numbers for Firefox), I don't know that precise numbers are actually very helpful for most compatibility purposes: I don't think we could ignore MSIE 5.x just because it is only 0.05% of requests from JS-enabled browsers (which it is on enwp), or that we could do something different because MSIE 6.x is only 20.59% (likewise).
2008/11/25 Gregory Maxwell gmaxwell@gmail.com:
On Tue, Nov 25, 2008 at 11:39 AM, Nikola Smolenski smolensk@eunet.yu wrote: [snip]
Having accurate browser usage statistics would mean that we could know precisely which browser quirks we should support in MediaWiki and in Wikipedia's JavaScript.
I expect the goal should be "work reasonably well for 99.999(pick your length)% of users". While the average counts might be interesting (e.g. that Firefox is closer to 30% than the often-claimed penetration numbers for Firefox), I don't know that precise numbers are actually very helpful for most compatibility purposes: I don't think we could ignore MSIE 5.x just because it is only 0.05% of requests from JS-enabled browsers (which it is on enwp), or that we could do something different because MSIE 6.x is only 20.59% (likewise).
When doing a cost-benefit analysis you need to factor in the cost. Determining whether or not it is worth ignoring MSIE 5.x requires knowing both how many people use it and how difficult it would be to cater for them. You can't draw a conclusion from only half the story.
On Tue, Nov 25, 2008 at 12:05 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
When doing a cost-benefit analysis you need to factor in the cost. Determining whether or not it is worth ignoring MSIE 5.x requires knowing both how many people use it and how difficult it would be to cater for them. You can't draw a conclusion from only half the story.
Fair enough.
Gregory Maxwell wrote:
On Tue, Nov 25, 2008 at 11:39 AM, Nikola Smolenski smolensk@eunet.yu wrote: [snip]
Having accurate browser usage statistics would mean that we could know precisely which browser quirks we should support in MediaWiki and in Wikipedia's JavaScript.
I expect the goal should be "work reasonably well for 99.999(pick your length)% of users". While the average counts might be interesting (e.g. that Firefox is closer to 30% than the often-claimed penetration numbers for Firefox), I don't know that precise numbers are actually very helpful for most compatibility purposes: I don't think we could ignore MSIE 5.x just because it is only 0.05% of requests from JS-enabled browsers (which it is on enwp), or that we could do something different because MSIE 6.x is only 20.59% (likewise).
I thought I had heard that wikibits.js (and possibly more) was already broken on IE5 and older?
Alex wrote:
I thought I had heard that wikibits.js (and possibly more) was already broken on IE5 and older?
If you know of specific breakages, let us know. We generally try to at least let things degrade gracefully.
-- brion
On Tue, Nov 25, 2008 at 12:01 PM, Gregory Maxwell gmaxwell@gmail.com wrote:
I don't think we could ignore MSIE 5.x just because it is only 0.05% of requests from JS-enabled browsers (which it is on enwp), or that we could do something different because MSIE 6.x is only 20.59% (likewise).
We already do. For instance, I deliberately ignored IE6 when adding support for filetype-based icons for external links. I used [href$=".pdf"] and so on despite knowing that IE6 didn't support attribute selectors, and declined a request to add a JavaScript-based fallback for IE6. In this case the fallback was graceful (just the regular external link icon showed up instead of the special one), so I didn't view supporting IE6 as worth it.
That kind of determination can be made much more sensibly if we know browser usage figures. (Although in this particular case, I'm fairly sure IE6 was something like 50% at the time -- I didn't care because it was a trivial issue and IE6 was obsolescent.)
On Tue, Nov 25, 2008 at 5:31 PM, Platonides Platonides@gmail.com wrote:
Some people have talked about cutting the UA to some abstraction level, but IMHO it's better to aggregate the whole header and group by browser.
So you could get something like this:
* Mozilla Firefox 2%
** Mozilla Firefox 3
** Mozilla Firefox 2
** Mozilla Firefox 1 and older
* Internet Explorer 5%
** Internet Explorer 8 (beta) - 0.5%
** Internet Explorer 7 - 2%
** Internet Explorer 6 - 2%
** Internet Explorer 5 and older - 0.5%
*** Mozilla/4.0 Windows 95 IE4.0 broken 0.0001%
....
A breakdown at the granularity of browser version/OS version should also be available, so that we can at least tell the percentage of, say, IE6 SP1 on Windows XP. (The service pack number can be significant for IE: IE bugs are sometimes fixed in service packs.)
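One caveat: the UA header itself carries very little direct service-pack information. The only marker I'm aware of is the "SV1" token that IE6 on XP SP2 adds, so a heuristic check (a hypothetical sketch, and only a best guess) can't get much finer-grained than this:

  <?php
  // IE6 on Windows XP SP2 announces itself with the "SV1" token;
  // Windows NT 5.1 is XP, and no SV1 there usually means pre-SP2.
  function is_ie6_on_xp_sp2($ua) {
      return strpos($ua, 'MSIE 6.0') !== false
          && strpos($ua, 'Windows NT 5.1') !== false
          && strpos($ua, 'SV1') !== false;
  }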
Domas Mituzas wrote:
Helloes,
I think this would be a very cool way to get accurate browser usage statistics, not only for Wikipedia.
I was hesitating to work on that simply because it is not as interesting as 24/7 statistics of pageviews, though it may be interesting to see day-long snapshots with a per-project + per-country split every quarter or so. For that we'd need:
a) good UA header parser (in C...)
What should it parse? Some people have talked about cutting the UA to some abstraction level, but IMHO it's better to aggregate the whole header and group by browser.
So you could get something like this:
* Mozilla Firefox 2%
** Mozilla Firefox 3
** Mozilla Firefox 2
** Mozilla Firefox 1 and older
* Internet Explorer 5%
** Internet Explorer 8 (beta) - 0.5%
** Internet Explorer 7 - 2%
** Internet Explorer 6 - 2%
** Internet Explorer 5 and older - 0.5%
*** Mozilla/4.0 Windows 95 IE4.0 broken 0.0001%
....
Getting hits down to that detail will allow us to check that the filters are right. And how many different UA headers might we get? 50, 80, 100? That's perfectly acceptable.
Yes, there is information in the User-Agent headers which shouldn't be there; most notably, IE wants to announce your OS, service pack, and even several .NET versions to everyone. But when they're aggregated, just knowing that 0.01% of the hits (not even users!) came from Windows 3.1 isn't really breaking Foo's privacy. It might, if you included your name and address in your User-Agent, but since all the sites you browse learn about it anyway, you'd have bigger problems than Wikimedia finding out. I can think of one case where you might get in trouble: your boss finding the company-customized UA when employees weren't supposed to visit Wiktionary. But he could just as well install a proxy or sniff the traffic.
Plus, all this data is also useful in other ways (e.g. aggregations by OS; I'm sure we would get some surprises) and can itself be used as a source for subsequent studies.
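To make the grouping concrete, a classifier along these lines would do for the coarse buckets (just a sketch of mine, not a finished filter; the order of the tests matters, since Opera and IE both masquerade as Mozilla):

  <?php
  // Map a raw User-Agent header to a coarse family bucket like the tree above.
  function classify_ua($ua) {
      if (preg_match('/Opera[\/ ](\d+)/', $ua, $m))  return "Opera {$m[1]}";
      if (preg_match('/MSIE (\d+)/', $ua, $m))       return "Internet Explorer {$m[1]}";
      if (preg_match('/Firefox\/(\d+)/', $ua, $m))   return "Mozilla Firefox {$m[1]}";
      if (strpos($ua, 'Safari') !== false)           return "Safari";
      if (strpos($ua, 'Mozilla') !== false)          return "Other Mozilla";
      return "Other";
  }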
On Tue, Nov 25, 2008 at 11:31 PM, Platonides Platonides@gmail.com wrote:
What should it parse? Some people have talked about cutting the UA to some abstraction level, but IMHO it's better to aggregate the whole header and group by browser. Getting hits down to that detail will allow us to check that the filters are right. And how many different UA headers might we get? 50, 80, 100? That's perfectly acceptable.
I'd basically expect around 300 different UAs, but that shouldn't be a major problem to handle, I think.
Marco
I'd basically expect around 300 different UAs, but that shouldn't be a major problem to handle, I think.
don't we handle millions of page names already? :-)
On Tue, Nov 25, 2008 at 5:31 PM, Platonides Platonides@gmail.com wrote: [snip]
Getting hits down to that detail will allow us to check that the filters are right. And how many different UA headers might we get? 50, 80, 100? That's perfectly acceptable.
On Tue, Nov 25, 2008 at 5:52 PM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
I'd basically expect around 300 different UAs, but that shouldn't be a major problem to handle, I think.
Counting only a 1:100 sample of JS-executing browsers hitting enwp, there were 78,033 unique user-agent strings yesterday.
Really. Making a manual mapping will not work.
This is due to all the weird crap that gets thrown into the strings, which takes me back to my original post.
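For a flavour of what a scrubber would have to do, here is a first stab (my own hypothetical and certainly incomplete sketch): strip the add-on tokens that make strings near-unique before counting anything.

  <?php
  // Remove tokens (.NET CLR versions, toolbars, etc.) that inflate the
  // number of distinct strings without identifying the browser itself.
  function scrub_ua($ua) {
      $ua = preg_replace('/;?\s*\.NET CLR [\d.]+/', '', $ua);
      $ua = preg_replace('/;?\s*(FunWebProducts|InfoPath\.\d)/', '', $ua);
      return trim($ua);
  }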
Gregory Maxwell wrote:
On Tue, Nov 25, 2008 at 5:31 PM, Platonides wrote: [snip]
Getting hits down to that detail will allow us to check that the filters are right. And how many different UA headers might we get? 50, 80, 100? That's perfectly acceptable.
On Tue, Nov 25, 2008 at 5:52 PM, Marco Schuster wrote:
I'd basically expect around 300 different UAs, but that shouldn't be a major problem to handle, I think.
Counting only a 1:100 sample of JS-executing browsers hitting enwp, there were 78,033 unique user-agent strings yesterday.
This is due to all the weird crap that gets thrown into the strings, which takes me back to my original post.
Sometimes to the point of being almost unique to some machines: http://meta.wikimedia.org/w/index.php?title=Vandalism_reports&diff=prev&...
Really. Making a manual mapping will not work.
Not necessarily manual, but I thought it was a small enough number to abstract and review easily.
Could you share the list of headers?
On Wed, Nov 26, 2008 at 6:08 AM, Platonides Platonides@gmail.com wrote:
Gregory Maxwell wrote:
On Tue, Nov 25, 2008 at 5:31 PM, Platonides wrote: [snip]
Getting hits down to that detail will allow us to check that the filters are right. And how many different UA headers might we get? 50, 80, 100? That's perfectly acceptable.
On Tue, Nov 25, 2008 at 5:52 PM, Marco Schuster wrote:
I'd basically expect around 300 different UAs, but that shouldn't be a major problem to handle, I think.
Counting only a 1:100 sample of JS-executing browsers hitting enwp, there were 78,033 unique user-agent strings yesterday.
This is due to all the weird crap that gets thrown into the strings, which takes me back to my original post.
Sometimes to the point of being almost unique to some machines: http://meta.wikimedia.org/w/index.php?title=Vandalism_reports&diff=prev&...
Really. Making a manual mapping will not work.
Not necessarily manual, but I thought it was a small enough number to abstract and review easily.
Could you share the list of headers?
If you'd like to try writing a scrubber, I'd be glad to run it and give you feedback. If you need some examples of weird agents, I can make some available for you.