Domas Mituzas wrote:
Helloes,
I think this would be a very cool way to have accurate statistics about browser usage, not only on Wikipedia.
I was hesitating to work on that simply because it is not interesting as 24/7 statistics of pageviews, though it may be interesting to see day-long snapshots with a per-project+per-country split every quarter or so. For that we'd need:
a) good UA header parser (in C...)
What should it parse? Some people have talked about truncating the UA to some level of abstraction, but IMHO it's better to aggregate on the whole header and then group by browser.
So you could get something like this:
*Mozilla Firefox 2%
**Mozilla Firefox 3
**Mozilla Firefox 2
**Mozilla Firefox 1 and older
*Internet Explorer 5%
**Internet Explorer 8 (beta) - 0.5%
**Internet Explorer 7 - 2%
**Internet Explorer 6 - 2%
**Internet Explorer 5 and older - 0.5%
***Mozilla/4.0 Windows 95 IE4.0 broken 0.0001%
....
Keeping the hits down to this level of detail lets us check that the filters are right. And how many different UA headers might we get? 50, 80, 100? That's perfectly acceptable.
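To make the idea concrete, here is a rough sketch of counting whole UA headers first and grouping them by browser afterwards. It is in Python only for brevity (the real parser would presumably be in C), and the names RULES, classify, aggregate and report, as well as the two regular expressions, are my own assumptions for illustration, not an existing parser or a complete rule set.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rules, checked in order. A real parser would need many more,
# and would have to handle browsers that imitate each other's UA strings.
RULES = [
    ("Internet Explorer", re.compile(r"MSIE (\d+)")),
    ("Mozilla Firefox", re.compile(r"Firefox/(\d+)")),
]


def classify(ua):
    """Return (family, major version) for a raw User-Agent header."""
    for family, pattern in RULES:
        match = pattern.search(ua)
        if match:
            return family, match.group(1)
    return "Other", "unknown"


def aggregate(user_agents):
    """Count hits per whole header, then nest them as family -> version -> header."""
    raw = Counter(user_agents)
    tree = defaultdict(lambda: defaultdict(Counter))
    for ua, hits in raw.items():
        family, version = classify(ua)
        tree[family][version][ua] += hits
    return tree


def report(tree):
    """Render the nested counts as a wiki-style list with percentages of all hits."""
    total = sum(
        hits
        for versions in tree.values()
        for uas in versions.values()
        for hits in uas.values()
    )
    lines = []
    for family in sorted(tree):
        versions = tree[family]
        family_hits = sum(sum(uas.values()) for uas in versions.values())
        lines.append(f"*{family} {100 * family_hits / total:.1f}%")
        for version in sorted(versions):
            uas = versions[version]
            version_hits = sum(uas.values())
            lines.append(f"**{family} {version} {100 * version_hits / total:.1f}%")
            # The raw headers stay visible so a misclassified UA is easy to spot.
            for ua, hits in uas.most_common():
                lines.append(f"***{ua} {100 * hits / total:.1f}%")
    return lines
```

Because the raw headers are kept as the leaves of the tree, a header that lands under the wrong browser shows up directly in the report, which is the whole point of not truncating the UA beforehand.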
Yes, there is information in the User-Agents which shouldn't be there; most notably, IE wants to announce your OS, service pack, and even several .NET versions everywhere. But when they're aggregated, just knowing that 0.01% of the hits (not even users!) came from Windows 3.1 isn't really breaking Foo's privacy. It might, if you included your name and address in your User-Agent, but since every site you browse learns about it, you then have bigger problems than Wikimedia finding it. I can think of one case where you may get in trouble: your boss finding the company-customized UA, when employees weren't supposed to visit wiktionary. But he could just as well install a proxy or sniff the traffic.
Plus, all this data is also useful in other ways (e.g. aggregations by OS; I'm sure we would get some surprises) and can itself be used as a source for subsequent studies.