Domas Mituzas wrote:
Helloes,
I think this would be a very cool way to have accurate statistics
about browser usage, not only on Wikipedia.
I was hesitating to work on that simply because it is not interesting
as 24/7 statistics of pageviews, though it may be interesting to see
day-long snapshots with a per-project+per-country split every quarter or
so.
For that we'd need:
a) good UA header parser (in C...)
What should it parse?
Some people have talked about cutting the UA to some abstraction level
but IMHO it's better to aggregate the whole header and group by browser.
So you could get something like this:
*Mozilla Firefox 2%
**Mozilla Firefox 3
**Mozilla Firefox 2
**Mozilla Firefox 1 and older
*Internet Explorer 5%
**Internet Explorer 8 (beta) - 0.5%
**Internet Explorer 7 - 2%
**Internet Explorer 6 - 2%
**Internet Explorer 5 and older - 0.5%
***Mozilla/4.0 Windows 95 IE4.0 broken 0.0001%
....
Keeping the hits at this level of detail lets us check that the filters
are right. And how many different UA headers might we get? 50, 80, 100?
That's perfectly acceptable.
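
A minimal sketch of that aggregation, in Python for brevity rather than
the C mentioned above. The RULES table, the function names and the
"Other" bucket are my own assumptions for illustration, not an existing
parser: it counts hits per full header, groups them by browser and major
version, and keeps the raw header as the third level so a misfiled one
is easy to spot.

```python
import re
from collections import Counter, defaultdict

# Hypothetical classification rules, checked in order: (family, version regex).
# These patterns are illustrative assumptions, not a complete UA parser.
RULES = [
    ("Internet Explorer", re.compile(r"MSIE (\d+)")),
    ("Mozilla Firefox", re.compile(r"Firefox/(\d+)")),
]


def classify(ua):
    """Return (family, major_version) for a raw User-Agent string."""
    for family, pattern in RULES:
        match = pattern.search(ua)
        if match:
            return family, match.group(1)
    return "Other", "unknown"


def aggregate(user_agents):
    """Count hits per full header, then group those counts by browser."""
    raw_counts = Counter(user_agents)  # one entry per distinct full header
    tree = defaultdict(lambda: defaultdict(Counter))
    for ua, hits in raw_counts.items():
        family, version = classify(ua)
        tree[family][version][ua] += hits
    return tree


def report(tree):
    """Render the tree as a wiki-style nested list with hit percentages."""
    total = sum(h for vers in tree.values()
                for uas in vers.values() for h in uas.values())
    lines = []
    for family in sorted(tree):
        fam_hits = sum(h for uas in tree[family].values() for h in uas.values())
        lines.append(f"*{family} - {100 * fam_hits / total:.1f}%")
        for version in sorted(tree[family]):  # sorted as strings
            uas = tree[family][version]
            ver_hits = sum(uas.values())
            lines.append(f"**{family} {version} - {100 * ver_hits / total:.1f}%")
            for ua, hits in uas.most_common():
                lines.append(f"***{ua} - {100 * hits / total:.1f}%")
    return lines
```

Calling report(aggregate(headers)) on a list of raw headers gives the
same kind of nested list as above. Anything that lands under "Other"
with a noticeable share is a hint that a rule is missing or wrong.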
Yes, there is information in User-Agent headers which shouldn't be
there. Most notably, IE wants to announce your OS, service pack, and
even several .NET versions everywhere you go.
But when they're aggregated, just knowing that 0.01% of the hits (not
even users!) came from Windows 3.1 isn't really breaking Foo's privacy.
It might if you had put your name and address in your User-Agent, but
since every site you browse learns it too, you have bigger problems
than Wikimedia finding out.
I can think of one case where you may get in trouble: your boss finding
the company-customized UA, when employees weren't supposed to visit
Wiktionary. But he could just as well install a proxy or sniff the
traffic.
Plus all this data is also useful in other ways (e.g. aggregations by
OS, I'm sure we would get some surprises) and can itself be used as a
source for subsequent studies.
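
As a rough illustration of that reuse: once hit counts are stored per
full header, a breakdown by OS is just a second pass over the same
counts. The token table below is an assumption for the example (real
headers vary a lot more), and os_of / by_os are hypothetical names.

```python
from collections import Counter

# Hypothetical substring -> OS label table, checked in order.
# The tokens are assumptions for illustration; real headers vary widely.
OS_TOKENS = [
    ("Windows NT 6.0", "Windows Vista"),
    ("Windows NT 5.1", "Windows XP"),
    ("Windows 95", "Windows 95"),
    ("Mac OS X", "Mac OS X"),
    ("Linux", "Linux"),
]


def os_of(ua):
    """Return an OS label for a raw User-Agent string."""
    for token, label in OS_TOKENS:
        if token in ua:
            return label
    return "Other"


def by_os(raw_counts):
    """Re-aggregate per-header hit counts into per-OS hit counts."""
    totals = Counter()
    for ua, hits in raw_counts.items():
        totals[os_of(ua)] += hits
    return totals
```

The input is a mapping from full header to hit count, so no new pass
over the request logs is needed for this or any later study.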