tl;dr. Great question! It's gonna be complicated and/or impossible with the
data we currently collect.
I'm looking for a method to determine the parameters of the distribution of
page views per visit. I would also love to know the distribution for the
length of time between visits. Does anyone know of any studies already done
on this topic? Google is not my friend today -- I haven't yet found
anything.
So, I can tell you my understanding of how Google Analytics does this,
which might be helpful for rigging something together in the future, but
I'll tell you up front that I'm pretty sure we don't presently have the
data to answer your questions.
For views per visit and time between visits, you need a few pieces of
machinery in place.
First, you need to tag users uniquely. In practice, this means setting a
cookie.
Second, you need to tag sessions. This can be done in post-processing, or
at the time of the request by setting an mtime (last-touch-time) cookie.
Essentially, a session is a group of views bounded by some length of
inactivity. So let's say our bound is 15 minutes. If I see hits from a single
user at 12:00p, 12:10p, 12:11p, 12:20p, 12:40p, 12:50p, and 1:01p, I have two
sessions (12:00p - 12:20p, and 12:40p - 1:01p). The hit at 12:20p doesn't have
another touch until 12:40p, which is outside our touch bound.
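To make the post-processing version concrete, here's a rough Python sketch of the sessionizing step, run on the example hits above. The function names are made up for illustration, and two choices are my assumptions: a gap exactly equal to the bound stays in the same session, and "time between visits" is measured from the last hit of one session to the first hit of the next.

```python
from datetime import datetime, timedelta

def sessionize(hits, bound=timedelta(minutes=15)):
    """Split one user's hit timestamps into sessions.

    A new session starts whenever the gap since the previous hit
    exceeds `bound` (assumption: a gap equal to the bound does not split).
    """
    sessions = []
    for ts in sorted(hits):
        if sessions and ts - sessions[-1][-1] <= bound:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

def views_per_visit(sessions):
    """Number of hits in each session."""
    return [len(s) for s in sessions]

def time_between_visits(sessions):
    """Gap from the last hit of each session to the first hit of the next."""
    return [b[0] - a[-1] for a, b in zip(sessions, sessions[1:])]

def at(hour, minute):
    # Arbitrary date; only the time of day matters for this example.
    return datetime(2000, 1, 1, hour, minute)

hits = [at(12, 0), at(12, 10), at(12, 11), at(12, 20),
        at(12, 40), at(12, 50), at(13, 1)]
sessions = sessionize(hits)
print(views_per_visit(sessions))
print(time_between_visits(sessions))
```

In real use you'd run this per user token, which is exactly the piece we're missing.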
This is enough to do views per visit and time between visits. Time on page
would require an additional heartbeat beacon from the page post-load to
confirm the user isn't idle, and the page isn't in the background. (If you
watch the web console on a page with GA enabled, you'll see these go out
periodically.)
Our problem is that we don't have ID tokens, meaning we can't do anything
involving uniques. We can approximate them using a hash of UA and IP, plus
some fiddling, but we sample 1:1000 on *views*, so the proportion of
views that belong to uniques is ultimately unknown, and I'd consider
estimates to be highly unreliable.
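For illustration only, here's roughly what the hash heuristic and the sampling problem look like in Python. Both function names are hypothetical, the "fiddling" is left out entirely, and the sampling calculation assumes each view is kept independently with probability 1/1000, which may not match how our sampling actually works.

```python
import hashlib

def approx_visitor_id(ip: str, user_agent: str) -> str:
    """Crude stand-in for a unique-visitor token: a hash of IP and UA.

    Users behind a shared IP with the same browser will collide, and one
    user on a changing IP will split into several "visitors".
    """
    raw = f"{ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

def p_two_or_more_sampled(n_views: int, rate: float = 1 / 1000) -> float:
    """Chance that at least two views from one visit survive sampling.

    Assumes independent per-view sampling, so the number of surviving
    views is Binomial(n_views, rate).
    """
    p_none = (1 - rate) ** n_views
    p_one = n_views * rate * (1 - rate) ** (n_views - 1)
    return 1 - p_none - p_one

print(approx_visitor_id("192.0.2.1", "Mozilla/5.0 (example)"))
print(p_two_or_more_sampled(10))
```

Under that independence assumption, a 10-view visit leaves two or more sampled views only a few times in 100,000, so the sampled logs almost never contain enough of a single visit to reconstruct it.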
If the data doesn't exist, the best I think I can have is the average
number of page views per visit. I have a problem though: the comScore
numbers available at http://reportcard.wmflabs.org/ are broken out by
region, not by site. Using this data I'll only be able to get the average
for all our properties worldwide -- which is a little bit rough. Does
anyone have access to the raw data? If so -- does it tell us the number of
uniques per site, or is it really only by region?
Several things about comScore data. First, it's always in aggregate. This
means that even if we had per-site numbers, you'd only get averages (as you
note). Second, we don't have most of the breakouts. (I'll stop by your
desk and we can explore what's there, but sadly getting access is a
laborious process, otherwise I'd happily just give you creds.) Third, I
don't think they actually offer much in the way of visits. But we can look.
Does anyone have any better ideas?
No. Not good ideas, anyway. If you have unsampled raw logs for the subset
of views you're interested in, a heuristic for uniques is totally
reasonable. We only have sampled data (afaik), so that route is closed.
-- Context --
I'm trying to model some fundraising data to solve the optimal banner
distribution problem (effectively what's the best way to show people
banners). Our data on the 'number of banner impressions till donation'
indicates that people are far more likely to donate on the first banner
impression. However, this decays over time. My hypothesis is that
there is no difference between showing a user only one banner per visit
over multiple visits and showing multiple banners 100% of the time.
If this hypothesis is true, it will lead into fundraising developing a
banner display function that will solve the following problem statement:
"show P percent of all unique visitors, under time T, N banners with M
banners displayed per session".
--
David Schoonover
dsc(a)wikimedia.org