Timwi wrote:
An approximate and inaccurate pageview counter is
better than none.
Currently, we have none, and the only reason I've heard is that the fact
that we use squids makes it impossible to count all pageviews reliably.
Surely it would be useful to have a counter that does not aim to be
precise, but at least gives an approximation.
Would it be possible to re-enable the pageview counter and rephrase the
message so that it says something like, "This page was viewed
approximately <x> times."? The <x> is a value calculated from the
non-squid pageview count. Perhaps with an "(explain)" link next to it
that sends the user to a Help page that explains why the counter is
approximate and how the estimate was calculated.
Do we have any statistics that allow us to estimate an approximate
number of pageviews given the number of pageviews that came through to
the servers?
Timwi
Why not use a daemon to scan the logs on each squid, and then only
increment the database read count for (say) one in a hundred, or one in
a thousand, of these hits, chosen (pseudo-)randomly? The increment would
be 100 if only 1 in 100 hits is logged, and so on. Over time, the law
of large numbers will make this average out to reasonably
representative hit counts.
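As a quick sanity check of the averaging claim, here is a minimal simulation sketch. The function name sampled_count and the hit totals are made up for illustration, and no real squid data is involved. It counts roughly 1 in n simulated hits, adds n for each one counted, and compares the result with the true total.

```python
import random


def sampled_count(true_hits, n, rng):
    """Estimate a hit count by logging roughly 1 in n hits, each weighted by n."""
    counter = 0
    for _ in range(true_hits):
        if rng.randint(1, n) == 1:  # this hit is one of the sampled ones
            counter += n            # so it stands in for n real hits
    return counter


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for true_hits in (1_000, 100_000, 1_000_000):
        estimate = sampled_count(true_hits, 100, rng)
        error = abs(estimate - true_hits) / true_hits
        print(f"true={true_hits:>9}  estimate={estimate:>9}  relative error={error:.1%}")
```

The relative error shrinks roughly like sqrt(n / true_hits), so busy pages get a good estimate while rarely viewed pages stay noisy (many will simply show zero).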
Something as simple as the following, run at intervals from a cron job
and combined with logrotate, should do it. Since it's mostly
CPU-intensive, the analysis job can be niced down to avoid interfering
with the main task of squid processing.
If this would not be practical, please let me know why.
# Note: pythonesque pseudocode, not a real program
import random, time, re  # and SQL stuff...

N = 1000  # or some other suitable value...
conn = make_sql_connection()

def log_article_hits(url, count):
    time.sleep(5)  # to avoid hitting the DB too hard...
    # stub: should parse URL using regexp for project/language/article title,
    # then call SQL code that increments the article read counter by "count"
    return

for line in open("/var/log/squid/access.log"):  # or whatever...
    fields = line.split()
    if len(fields) < 7:
        continue
    url = fields[6]
    if random.randint(1, N) == 1:
        log_article_hits(url, N)
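
The URL-parsing stub could be filled in along these lines. This is only a sketch: it assumes article URLs look like http://<language>.<project>.org/wiki/<title>, which may not match every URL in the squid logs, and the names ARTICLE_URL and parse_article_url are made up here.

```python
import re
from urllib.parse import unquote

# Assumed URL shape: http://<language>.<project>.org/wiki/<title>
ARTICLE_URL = re.compile(
    r"^https?://(?P<language>[a-z-]+)\.(?P<project>[a-z]+)\.org/wiki/(?P<title>[^?#]+)"
)


def parse_article_url(url):
    """Return (project, language, title), or None if the URL is not an article view."""
    match = ARTICLE_URL.match(url)
    if match is None:
        return None
    # Undo percent-encoding and turn underscores back into spaces.
    title = unquote(match.group("title")).replace("_", " ")
    return match.group("project"), match.group("language"), title
```

Anything that returns None (images, stylesheets, and so on) would just be skipped rather than counted.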
-- Neil