Timwi wrote:
An approximate and inaccurate pageview counter is better than none. Currently, we have none, and the only reason I've heard is that the fact that we use squids makes it impossible to count all pageviews reliably. Surely it would be useful to have a counter that does not aim to be precise, but at least give an approximation.
Would it be possible to re-enable the pageview counter and rephrase the message so that it says something like, "This page was viewed approximately <x> times."? The <x> is a value calculated from the non-squid pageview count. Perhaps with an "(explain)" link next to it that sends the user to a Help page that explains why the counter is approximate and how the estimate was calculated.
Do we have any statistics that allow us to estimate an approximate number of pageviews given the number of pageviews that came through to the servers?
Timwi
Why not use a daemon to scan the logs on each squid, and increment the database read count for only (say) one in a hundred, or one in a thousand, of these hits, chosen (pseudo-)randomly? The increment would be 100 if only 1 in 100 hits is logged, and so on. Over time, the law of large numbers will ensure that this averages out to reasonably representative hit counts.
Something as simple as the following, run at intervals from a cron job and combined with logrotate, should do it. Since the work is mostly CPU-intensive, the analysis job can be niced down to avoid interfering with the squids' main task of serving requests.
If this would not be practical, please let me know why.
# Note: Pythonesque pseudocode, not a real program
import random, time, re  # and SQL stuff...

N = 1000  # or some other suitable value...

conn = make_sql_connection()

def log_article_hits(url, count):
    time.sleep(5)  # to avoid hitting the DB too hard...
    # stub: should parse URL using regexp for project/language/article title,
    # then call SQL code that increments the article read counter by "count"
    return

for line in open("/var/log/squid/access.log"):  # or whatever...
    fields = line.split()
    if len(fields) < 7:
        continue  # skip malformed or truncated lines
    url = fields[6]
    if random.randint(1, N) == 1:
        log_article_hits(url, N)
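
For anyone who wants to try the idea without a database, here is a slightly fuller sketch that keeps the estimates in memory. It is only an illustration under some assumptions of mine: the URL layout (http://<language>.<project>.org/wiki/<title>), the regular expression, and the names parse_article_url and sample_log are all hypothetical, and the log line in the usage part is made up in squid's native format, where the URL is the seventh field.

```python
import random
import re
from collections import Counter

# Assumed URL layout (hypothetical): http://<language>.<project>.org/wiki/<title>
ARTICLE_URL = re.compile(r"^https?://([a-z\-]+)\.([a-z]+)\.org/wiki/([^?#]+)")

def parse_article_url(url):
    """Return (project, language, title), or None if the URL is not an article view."""
    m = ARTICLE_URL.match(url)
    if m is None:
        return None
    language, project, title = m.groups()
    return project, language, title

def sample_log(lines, n, rng=random):
    """Estimate per-article hits by counting about 1 in n log lines, each weighted by n."""
    estimates = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # malformed or truncated line
        if rng.randrange(n) != 0:
            continue  # skip roughly (n - 1) out of every n hits
        article = parse_article_url(fields[6])
        if article is not None:
            estimates[article] += n  # one sampled hit stands in for n real ones
    return estimates

if __name__ == "__main__":
    # Made-up log line in squid's native access.log format.
    line = ("1100000000.000 12 10.0.0.1 TCP_HIT/200 5120 GET "
            "http://en.wikipedia.org/wiki/Main_Page - NONE/- text/html")
    rng = random.Random(42)  # seeded only so that runs are repeatable
    print(sample_log([line] * 500_000, 100, rng))
```

With n = 1 every hit is counted exactly. With larger n the estimate for a busy page should land close to the true count, while rarely viewed pages will be noisy or missing altogether, which seems an acceptable trade for a counter that is labelled approximate.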
-- Neil