An approximate and inaccurate pageview counter is better than none. Currently we have none, and the only reason I've heard is that our use of squids makes it impossible to count all pageviews reliably. Surely it would be useful to have a counter that does not aim to be precise, but at least gives an approximation.
Would it be possible to re-enable the pageview counter and rephrase the message so that it says something like, "This page was viewed approximately <x> times."? The <x> is a value calculated from the non-squid pageview count. Perhaps with an "(explain)" link next to it that sends the user to a Help page that explains why the counter is approximate and how the estimate was calculated.
Do we have any statistics that allow us to estimate an approximate number of pageviews given the number of pageviews that came through to the servers?
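For instance (with a purely made-up figure): if the squids served, say, 75% of article requests from cache, the displayed number could simply be the backend count divided by (1 - 0.75), i.e. scaled up by a factor of four. The real cache hit ratio would have to come from whoever runs the squids.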
Timwi
Timwi wrote:
An approximate and inaccurate pageview counter is better than none. Currently we have none, and the only reason I've heard is that our use of squids makes it impossible to count all pageviews reliably. Surely it would be useful to have a counter that does not aim to be precise, but at least gives an approximation.
Would it be possible to re-enable the pageview counter and rephrase the message so that it says something like, "This page was viewed approximately <x> times."? The <x> is a value calculated from the non-squid pageview count. Perhaps with an "(explain)" link next to it that sends the user to a Help page that explains why the counter is approximate and how the estimate was calculated.
Do we have any statistics that allow us to estimate an approximate number of pageviews given the number of pageviews that came through to the servers?
Timwi
Why not use a daemon to scan the logs on each squid, and then only increment the database read count for (say) one in a hundred, or one in a thousand, of these hits, chosen (pseudo)randomly, with the increment being 100 if only 1 in 100 hits is logged, and so on. Over time, the law of large numbers will make this average out to reasonably representative hit counts.
Something as simple as the following run at intervals from a cron job should do it, combined with logrotate. Since it's mostly CPU-intensive, the analysis job can be niced down to avoid interfering with the main task of squid processing.
If this would not be practical, please let me know why.
# Note: pythonesque pseudocode, not a real program
import random, time, re  # and SQL stuff...

N = 1000  # or some other suitable value...

conn = make_sql_connection()

def log_article_hits(url, count):
    time.sleep(5)  # to avoid hitting the DB too hard...
    # stub: should parse URL using regexp for project/language/article title,
    # then call SQL code that increments the article read counter by "count"
    return

for line in open("/var/log/squid/access.log"):  # or whatever...
    fields = line.split()
    if len(fields) < 7:
        continue
    url = fields[6]
    if random.randint(1, N) == 1:
        log_article_hits(url, N)
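Here, as a rough sketch, is the sort of URL parsing the stub above hand-waves over; the hostname/path pattern and the parse_article_url name are just assumptions for illustration, and the exact URL format in the squid logs would need checking:

import re

# hypothetical URL shape: http://<lang>.<project>.org/wiki/<title>
ARTICLE_URL_RE = re.compile(r"^https?://([^./]+)\.([^./]+)\.org/wiki/(.+)$")

def parse_article_url(url):
    # Returns (language, project, article title), or None if the URL
    # does not look like an article view.
    m = ARTICLE_URL_RE.match(url)
    if m is None:
        return None
    return m.groups()

# e.g. parse_article_url("http://en.wikipedia.org/wiki/Squid")
#      -> ("en", "wikipedia", "Squid")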
-- Neil
On 7/5/06, Neil Harris neil@tonal.clara.co.uk wrote:
Why not use a daemon to scan the logs on each squid, and then only
Are these logs actually kept?
Steve
On 05/07/06, Neil Harris neil@tonal.clara.co.uk wrote:
If this would not be practical, please let me know why.
The fact that we don't retain the logs is perhaps a wee bit prohibitive.
Rob Church
On 7/5/06, Rob Church robchur@gmail.com wrote:
On 05/07/06, Neil Harris neil@tonal.clara.co.uk wrote:
If this would not be practical, please let me know why.
The fact that we don't retain the logs is perhaps a wee bit prohibitive.
Since so many people seem to find the idea of log analysis attractive, would it be possible to have a bit more information about:
- which logs *could* be kept with a simple flick of a switch
- why those are not kept
- whether they could be turned on for brief periods (like 24 hours) to allow periodic data collection
- what alternative solutions might exist
It seems like there are at least three different places where log data could be collected:
- on the mysql database - probably very "expensive"
- on mediawiki (ie, in php code) - probably much more attractive, could add tuning to only record every 10th or 100th hit or whatever
- on the "squids" (presumably, proxy servers) - no idea
- some other web mechanism, like tomcat? or whatever web server is running (or is that squid??)
Is there absolutely no way that data could be collected at any of these points, even for short periods, and even filtered?
Steve
On 05/07/06, Steve Bennett stevage@gmail.com wrote:
- why those are not kept
Last I heard, disk space. Logging each of ~11-12 thousand hits per second => full disk.
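(Back-of-the-envelope, and the bytes-per-line figure is only an assumption: at ~150 bytes per access-log line, 12,000 hits/s comes to about 1.8 MB/s, or roughly 150 GB of raw logs per day across the cluster.)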
- whether they could be turned on for brief periods (like 24 hours) to
allow periodic data collection
Possible, see below...
- what alternative solutions might exist
It seems like there are at least three different places where log data could be collected:
- on the mysql database - probably very "expensive"
Forget it.
- on mediawiki (ie, in php code) - probably much more attractive,
could add tuning to only record every 10th or 100th hit or whatever
You can't "store" stuff in PHP; it would have to log to the filesystem or elsewhere anyway.
- on the "squids" (presumably, proxy servers) - no idea
Squid in this case refers to the Squid web caching software. "The Squids" is our semi-affectionate name for the bundles of caching proxies that stop millions of queries from killing the rest of our cluster.
Is there absolutely no way that data could be collected at any of these points, even for short periods, and even filtered?
It's been thrown about before a lot, and a lot of "perhaps" is said, but not a lot of work is done. Periodic statistics collection could mean the sample is not quite consistent, but...meh.
There are lots of people stating it can be done, but not a lot of them doing it.
Rob Church
Rob Church wrote:
On 05/07/06, Neil Harris neil@tonal.clara.co.uk wrote:
If this would not be practical, please let me know why.
The fact that we don't retain the logs is perhaps a wee bit prohibitive.
Rob Church
Aha. Then how about adding a patch to squid that goes something like:
if (do_throttled_logging && ((throttled_logging_count++ % throttled_logging_ratio) == 0)) generate_normal_squid_logfile_line(...);
that would then generate one logfile line for every throttled_logging_ratio hits? If this ratio is 1000, then logs one-thousandth of the normal size will be generated, at one-thousandth of the load that might be expected from full-on logging.
Assuming N=1000 and 100 hits/second/squid, that would be one log entry per ten seconds on each machine, or roughly 8,600 log entries per day per machine, which would take up perhaps 1 Mbyte: rather different from the Gbyte or so of logfile which would be generated daily if full logging were turned on.
The logs can then be aggregated and analyzed in the usual ways.
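A minimal sketch of what that aggregation could look like, assuming the sampled logs keep the usual squid access-log layout (URL in the seventh field) and the same sampling ratio N as in the patch above:

import sys
from collections import Counter

N = 1000  # must match the sampling ratio used on the squids

counts = Counter()
for path in sys.argv[1:]:      # sampled log files gathered from each squid
    for line in open(path):
        fields = line.split()
        if len(fields) < 7:
            continue
        url = fields[6]
        counts[url] += N       # each sampled line stands in for ~N real hits

# print the estimated top 20 pages
for url, estimate in counts.most_common(20):
    print("%s\t%d" % (url, estimate))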
-- Neil
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia.
See
http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
http://tools.wikimedia.de/~edwardspec/cgi-bin/top100_ng.cgi
-- Alexander Sigachov
Александр Сигачёв wrote:
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia. http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
That's clever. I can think of a couple of minor improvements, but the obvious one, if this was implemented on the English Wikipedia, would be to only insert the link randomly on, say, one page in 100.
Of course, the sample will be biased since the trick only works with Javascript, but for many statistical purposes that is acceptable.
We are doing something similar: we've created a special page that takes an article ID and increments the counter for it. We put in a 0 px image that references it, like this:
<img src="/imagecounter.gif?id=29204" height=0 width=0>
And use a rewrite rule in apache:
RewriteRule ^/imagecounter.gif(.*)$ /index.php/Special:Imagecounter$1 [L,PT]
You have to disable the code that increments the count in Article.php, though. SpecialImagecounter.php just looks like:
function wfSpecialImagecounter( $par ) {
    global $wgRequest, $wgSitename, $wgLanguageCode;
    global $wgDeferredUpdateList;
    $fname = "wfSpecialImagecounter";
    $id = $wgRequest->getVal("id");
    $t = Title::newFromID($id);
    if ($t != null)
        Article::incViewCount( $t->getArticleID() );
    $u = new SiteStatsUpdate( 1, 0, 0 );
    array_push( $wgDeferredUpdateList, $u );
    header('Content-type: image/gif');
    exit;
}
On 7/5/06, Ilmari Karonen nospam@vyznev.net wrote:
Александр Сигачёв wrote:
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia. http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
That's clever. I can think of a couple of minor improvements, but the obvious one, if this was implemented on the English Wikipedia, would be to only insert the link randomly on, say, one page in 100.
Of course, the sample will be biased since the trick only works with Javascript, but for many statistical purposes that is acceptable.
-- Ilmari Karonen
Александр Сигачёв wrote:
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia.
See
http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
http://tools.wikimedia.de/~edwardspec/cgi-bin/top100_ng.cgi
-- Alexander Sigachov
I am sure that many other wikis would like to have this as well. But other page count systems have been removed in the past by Brion for privacy reasons. If this system gets Brion's blessing, can this service then be expanded to other wikis?
I am sure that many other wikis would like to have this as well. But other page count systems have been removed in the past by Brion for privacy reasons. If this system gets Brion's blessing, can this service then be expanded to other wikis?
It's a completely anonymous referrer and hit counter. The statistics are not a list of hits, but the number of hits for each page.
Brion said it was probably okay to use this for the moment, but eventually, developers will want to replace it.
http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335...
-- Alexander Sigachov
On 06/07/06, Александр Сигачёв alexander.sigachov@gmail.com wrote:
Brion said it was probably okay to use this for the moment, but eventually, developers will want to replace it.
Just to ask...where was this stated? I'd have thought that there'd be some archive evidence of our CTO agreeing to this.
It might be ok to use as far as Brion is concerned, but I'm rather interested in the sort of load it's placing on Zedler...
Rob Church
On 7/5/06, Rob Church robchur@gmail.com wrote:
Just to ask...where was this stated? I'd have thought that there'd be some archive evidence of our CTO agreeing to this.
Alexander provided a link: http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335.... Apparently he asked Anthere, who asked Brion.
Alexander provided a link: http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335.... Apparently he asked Anthere, who asked Brion.
Yes, but that was not me. I actually voted against the counter (but the majority of the Russian Wikipedia community was for it).
The developer of the counter is Edward Chernenko. http://meta.wikimedia.org/wiki/User:Edward_Chernenko
-- Alexander Sigachov http://meta.wikimedia.org/wiki/User:ajvol
On 06/07/06, Simetrical Simetrical+wikitech@gmail.com wrote:
On 7/5/06, Rob Church robchur@gmail.com wrote:
Just to ask...where was this stated? I'd have thought that there'd be some archive evidence of our CTO agreeing to this.
Alexander provided a link: http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335.... Apparently he asked Anthere, who asked Brion.
I would still have expected there to be a public archive of it.
Rob Church