An approximate and inaccurate pageview counter is better than none. Currently we have none, and the only reason I've heard is that our use of squids makes it impossible to count all pageviews reliably. Surely it would be useful to have a counter that does not aim to be precise, but at least gives an approximation.
Would it be possible to re-enable the pageview counter and rephrase the message so that it says something like, "This page was viewed approximately <x> times."? The <x> is a value calculated from the non-squid pageview count. Perhaps with an "(explain)" link next to it that sends the user to a Help page that explains why the counter is approximate and how the estimate was calculated.
Do we have any statistics that allow us to estimate an approximate number of pageviews given the number of pageviews that came through to the servers?
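For instance (with a purely made-up figure): if the squids served, say, 75% of article requests from cache, the displayed number could simply be the backend count divided by (1 - 0.75), i.e. scaled up by a factor of four. The real cache hit ratio would have to come from whoever runs the squids.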
Timwi
Timwi wrote:
An approximate and inaccurate pageview counter is better than none. Currently we have none, and the only reason I've heard is that our use of squids makes it impossible to count all pageviews reliably. Surely it would be useful to have a counter that does not aim to be precise, but at least gives an approximation.
Would it be possible to re-enable the pageview counter and rephrase the message so that it says something like, "This page was viewed approximately <x> times."? The <x> is a value calculated from the non-squid pageview count. Perhaps with an "(explain)" link next to it that sends the user to a Help page that explains why the counter is approximate and how the estimate was calculated.
Do we have any statistics that allow us to estimate an approximate number of pageviews given the number of pageviews that came through to the servers?
Timwi
Why not use a daemon to scan the logs on each squid, and then only increment the database read count for (say) one in a hundred, or one in a thousand, of these hits, chosen (pseudo)randomly, with the increment being 100 if only 1 in 100 hits is logged, and so on. Over time, the law of large numbers will make this average out to reasonably representative hit counts.
Something as simple as the following run at intervals from a cron job should do it, combined with logrotate. Since it's mostly CPU-intensive, the analysis job can be niced down to avoid interfering with the main task of squid processing.
If this would not be practical, please let me know why.
# Note: pythonesque pseudocode, not a real program
import random, time, re  # and SQL stuff...

N = 1000  # or some other suitable value...

conn = make_sql_connection()

def log_article_hits(url, count):
    time.sleep(5)  # to avoid hitting the DB too hard...
    # stub: should parse URL using regexp for project/language/article title,
    # then call SQL code that increments the article read counter by "count"
    return

for line in open("/var/log/squid/access.log"):  # or whatever...
    fields = line.split()
    if len(fields) < 7:
        continue
    url = fields[6]
    if random.randint(1, N) == 1:
        log_article_hits(url, N)
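Here, as a rough sketch, is the sort of URL parsing the stub above hand-waves over; the hostname/path pattern and the parse_article_url name are just assumptions for illustration, and the exact URL format in the squid logs would need checking:

import re

# hypothetical URL shape: http://<lang>.<project>.org/wiki/<title>
ARTICLE_URL_RE = re.compile(r"^https?://([^./]+)\.([^./]+)\.org/wiki/(.+)$")

def parse_article_url(url):
    # Returns (language, project, article title), or None if the URL
    # does not look like an article view.
    m = ARTICLE_URL_RE.match(url)
    if m is None:
        return None
    return m.groups()

# e.g. parse_article_url("http://en.wikipedia.org/wiki/Squid")
#      -> ("en", "wikipedia", "Squid")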
-- Neil
On 7/5/06, Neil Harris neil@tonal.clara.co.uk wrote:
Why not use a daemon to scan the logs on each squid, and then only
Are these logs actually kept?
Steve
On 05/07/06, Neil Harris neil@tonal.clara.co.uk wrote:
If this would not be practical, please let me know why.
The fact that we don't retain the logs is perhaps a wee bit prohibitive.
Rob Church
On 7/5/06, Rob Church robchur@gmail.com wrote:
On 05/07/06, Neil Harris neil@tonal.clara.co.uk wrote:
If this would not be practical, please let me know why.
The fact that we don't retain the logs is perhaps a wee bit prohibitive.
Since so many people seem to find the idea of log analysis attractive, would it be possible to have a bit more information about:
- which logs *could* be kept with a simple flick of a switch
- why those are not kept
- whether they could be turned on for brief periods (like 24 hours) to allow periodic data collection
- what alternative solutions might exist
It seems like there are at least three different places where log data could be collected:
- on the mysql database - probably very "expensive"
- on mediawiki (ie, in php code) - probably much more attractive, could add tuning to only record every 10th or 100th hit or whatever
- on the "squids" (presumably, proxy servers) - no idea
- some other web mechanism, like tomcat? or whatever web server is running (or is that squid??)
Is there absolutely no way that data could be collected at any of these points, even for short periods, and even filtered?
Steve
On 05/07/06, Steve Bennett stevage@gmail.com wrote:
- why those are not kept
Last I heard, disk space. Logging each of ~11-12 thousand hits per second => full disk.
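(Back-of-the-envelope, and the bytes-per-line figure is only an assumption: at ~150 bytes per access-log line, 12,000 hits/s comes to about 1.8 MB/s, or roughly 150 GB of raw logs per day across the cluster.)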
- whether they could be turned on for brief periods (like 24 hours) to
allow periodic data collection
Possible, see below...
- what alternative solutions might exist
It seems like there are at least three different places where log data could be collected:
- on the mysql database - probably very "expensive"
Forget it.
- on mediawiki (ie, in php code) - probably much more attractive,
could add tuning to only record every 10th or 100th hit or whatever
You can't "store" stuff in PHP; it would have to log to the filesystem or elsewhere anyway.
- on the "squids" (presumably, proxy servers) - no idea
Squid in this case refers to the Squid web caching software. "The Squids" is our semi-affectionate name for the bundles of caching proxies that stop millions of queries from killing the rest of our cluster.
Is there absolutely no way that data could be collected at any of these points, even for short periods, and even filtered?
It's been thrown about before a lot, and a lot of "perhaps" is said, but not a lot of work is done. Periodic statistics collection could mean the sample is not quite consistent, but...meh.
There are lots of people stating it can be done, but not a lot of them doing it.
Rob Church
Rob Church wrote:
On 05/07/06, Neil Harris neil@tonal.clara.co.uk wrote:
If this would not be practical, please let me know why.
The fact that we don't retain the logs is perhaps a wee bit prohibitive.
Rob Church
Aha. Then how about adding a patch to squid that goes something like:
if (do_throttled_logging && ((throttled_logging_count++ % throttled_logging_ratio) == 0)) generate_normal_squid_logfile_line(...);
that would then generate one logfile line for every throttled_logging_ratio hits? If this ratio is 1000, then logs one-thousandth of the normal size will be generated, at one-thousandth of the load that might be expected from full-on logging.
Assuming N=1000 and 100 hits/second/squid, that would be one log entry per ten seconds on each machine, or roughly 8,600 log entries per day per machine, which would take up perhaps 1 Mbyte: rather different from the Gbyte or so of logfile which would be generated daily if full logging were turned on.
The logs can then be aggregated and analyzed in the usual ways.
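A minimal sketch of what that aggregation could look like, assuming the sampled logs keep the usual squid access-log layout (URL in the seventh field) and the same sampling ratio N as in the patch above:

import sys
from collections import Counter

N = 1000  # must match the sampling ratio used on the squids

counts = Counter()
for path in sys.argv[1:]:      # sampled log files gathered from each squid
    for line in open(path):
        fields = line.split()
        if len(fields) < 7:
            continue
        url = fields[6]
        counts[url] += N       # each sampled line stands in for ~N real hits

# print the estimated top 20 pages
for url, estimate in counts.most_common(20):
    print("%s\t%d" % (url, estimate))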
-- Neil
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia.
See
http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
http://tools.wikimedia.de/~edwardspec/cgi-bin/top100_ng.cgi
-- Alexander Sigachov
Александр Сигачёв wrote:
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia. http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
That's clever. I can think of a couple of minor improvements, but the obvious one, if this was implemented on the English Wikipedia, would be to only insert the link randomly on, say, one page in 100.
Of course, the sample will be biased since the trick only works with Javascript, but for many statistical purposes that is acceptable.
We are doing something similar: we've created a special page that takes an article ID and increments the counter for it. We put in a 0 px image that references it, like this:
<img src="/imagecounter.gif?id=29204" height=0 width=0>
And use a rewrite rule in apache:
RewriteRule ^/imagecounter.gif(.*)$ /index.php/Special:Imagecounter$1 [L,PT]
You have to disable the code that increments the count in Article.php, though. SpecialImagecounter.php just looks like:
function wfSpecialImagecounter( $par ) {
    global $wgRequest, $wgSitename, $wgLanguageCode;
    global $wgDeferredUpdateList;
    $fname = "wfSpecialImagecounter";
    $id = $wgRequest->getVal("id");
    $t = Title::newFromID($id);
    if ($t != null)
        Article::incViewCount( $t->getArticleID() );
    $u = new SiteStatsUpdate( 1, 0, 0 );
    array_push( $wgDeferredUpdateList, $u );
    header('Content-type: image/gif');
    exit;
}
On 7/5/06, Ilmari Karonen nospam@vyznev.net wrote:
Александр Сигачёв wrote:
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia. http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
That's clever. I can think of a couple of minor improvements, but the obvious one, if this was implemented on the English Wikipedia, would be to only insert the link randomly on, say, one page in 100.
Of course, the sample will be biased since the trick only works with Javascript, but for many statistical purposes that is acceptable.
-- Ilmari Karonen
Александр Сигачёв wrote:
An approximate and inaccurate pageview counter is better than none.
We have something like this in Russian Wikipedia.
See
http://meta.wikimedia.org/wiki/Ruwiki_pgcounter
http://tools.wikimedia.de/~edwardspec/cgi-bin/top100_ng.cgi
-- Alexander Sigachov
I am sure that many other wikis would like to have this as well. But other page count systems have been removed in the past by Brion for privacy reasons. If this system gets Brion's blessing, can this service then be expanded to other wikis?
I am sure that many other wikis would like to have this as well. But other page count systems have been removed in the past by Brion for privacy reasons. If this system gets Brion's blessing, can this service then be expanded to other wikis?
It's a completely anonymous referrer and hit counter. The statistics are not a list of hits, but the number of hits for each page.
Brion said it was probably okay to use this for the moment, but eventually, developers will want to replace it.
http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335...
-- Alexander Sigachov
On 06/07/06, Александр Сигачёв alexander.sigachov@gmail.com wrote:
Brion said it was probably okay to use this for the moment, but eventually, developers will want to replace it.
Just to ask...where was this stated? I'd have thought that there'd be some archive evidence of our CTO agreeing to this.
It might be ok to use as far as Brion is concerned, but I'm rather interested in the sort of load it's placing on Zedler...
Rob Church
On 7/5/06, Rob Church robchur@gmail.com wrote:
Just to ask...where was this stated? I'd have thought that there'd be some archive evidence of our CTO agreeing to this.
Alexander provided a link: http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335.... Apparently he asked Anthere, who asked Brion.
Alexander provided a link: http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335.... Apparently he asked Anthere, who asked Brion.
Yes, but that was not me. I actually voted against the counter (but the majority of the Russian Wikipedia community was for it).
The developer of the counter is Edward Chernenko. http://meta.wikimedia.org/wiki/User:Edward_Chernenko
-- Alexander Sigachov http://meta.wikimedia.org/wiki/User:ajvol
On 06/07/06, Simetrical Simetrical+wikitech@gmail.com wrote:
On 7/5/06, Rob Church robchur@gmail.com wrote:
Just to ask...where was this stated? I'd have thought that there'd be some archive evidence of our CTO agreeing to this.
Alexander provided a link: http://meta.wikimedia.org/w/index.php?title=User_talk:Anthere&diff=36335.... Apparently he asked Anthere, who asked Brion.
I would still have expected there to be a public archive of it.
Rob Church