Do I correctly remember that MediaWiki projects do not keep log files, statistics, and the like (other than what can be gleaned from the database itself) to reduce server load? I think I remember somebody saying that even log files are not kept... or that could have been some other reality.
I found http://en.wikipedia.org/wiki/Wikipedia:Statistics and http://stats.wikimedia.org/EN/ChartsWikipediaEN.htm
And a few other things, but nothing that looks like it would answer some of the questions being asked.
So - hitting an external log server with the standard client-issued image/JavaScript counter would capture some of that. More data could be gleaned by combining that with the info available from the database.
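Roughly what I'm picturing on the client side (just a sketch - the stats hostname and counter path are made up, and it assumes the page title is available to JavaScript via MediaWiki's wgPageName global):

    // Fire-and-forget hit counter: request a 1x1 image from an external
    // statistics host so the page view shows up in that host's logs.
    var beacon = new Image();
    beacon.src = 'http://stats.example.org/count.gif'
               + '?page=' + encodeURIComponent(wgPageName)
               + '&r=' + Math.random();   // cache-buster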
Has this already been discussed/beaten to death? Is it a dumb idea?
Anybody got a server and bandwidth to take a gazillion hits to crunch some additional statistics? :-)
On Mon, Aug 14, 2006 at 01:56:13PM -0700, Aerik Sylvan wrote:
Do I correctly remember that MediaWiki projects do not keep log files, statistics, and the like (other than what can be gleaned from the database itself) to reduce server load? I think I remember somebody saying that even log files are not kept... or that could have been some other reality.
Well, it's the reality I'm in. No log files are being written on the squids or the apaches.
I found http://en.wikipedia.org/wiki/Wikipedia:Statistics and http://stats.wikimedia.org/EN/ChartsWikipediaEN.htm
And a few other things, but nothing that looks like it would answer some of the questions being asked.
So - hitting an external log server with the standard client-issued image/JavaScript counter would capture some of that. More data could be gleaned by combining that with the info available from the database.
Has this already been discussed/beaten to death? Is it a dumb idea?
Some Wikipedias have systems like this set up. According to the statistics of the German Wikipedia, a user's discussion page is the second most visited page, with 10% fewer hits than the main page. So, apparently this method is not producing very reliable data.
The most accurate method would be to make the squid servers send short UDP messages to a central log server. This would not affect the performance of the squids very much, and the log server could aggregate the data before writing it to disk from time to time.
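The receiving end could be very small. A sketch (Node.js-style JavaScript, purely illustrative - the one-page-title-per-datagram format and the port are assumptions, not anything the squids actually emit today):

    // Central log server: count page titles arriving as UDP datagrams in
    // memory, and flush the aggregated counts to disk once a minute.
    const dgram = require('dgram');
    const fs = require('fs');

    const counts = {};
    const sock = dgram.createSocket('udp4');

    sock.on('message', (msg) => {
        const page = msg.toString().trim();
        counts[page] = (counts[page] || 0) + 1;
    });

    setInterval(() => {
        fs.appendFileSync('hits.log', JSON.stringify(counts) + '\n');
        for (const k in counts) delete counts[k];   // start a fresh aggregation window
    }, 60 * 1000);

    sock.bind(5555);   // arbitrary port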
Regards,
JeLuF
On Tue, Aug 15, 2006 at 02:04:37PM +0200, Jens Frank wrote:
On Mon, Aug 14, 2006 at 01:56:13PM -0700, Aerik Sylvan wrote:
Do I correctly remember that MediaWiki projects do not keep log files, statistics, and the like (other than what can be gleaned from the database itself) to reduce server load? I think I remember somebody saying that even log files are not kept... or that could have been some other reality.
Well, it's the reality I'm in. No log files are being written on the squids or the apaches.
...
So - hitting an external log server with the standard client-issued image/JavaScript counter would capture some of that. More data could be gleaned by combining that with the info available from the database.
Has this already been discussed/beaten to death? Is it a dumb idea?
Some Wikipedias have systems like this set up. According to the statistics of the German Wikipedia, a user's discussion page is the second most visited page, with 10% fewer hits than the main page. So, apparently this method is not producing very reliable data.
Then there must be some flaw in the implementation. Otherwise, the method (running some client-side JavaScript that contacts a statistics server) is industry standard (e.g. Google Analytics works this way).
Moreover, it could easily be done with modest hardware and existing statistics software. It's unnecessary to process the whole huge dataset - it would be enough to process a random selection, which can easily be done with JS... the JS code would contact the statistics server if some $random > 0.999, for example.
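For example (sketch only - the statistics hostname is made up, and wgPageName is assumed to hold the current page title):

    // Report roughly one page view in a thousand to the statistics host.
    if (Math.random() > 0.999) {
        var img = new Image();
        img.src = 'http://stats.example.org/count.gif'
                + '?page=' + encodeURIComponent(wgPageName);
    }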
Jan Kulveit
On Tue, Aug 15, 2006 at 03:50:25PM +0200, Jan Kulveit wrote:
On Tue, Aug 15, 2006 at 02:04:37PM +0200, Jens Frank wrote:
Then there must be some flaw in the implementation. Otherwise, the method (running some client-side JavaScript that contacts a statistics server) is industry standard (e.g. Google Analytics works this way).
Moreover, it could easily be done with modest hardware and existing statistics software. It's unnecessary to process the whole huge dataset - it would be enough to process a random selection, which can easily be done with JS... the JS code would contact the statistics server if some $random > 0.999, for example.
That's exactly how the JS for the German wiki works and that's also why it doesn't work. With only a few requests you can fake the statistics, since every hit on the statistics server counts as 1000 page views.
Regards,
jens
On Tue, Aug 15, 2006 at 07:28:33PM +0200, Jens Frank wrote:
On Tue, Aug 15, 2006 at 03:50:25PM +0200, Jan Kulveit wrote:
On Tue, Aug 15, 2006 at 02:04:37PM +0200, Jens Frank wrote:
Then there must be some flaw in the implementation. Otherwise, the method (running some client-side JavaScript that contacts a statistics server) is industry standard (e.g. Google Analytics works this way).
Moreover, it could easily be done with modest hardware and existing statistics software. It's unnecessary to process the whole huge dataset - it would be enough to process a random selection, which can easily be done with JS... the JS code would contact the statistics server if some $random > 0.999, for example.
That's exactly how the JS for the German wiki works and that's also why it doesn't work. With only a few requests you can fake the statistics, since every hit on the statistics server counts as 1000 page views.
Ok, that would mean someone deliberately games the system. It's easy to make such games more difficult: if hash_function(server_random_of_the_day, your_ip) > 0.999 then...
The statistics server would compute the same thing and compare.
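Something along these lines (a Node.js-style sketch of the server-side check; the daily secret, the choice of SHA-1, and the threshold are all placeholders):

    const crypto = require('crypto');

    const DAILY_SECRET = 'rotated-once-per-day';   // stands in for server_random_of_the_day
    const THRESHOLD = 0.999;                       // ~0.1% sample

    // Map hash(secret, ip) onto [0, 1) so it can be compared to the threshold.
    function sampleValue(secret, ip) {
        const digest = crypto.createHash('sha1').update(secret + '|' + ip).digest('hex');
        return parseInt(digest.slice(0, 8), 16) / 0x100000000;
    }

    // The statistics server recomputes the value for the connecting IP and
    // simply ignores hits that should not have been sent.
    function hitIsValid(clientIp) {
        return sampleValue(DAILY_SECRET, clientIp) > THRESHOLD;
    }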
Jan Kulveit
On 8/16/06, Jan Kulveit jk-wikitech@ks.cz wrote:
Ok, that would mean someone deliberately games the system. It's easy to make such games more difficult: if hash_function(server_random_of_the_day, your_ip) > 0.999 then...
Yes, we're far from the first people in the world to have to solve the problem of people cheating on statistics, voting, etc...
Steve