> But other page count systems have been removed in the past by Brion because of privacy reasons.
Well, it *is* a pretty good reason. If you store any logs and are a high-profile source of public information, and those logs can in any way be linked back to a specific user, then you must assume that sooner or later someone may take you to court to get access to those logs.

Suppose a "person of interest" has been reading Wikipedia information on the chemistry of explosives, or reading up on biological pathogens, or military installations, etc. That is exactly the kind of thing that certain areas of law enforcement would like to know, and be able to use against people in court. Before you say "conspiracy theory!", remember that Google has this problem (for searches that people have conducted, which it does record), and libraries have this problem (for books that people have borrowed, which libraries also record). I'm actually surprised that Wikipedia has not had this problem yet, and I can only presume that it's because there are no logs.

The single easiest way to avoid the problem is to not keep any logs (besides those which are already public, such as the edit histories). There's a counterargument that some of these people may really be evil, but the reality is that the databases are located in the US, and the current US government has repeatedly demonstrated a thorough contempt for civil liberties (fingerprinting foreign nationals entering the US as though they were criminals, arresting people wearing T-shirts with protest slogans, illegal wiretaps, indefinite imprisonment without due process at Guantanamo, the practise of "rendition", arresting people photographing bridges, and the list of abuses goes on and on).

For my 2 cents, concern over legal problems & potential abuse of the data far outweighs my desire to know how many people have viewed, say, the "Mickey Mouse" page.
All the best, Nick.
On 7/6/06, Nick Jenkins nickpj@gmail.com wrote:
> > But other page count systems have been removed in the past by Brion because of privacy reasons.
> Well, it *is* a pretty good reason. If you store any logs and are a high-profile source of public information, and those logs can in any way be linked back to a specific user, then you must assume that sooner or later someone may take you to court to get access to those logs.
And then the way to stop this is to abstract the logs for the traffic we want, and throw the raw logs away as quickly as possible. Something like the level of data that Google Trends can provide to the public is basically the type of thing we'd want to have (broad numbers on only the most popular search terms/pages).
On 7/6/06, Abigail Brady morwen@evilmagic.org wrote:
> And then the way to stop this is to abstract the logs for the traffic we want, and throw the raw logs away as quickly as possible. Something like the level of data that Google Trends can provide to the public is basically the type of thing we'd want to have (broad numbers on only the most popular search terms/pages).
Yep. If the biggest concerns are disk space and privacy, then the answer is obviously to collect logs for short periods of time that look vaguely like:
[[George W Bush]]  130.158.1.4  1/4/2006 12:00
[[Bill Clinton]]   130.158.1.4  1/4/2006 12:01
[[George W Bush]]  200.0.0.4    1/4/2006 12:01
then every few hours or even minutes reprocess them into this sort of format:

[[George W Bush]]  2  1/4/2006
[[Bill Clinton]]   1  1/4/2006
and discard the original log files. Less disk space (entries that receive less than N hits could even be discarded altogether from the aggregate log) and no privacy concerns.
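A minimal sketch of that aggregation pass, in Python (the exact log format and the cutoff N here are just assumptions for illustration, not a real implementation):

#!/usr/bin/env python
# Aggregate raw "page ip date time" entries into daily per-page counts,
# so the raw log (with its IPs) can be thrown away immediately afterwards.
from collections import defaultdict
import sys

N = 2  # hypothetical cutoff: drop pages with fewer than N hits per day

counts = defaultdict(int)
for line in sys.stdin:
    try:
        # split from the right, since page titles can contain spaces
        page, ip, date, time_of_day = line.rstrip().rsplit(None, 3)
    except ValueError:
        continue  # skip malformed lines
    counts[(page, date)] += 1  # the IP and time of day are discarded here

for (page, date), hits in sorted(counts.items()):
    if hits >= N:
        print(page, hits, date)

Run it every few hours over the raw log, keep its output, and delete the input.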
I understand if there's no one to actually implement this at the moment though.
Steve
On Thu, Jul 06, 2006 at 07:58:45AM +0200, Steve Bennett wrote:
> On 7/6/06, Abigail Brady morwen@evilmagic.org wrote:
> > And then the way to stop this is to abstract the logs for the traffic we want, and throw the raw logs away as quickly as possible. Something like the level of data that Google Trends can provide to the public is basically the type of thing we'd want to have (broad numbers on only the most popular search terms/pages).
> Yep. If the biggest concerns are disk space and privacy, then the answer is obviously to collect logs for short periods of time that look vaguely like:
> [[George W Bush]]  130.158.1.4  1/4/2006 12:00
> [[Bill Clinton]]   130.158.1.4  1/4/2006 12:01
> [[George W Bush]]  200.0.0.4    1/4/2006 12:01
> then every few hours or even minutes reprocess them into this sort of format:
> [[George W Bush]]  2  1/4/2006
> [[Bill Clinton]]   1  1/4/2006
> and discard the original log files. Less disk space (entries that receive less than N hits could even be discarded altogether from the aggregate log) and no privacy concerns.
> I understand if there's no one to actually implement this at the moment though.
I'd like to revive this discussion.
I have time (several full workweeks, if needed) to implement this at the moment. Is there anyone else who would be interested, would be capable of helping, or would be capable of authorizing the work?
-Erik Garrison
I'm also interested... but I'm concerned that a vandal could find articles that are never looked at and target those. RC patrol can't get everything.
On 7/25/06, Erik Garrison erik.garrison@gmail.com wrote:
> On Thu, Jul 06, 2006 at 07:58:45AM +0200, Steve Bennett wrote:
> > On 7/6/06, Abigail Brady morwen@evilmagic.org wrote:
> > > And then the way to stop this is to abstract the logs for the traffic we want, and throw the raw logs away as quickly as possible. Something like the level of data that Google Trends can provide to the public is basically the type of thing we'd want to have (broad numbers on only the most popular search terms/pages).
> > Yep. If the biggest concerns are disk space and privacy, then the answer is obviously to collect logs for short periods of time that look vaguely like:
> > [[George W Bush]]  130.158.1.4  1/4/2006 12:00
> > [[Bill Clinton]]   130.158.1.4  1/4/2006 12:01
> > [[George W Bush]]  200.0.0.4    1/4/2006 12:01
> > then every few hours or even minutes reprocess them into this sort of format:
> > [[George W Bush]]  2  1/4/2006
> > [[Bill Clinton]]   1  1/4/2006
> > and discard the original log files. Less disk space (entries that receive less than N hits could even be discarded altogether from the aggregate log) and no privacy concerns.
> > I understand if there's no one to actually implement this at the moment though.
> I'd like to revive this discussion.
> I have time (several full workweeks, if needed) to implement this at the moment. Is there anyone else who would be interested, would be capable of helping, or would be capable of authorizing the work?
> -Erik Garrison
On 7/25/06, mboverload mboverload@gmail.com wrote:
> I'm also interested... but I'm concerned that a vandal could find articles that are never looked at and target those. RC patrol can't get everything.
More information is a good thing, not a bad thing. If there was a mechanism by which a vandal could target rarely viewed pages, then there could be a mechanism whereby changes made to rarely viewed pages (especially by anons!) are highlighted to patrollers.
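Roughly, such a mechanism could look like this (a sketch only; the data structures and the view threshold are hypothetical, not anything that exists today):

# Flag recent changes to rarely viewed pages for patrollers,
# with anonymous edits sorted to the front.
LOW_VIEW_THRESHOLD = 10  # hypothetical daily-view cutoff for "rarely viewed"

def flag_changes(recent_changes, daily_views):
    flagged = [c for c in recent_changes
               if daily_views.get(c["page"], 0) < LOW_VIEW_THRESHOLD]
    flagged.sort(key=lambda c: not c["is_anon"])  # anon edits first
    return flagged

# Example with made-up data:
changes = [{"page": "[[Obscure Village]]", "is_anon": True},
           {"page": "[[George W Bush]]", "is_anon": True}]
views = {"[[George W Bush]]": 50000, "[[Obscure Village]]": 3}
for c in flag_changes(changes, views):
    print("needs patrol:", c["page"])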
Steve
Erik, Thanks for re-starting this thread!
It seems there are two main questions:
1. What would be the uses of readership data in the various forms that are potentially available?
2. How could privacy concerns best be balanced and addressed in the context of those potential uses?
I have access, for starters, to 0.2 TB of disk that could be devoted to readership data; it seems storage and other technical hurdles could be overcome if there is consensus about what data the community wants stored, with what privacy guarantees, and to what end.
There's some discussion of these questions at http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikidemia/Quant/Readershi... and http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikidemia/Quant/Security, and further contributions to these pages would be more than welcome; the Thursday afternoon session at Hacking Days (i.e., a week from tomorrow) will hopefully press forward to some (at least interim) solution.
Jeremy
On Wed, Jul 26, 2006 at 09:37:05AM +0200, Steve Bennett wrote:
> On 7/25/06, mboverload mboverload@gmail.com wrote:
> > I'm also interested... but I'm concerned that a vandal could find articles that are never looked at and target those. RC patrol can't get everything.
> More information is a good thing, not a bad thing. If there was a mechanism by which a vandal could target rarely viewed pages, then there could be a mechanism whereby changes made to rarely viewed pages (especially by anons!) are highlighted to patrollers.
Exactly.
I think there are a lot of tools of this nature which could be created to aid the WP community in general, researchers (of which I am one), and site admins in creating, understanding, and maintaining this community. I've set up a page (Wikipedia:Pageviews) to discuss possible problems with maintaining such data, tools to counter them, and tools that use pageview sampling to do useful things for the community.
Wikipedia's individual articles are immensely dependent on readership for their survival, stability, and improvement. It would be a shame to continue to stall on this crucial issue.
As Jeremy writes, this is something which will be discussed now and through Hacking Days. If you are interested, please contribute to the pages he mentions, as well as WP:Pageviews.
I and others who have taken part in this thread have ample time and resources to solve this problem. I _refuse_ to let this issue die again.
-Erik
It's a good thing. But instead of discussing it on [[en:Wikipedia:Pageviews]], it should be done at [[meta:Pageviews]].
You should also check the work already done by LeonWeber. It's live on de:. The JS code is in http://de.wikipedia.org/wiki/MediaWiki:Pagecounter.js and the tool is running on the toolserver; http://pgcount.wikimedia.de/ is an alias.
You can see the top 100 queries at http://tools.wikimedia.de/~leon/stats/pageview/temp.php. It's beta but will be released in the next few days. He'll implement it for other wikis... after vacations.
Only showing the top doesn't reveal less-seen pages (and remember that seen != watched).
Another problem could be avoiding spamming the view list.
On Thu, Jul 27, 2006 at 04:33:33PM +0200, Platonides wrote:
> It's a good thing. But instead of discussing it on [[en:Wikipedia:Pageviews]], it should be done at [[meta:Pageviews]].
Cool. I'll move it.
> You should also check the work already done by LeonWeber. It's live on de:. The JS code is in http://de.wikipedia.org/wiki/MediaWiki:Pagecounter.js and the tool is running on the toolserver; http://pgcount.wikimedia.de/ is an alias.
I've heard about this.
> You can see the top 100 queries at http://tools.wikimedia.de/~leon/stats/pageview/temp.php. It's beta but will be released in the next few days. He'll implement it for other wikis... after vacations.
I was thinking that it would be very useful and interesting if:
1) This data was kept in a WM database, so it could be accessed from any page's "history".
2) For admins and (perhaps) logged-in users, other tools were available that would rank search queries, category listings, etc. by pageviews.
Not *trivial*. But doable.
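For example, a category listing ranked by views might amount to little more than this (a sketch; the inputs are hypothetical stand-ins for the real database access):

# Rank the members of a category by aggregate view count, most viewed first.
def rank_by_views(pages, view_counts):
    return sorted(pages, key=lambda p: view_counts.get(p, 0), reverse=True)

members = ["[[Bill Clinton]]", "[[George W Bush]]", "[[Millard Fillmore]]"]
views = {"[[George W Bush]]": 5000, "[[Bill Clinton]]": 3000}
for page in rank_by_views(members, views):
    print(views.get(page, 0), page)

The hard part is plumbing the data through, not the ranking itself.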
> Only showing the top doesn't reveal less-seen pages (and remember that seen != watched).
Exactly. It would be nice to see it on a page-by-page basis. However, it looks like past the top 100 the resolution is pretty low (small quantities can't carry much information). Would it be worthwhile to count a larger percentage of the pageviews so that we can see the minor differences between page rank #1000 and page rank #10,000? Could this method scale up to 5% of pageviews, 10%, etc.?
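The sampling arithmetic itself is trivial; here is a sketch (where exactly the counting hook would sit in the request path is an open assumption):

import random
from collections import defaultdict

SAMPLE_RATE = 0.05  # count 5% of requests; raise for finer resolution
sampled = defaultdict(int)

def record_view(page):
    # Count each request independently with probability SAMPLE_RATE.
    if random.random() < SAMPLE_RATE:
        sampled[page] += 1

def estimated_views(page):
    # Scale the sampled count back up to estimate the true count.
    return int(sampled[page] / SAMPLE_RATE)

The relative error of the estimate shrinks roughly as one over the square root of the number of sampled hits, so a higher rate is exactly what would separate rank #1000 from rank #10,000.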
> Another problem could be avoiding spamming the view list.
I'm not clear what you mean by this.
-Erik
> > Another problem could be avoiding spamming the view list.
> I'm not clear what you mean by this.
> -Erik
We talked about it on IRC. Leon (the user) pushed Leon (the article) to second place in the top 100 with a tiny script, in minutes. So it's not very reliable.
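One way to make that kind of inflation harder would be to count each page/IP pair at most once per time window, something like this (a sketch; the window length is arbitrary, and even this table has privacy implications, so hashing the IP might be wise):

import time
from collections import defaultdict

WINDOW = 3600  # seconds: at most one counted hit per page per IP per hour

last_seen = {}            # (page, ip) -> timestamp of last counted hit
counts = defaultdict(int)

def count_view(page, ip, now=None):
    now = time.time() if now is None else now
    key = (page, ip)  # consider hashing the ip to reduce the privacy cost
    if now - last_seen.get(key, 0) >= WINDOW:
        last_seen[key] = now
        counts[page] += 1

A single script hammering one article would then add at most one hit per hour per IP.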
Keeping the information on the WM servers should be done very carefully. Page counting could also affect the wikis' traffic flow (did you realize images are kept on another server, so if it's overloaded you can still read the articles?). Automatic page counting is disabled on Wikipedia because of the load it would create. Archiving in the DB should be discarded, I think, at least in the *same* DB. There's also little need for full replication of the statistics data.
What would be neat would be having the tools recognize users: configurable tools, the deletions browser available to those with deletedhistory rights, configuration via monobook.js... Hopefully, single login will provide some external auth system.