Bjoern Hoehrmann wrote:
When making http://katograph.appspot.com/, which renders the German Wikipedia category system as an interactive "treemap" based on information such as the number of articles in each category and the requests made during a three-day period, I found that the proxy logs used for stats.grok.se are rather unreliable, with many of the "top" pages being implausible (articles on not very notable subjects that have existed only for a very short time show up in the top ten, for instance). You can see this on http://stats.grok.se/en/top as well: 40 million views for `Special:Export/Robert L. Bradley, Jr` is rather implausible, as far as human users are concerned.
Yes, the data is susceptible to manipulation, both intentional and unintentional. As I said, this was a first-pass implementation on Domas' part. As far as I know, this hasn't been touched by anyone in years. You're absolutely correct that, at the end of the day, until the data itself is better (more reliable), the resulting tools/graphs/scripts/everything that rely on it will be bound by its limitations.
MZMcBride wrote:
Is it worth a Toolserver user's time to try to create a database of per-project, per-page page view statistics? Is it worth a grant from the Wikimedia Foundation to have someone work on this? Is it worth trying to convince Wikimedia Deutschland to assign resources? And, of course, it wouldn't be a bad idea if Domas' first-pass implementation were improved on Wikimedia's side, regardless.
The data that powers stats.grok.se is available for download; it should be rather trivial to feed it into Toolserver databases and query it as desired, ignoring performance problems.
Not simply performance. It's a lot of data and it needs to be indexed. That has a real cost. There are also edge cases and corner cases (different encodings of requests, etc.) that need to be accounted for. It's not a particularly small undertaking, if it's to be done properly.
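To make the encoding point concrete, here is a rough sketch of the kind of normalization that would be needed before counts could be summed per page. It assumes each line of the raw dumps has the form "project title count bytes" separated by spaces, that titles are percent-encoded UTF-8, and that the wiki capitalizes the first letter of titles (not true of every project). The function names are made up for illustration; this is not how stats.grok.se does it.

```python
from collections import Counter
from urllib.parse import unquote


def normalize_title(raw):
    """Fold different request encodings of one title into a single key.

    Assumptions (not from the dumps' documentation): percent-escapes are
    UTF-8, spaces and underscores are equivalent, and the first letter
    is case-insensitive.
    """
    title = unquote(raw, encoding="utf-8", errors="replace")
    title = title.replace(" ", "_").strip("_")
    return title[:1].upper() + title[1:]


def aggregate(lines):
    """Sum view counts per (project, normalized title).

    Lines that do not have exactly four fields, or whose count field is
    not a plain number, are skipped rather than guessed at.
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        project, raw_title, count, _size = parts
        if not count.isdigit():
            continue
        totals[(project, normalize_title(raw_title))] += int(count)
    return totals


if __name__ == "__main__":
    sample = [
        "de K%C3%A4se 3 100",
        "de Käse 2 50",
        "de k%c3%a4se 1 10",
    ]
    for key, views in aggregate(sample).items():
        print(key, views)
```

All three sample lines end up under the same key, which is the point: without this step the same article shows up as several rows. Even this toy version ignores redirects, namespaces aliases, and requests in other encodings such as Latin-1, which is why I say it is not a small undertaking.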
MZMcBride