Bjoern Hoehrmann wrote:
When making http://katograph.appspot.com/, which renders the German
Wikipedia category system as an interactive "treemap" based on
information like the number of articles in each category and requests
during a 3-day period, I found that the proxy logs used for
stats.grok.se are rather unreliable, with many of the "top" pages being
implausible (articles on not very notable subjects that have existed
only for a very short time show up in the top ten, for instance). On
http://stats.grok.se/en/top you can see this as well: 40 million views
for `Special:Export/Robert L. Bradley, Jr` is rather implausible, as far
as human users are concerned.
Yes, the data is susceptible to manipulation, both intentional and
unintentional. As I said, this was a first-pass implementation on Domas'
part. As far as I know, this hasn't been touched by anyone in years. You're
absolutely correct that, at the end of the day, until the data itself is
better (more reliable), the resulting tools, graphs, scripts, and anything
else that relies on it will be bound by its limitations.
MZMcBride wrote:
Is it worth a Toolserver user's time to try to create a database of
per-project, per-page page view statistics? Is it worth a grant from the
Wikimedia Foundation to have someone work on this? Is it worth trying to
convince Wikimedia Deutschland to assign resources? And, of course, it
wouldn't be a bad idea if Domas' first-pass implementation was improved on
Wikimedia's side, regardless.
The data that powers stats.grok.se is available for download; it should
be rather trivial to feed it into Toolserver databases and query it as
desired, ignoring performance problems.
Not simply performance. It's a lot of data and it needs to be indexed. That
has a real cost. There are also edge cases and corner cases (different
encodings of requests, etc.) that need to be accounted for. It's not a
particularly small undertaking, if it's to be done properly.
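
To illustrate the encoding edge case, here is a rough sketch in Python. It
assumes each log line looks like "project page_title count bytes", separated
by whitespace; that layout, the function names, and the sample lines are my
assumptions for illustration, not a description of what stats.grok.se
actually does. It only shows that percent-encoded and literal requests for
the same title have to be merged before the counts mean anything.

```python
from collections import Counter
from urllib.parse import unquote


def normalize_title(raw_title):
    """Collapse differently encoded requests for the same page into one key."""
    # Decode percent-escapes as UTF-8; invalid bytes become U+FFFD
    # instead of raising an exception.
    title = unquote(raw_title, encoding="utf-8", errors="replace")
    # Treat spaces and underscores as equivalent in titles.
    return title.replace(" ", "_")


def aggregate(lines):
    """Sum view counts per (project, normalized title).

    Assumed line layout (hypothetical): project, title, count, bytes.
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        project, raw_title, count, _size = parts
        if not count.isdigit():
            continue  # skip lines with a non-numeric count
        totals[(project, normalize_title(raw_title))] += int(count)
    return totals


# Made-up sample lines: two encodings each of two titles, plus one bad line.
sample = [
    "de Z%C3%BCrich 10 1000",
    "de Zürich 5 500",
    "en Main_Page 7 700",
    "en Main%20Page 3 300",
    "garbage line",
]
totals = aggregate(sample)
```

Even this toy version ignores redirects, namespace aliases, first-letter
capitalization, and non-UTF-8 requests, which is where the real work would
be. Indexing the result in a database is a separate cost on top of that.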
MZMcBride