Sure, info gets lost. And the Long Tail is meaningful for some research no doubt. But my resources are finite.
Actually, I do store some all-inclusive counts in the compacted 24-hour file:
# Lines starting with ampersand (@) show totals per 'namespace' (including omitted counts for low traffic articles)
# Since valid namespace string are not known in the compression script any string followed by colon (:) counts as possible namespace string
# Please reconcile with real namespace name strings later
# 'namespaces' with count < 5 are combined in 'Other' (on larger wikis these are surely false positives)
@ aa.z Category 9
@ aa.z File 20
@ aa.z Image 9
@ aa.z MediaWiki 20
@ aa.z NamespaceArticles 163
@ aa.z Special 97
@ aa.z Talk 17
@ aa.z User 35
@ aa.z Wikipedia 16
@ aa.z -other- 11
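As a rough illustration of how such '@' totals could be produced, here is a minimal Python sketch. It is not the actual compression script: the input shape (tuples of project, title, count), the function name `namespace_totals`, and the choice to file colon-less titles under `NamespaceArticles` are all assumptions made for the example.

```python
from collections import defaultdict

def namespace_totals(records, min_count=5):
    """Sum request counts per (project, namespace-like prefix).

    records: iterable of (project, title, count) tuples (hypothetical shape).
    Any string followed by a colon is treated as a possible namespace;
    prefixes whose total is below min_count are folded into '-other-'.
    """
    raw = defaultdict(lambda: defaultdict(int))
    for project, title, count in records:
        prefix, sep, _rest = title.partition(":")
        # Assumption: titles without a usable prefix count as articles.
        namespace = prefix if (sep and prefix) else "NamespaceArticles"
        raw[project][namespace] += count

    result = {}
    for project, per_ns in raw.items():
        folded = defaultdict(int)
        for namespace, total in per_ns.items():
            rare = total < min_count and namespace != "NamespaceArticles"
            folded["-other-" if rare else namespace] += total
        result[project] = dict(folded)
    return result

def format_totals(result):
    """Render totals as '@ project namespace count' lines."""
    return [
        f"@ {project} {namespace} {total}"
        for project in sorted(result)
        for namespace, total in sorted(result[project].items())
    ]
```

The prefixes would still need reconciling with the real namespace names later, as the file header says.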
Erik Zachte
-----Original Message-----
From: wikitech-l-bounces@lists.wikimedia.org [mailto:wikitech-l-bounces@lists.wikimedia.org] On Behalf Of Robert Rohde
Sent: Friday, September 18, 2009 02:33
To: Wikimedia developers
Cc: Mathias Schindler; Frédéric Schütz; toolserver-l@lists.wikimedia.org
Subject: Re: [Wikitech-l] [Toolserver-l] Archive of visitor stats
2009/9/17 Erik Zachte erikzachte@infodisiac.com:
I think it is extremely important to keep these files for later analysis by historians and others.
Mathias Schindler also keeps an archive, or at least did until April (Berlin conference). He even bought a dedicated external drive for it.
I collect files daily and merge 24 hourly files into one daily file. That saves a lot of disk space and makes processing faster. Titles with fewer than 10 requests per day are discarded, which also saves a lot.
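The merge-and-discard step described above could look roughly like the sketch below. It assumes each hourly file has already been parsed into a dict mapping title to request count; the function name `merge_hourly` and the `min_daily` parameter are hypothetical, and the real script may well work differently.

```python
from collections import Counter

def merge_hourly(hourly_counts, min_daily=10):
    """Merge per-hour {title: count} dicts into one daily dict.

    Titles whose daily total is below min_daily are dropped,
    mirroring the "fewer than 10 requests per day" rule.
    """
    daily = Counter()
    for hour in hourly_counts:
        daily.update(hour)  # adds counts title by title
    return {title: n for title, n in daily.items() if n >= min_daily}
```

For example, `merge_hourly([{"A": 6, "B": 1}, {"A": 5, "B": 2}])` keeps "A" with 11 requests and drops "B" with 3.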
Careful, a recent analysis I did suggested that 15% of all page requests for articles on Wikipedia are for topics requested less than once per hour. There are a very large number of pages that rarely see hits, but collectively the traffic to such topics is important. You could end up biasing certain kinds of analysis if you always exclude the rarely visited pages.
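To put a number on that concern for a given day, one could measure what share of all requests a cutoff would throw away. The following is only a sketch under assumptions: `daily_counts` is a hypothetical dict of title to daily request count built before any discarding, and the threshold is a parameter, since "less than once per hour" corresponds to fewer than 24 requests per day while the daily files use a cutoff of 10.

```python
def discarded_share(daily_counts, min_daily=10):
    """Fraction of total requests that go to titles below min_daily."""
    total = sum(daily_counts.values())
    if total == 0:
        return 0.0
    dropped = sum(n for n in daily_counts.values() if n < min_daily)
    return dropped / total
```

Running this with `min_daily=24` on unfiltered counts would give a figure comparable to the "less than once per hour" share mentioned above, whatever that turns out to be for the data at hand.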
-Robert Rohde
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l