I think it is extremely important to keep these files for later analysis by historians and others.
Mathias Schindler also keeps an archive, or at least did until April (Berlin conference). He even bought a dedicated external drive for it.
I collect files daily and merge the 24 hourly files into one daily file. That saves a lot of disk space and makes processing faster. Titles with fewer than 10 requests per day are discarded, which also saves a lot.
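A minimal sketch of the merge step described above, not the actual script. It assumes each hourly file has already been parsed into a dict mapping (project, title) to a request count; the function name merge_hourly and the parameter min_daily are hypothetical.

```python
from collections import defaultdict


def merge_hourly(hourly_counts, min_daily=10):
    """Merge 24 hourly count dicts into one daily dict.

    hourly_counts: sequence of 24 dicts, one per hour, each mapping
    (project, title) -> number of requests in that hour (assumed shape).
    Returns a dict mapping (project, title) -> list of 24 hourly counts,
    keeping only titles with at least min_daily requests over the day.
    """
    if len(hourly_counts) != 24:
        raise ValueError("expected exactly 24 hourly dicts")

    daily = defaultdict(lambda: [0] * 24)
    for hour, counts in enumerate(hourly_counts):
        for key, requests in counts.items():
            daily[key][hour] += requests

    # Discard titles below the daily threshold.
    return {key: hours for key, hours in daily.items()
            if sum(hours) >= min_daily}
```

Keeping the threshold as a parameter makes it easy to experiment with how much space different cut-offs save.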
For the remainder, instead of 24 comma-separated values I use a 'sparse array', as follows:
B2D15G2 means 2 views in the 2nd hour (0100-0200), 15 in the 4th, 2 in the 7th. The string starts with the total for the whole day (redundant, but it eases processing for some purposes), so the full string is actually 19B2D15G2.
Example:
de Berlie_Doherty 9L2O1Q1R2T3
de Berliet 20E2F1K1M1N2O3P3Q4R2X1
de Berliet_GBC_8_KT 17B1E1J3M2N1O1P1Q1R2S1T1U1V1
de Berlin 8488A116B56C32D56E21F43G98H172I316J531K636L675M601N533O524P508Q510R576S426T492U530V508W328X200
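To make the format concrete, here is a rough sketch of an encoder and decoder for these strings. It assumes letters A through X stand for hours 1 through 24 (A being 0000-0100, consistent with B being the 2nd hour); the function names encode_day and decode_day are hypothetical and this is not the code used to produce the files.

```python
import re

_PAIR = re.compile(r"([A-X])(\d+)")


def encode_day(hourly):
    """Turn a list of 24 hourly counts into e.g. '19B2D15G2'."""
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly counts")
    parts = [str(sum(hourly))]  # leading daily total
    for hour, count in enumerate(hourly):
        if count:  # hours with zero views are simply omitted
            parts.append(chr(ord("A") + hour) + str(count))
    return "".join(parts)


def decode_day(encoded):
    """Turn e.g. '19B2D15G2' into (total, list of 24 hourly counts)."""
    match = re.match(r"\d+", encoded)
    if match is None:
        raise ValueError("missing daily total")
    total = int(match.group())
    hourly = [0] * 24
    for letter, count in _PAIR.findall(encoded[match.end():]):
        hourly[ord(letter) - ord("A")] = int(count)
    # The total is redundant, so it doubles as a consistency check.
    if sum(hourly) != total:
        raise ValueError("hourly counts do not add up to the daily total")
    return total, hourly
```

For instance, decode_day("19B2D15G2") gives a total of 19 with 2, 15 and 2 views in the 2nd, 4th and 7th hour, and encode_day on that list gives back the same string.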
I have files from August 2008. Roughly 3 GB per month now. And yes, a more permanent, fail-safe and more accessible storage location would be great.
Erik Zachte
-----Original Message-----
From: Frédéric Schütz [mailto:schutz@mathgen.ch]
Sent: Thursday, September 17, 2009 22:34
To: toolserver-l@lists.wikimedia.org
Cc: wikitech-l@lists.wikimedia.org; Erik Zachte
Subject: Re: [Toolserver-l] Archive of visitor stats
Lars Aronsson wrote:
Are visitor stats (as produced by Domas) safely archived somewhere, for example on the toolserver, where development projects can easily access them for analysis? I have made my own copies of the files (I guess my plan was to use them, but this hasn't started yet), but now I'm running out of disk and I urgently need to clear some space on that server.
I just deleted September 2009 (last 2 weeks) and that freed 9 GB.
The oldest I have is pagecounts-20071209-180000.gz
As Platonides mentioned, they are in /mnt/user-store/stats on the toolserver; however, I would not call that "safely archived": one of my cron jobs just copies them from Domas' server, and that's it.
At the moment, there should be everything starting from 1 January 2009 (although part of it disappeared at some point, but I managed to recover it).
However, this is definitely not a sustainable solution in the long run: the files currently take 335 GB (out of 1.5 TB of total space).
Erik Zachte stores archives of visitor stats in a better format, aggregating some of the older data and storing several days of data in one file. I started looking into these files earlier this year, planning to spend some time playing with this data. One of my ideas was to replicate the statistical data that is on the WMF stats server somewhere on the toolserver -- and do it "officially" and not just by copying files using a personal cron job. Unfortunately, "real life" took over and I did not manage to continue this (and still can't). However, if there is any interest in improving the situation, I'd be glad to look into it as soon as I can.
I am cc'ing Erik, who may have more to say.
Cheers,
Frédéric