DaB. wrote:
I am following up on a discussion here in October:
I run a script that downloads 1200 MB every night
if you do this, please save the data at
/mnt/user-store/
(create a directory there). That way every user can use the data and it has to be downloaded only one time.
Since I am becoming involved with statistics too, I have set up such a scheme in /mnt/user-store/stats. Data files starting from 1 October 2008 are currently available (emijrp asked if I could get older files too, which should be doable, but I haven't looked into it yet). I still have to fine-tune the update process, but basically a cron task will take care of it at least once a day (probably more often, but I have to check when the original files are actually updated).
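For reference, such an update can be wired up with a crontab entry along these lines; the script path, schedule, and log location below are hypothetical, not the actual setup:

```shell
# Hypothetical crontab entry: run the update script once a day at 04:30
# and append its output to a log file. Paths are illustrative only.
30 4 * * * $HOME/bin/update-stats.sh >> $HOME/logs/update-stats.log 2>&1
```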
Let me know if anyone else is interested in using this data.
Perhaps there is a better way (rsync or something) to get the data from the source.
I use wget; it will not download a file twice unless it has been modified (which should not happen). Also, the files are already gzipped, so additional compression would not be of much use here. Even though rsync is a better solution on paper, all in all I don't think it would improve the situation much here.
Currently, the directory contains 112 GB, growing by about 1.2 GB every day. So far this is not a problem (2.5 TB are currently available in user-store), but I'd like to know at what point it would start to be considered "too big". What do the admins think?
On the main statistics server of the WMF, Erik Zachte is developing scripts to compact these individual hourly files into daily files, halving the size of the data; this could also be used here.
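I don't know the details of Erik Zachte's scripts or the exact file format, but the general idea of folding hourly per-page counts into one daily file can be sketched as follows; the file names, the "page count" line format, and the awk one-liner are my own illustration, not his implementation:

```shell
#!/bin/sh
# Hypothetical sketch: merge hourly "<page> <count>" files into one daily
# file by summing the counts per page. Names and format are illustrative,
# not the actual WMF data layout.
rm -rf /tmp/stats-demo && mkdir -p /tmp/stats-demo && cd /tmp/stats-demo
printf 'Main_Page 10\nFoo 3\n' | gzip > hour-00.gz
printf 'Main_Page 7\nBar 1\n'  | gzip > hour-01.gz
gunzip -c hour-00.gz hour-01.gz \
  | awk '{ sum[$1] += $2 } END { for (p in sum) print p, sum[p] }' \
  | sort | gzip > daily.gz
gunzip -c daily.gz   # prints: Bar 1 / Foo 3 / Main_Page 17
```

Entries that appear in several hourly files collapse into a single line, which is where the size reduction comes from.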
Frédéric