I am following up on a discussion here in October:
> I run a script that downloads 1200 MB every night. If you do this,
> please save the data at (create a directory there), so every user can
> use the data and it only has to be downloaded one time.
Since I am becoming involved with statistics too, I have set up such a
scheme in /mnt/user-store/stats. Data files starting from 1 October
2008 are currently available (emijrp asked if I could get older files
too, which should be doable, but I haven't looked into it yet). I still
have to fine-tune the update process, but basically a cron task will
take care of this at least once a day (probably more often, but I have
to see when the original files are actually updated).
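For anyone who wants to reproduce the setup elsewhere, it boils down to
something like this (a minimal sketch; the source URL and paths are
placeholders, not necessarily the ones I use):

  #!/bin/sh
  # update-stats.sh -- mirror the hourly dump files into the shared
  # directory; wget's -N (timestamping) skips files whose remote
  # timestamps have not changed
  wget -q -N -r -np -nd -P /mnt/user-store/stats \
      http://example.org/wikistats/

with a crontab entry such as:

  # fetch new files once a day, at 04:00
  0 4 * * * /home/stats/update-stats.sh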
Let me know if anyone else is interested in using this data.
> Perhaps there is a better way (rsync or something) to get the data
> from the
I use wget; it will not download files twice unless they have been
modified (which should not happen). Also, the files are already gzipped, so
compression would not be of much use here. Even though rsync is a better
solution on paper, all in all, I don't think it would improve the
situation much here.
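For comparison, the rsync version would look something like this
(assuming the upstream offered an rsync module, which I haven't
checked; the URL is hypothetical):

  # -t preserves timestamps so unchanged files are skipped; -z
  # compression would gain almost nothing on already-gzipped files
  rsync -rt rsync://example.org/wikistats/ /mnt/user-store/stats/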
Currently, the directory contains 112 GB, growing by about 1.2 GB
every day. So far, this is not a problem (2.5 TB are currently
available in user-store), but I'd like to know when it would start to
be considered "too big". What do the admins think?
On the main statistics server of the WMF, Erik Zachte is developing
scripts to compact these individual hourly files into daily files,
reducing the size of the data by a factor of two; this could also be
used here.
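Assuming the hourly files are the usual one-line-per-page counter
format ("project title views bytes"; that format and the file names
below are my assumptions, not confirmed), the compaction is essentially
a sum per key:

  # merge 24 hourly files into one daily file by summing the per-page
  # view and byte counters
  zcat pagecounts-20081001-*.gz \
    | awk '{ views[$1" "$2] += $3; bytes[$1" "$2] += $4 }
           END { for (k in views) print k, views[k], bytes[k] }' \
    | sort | gzip > pagecounts-20081001-daily.gz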