[Toolserver-l] Compressing stats files better (was: Re: /mnt/user-store is full)

Frederic Schutz schutz at mathgen.ch
Mon Jan 3 16:08:40 UTC 2011


emijrp wrote:

> Hi Frederic, thanks for your work. Have you tested 7z?

It makes no difference to me. River suggested (and installed) xz, so I 
used it, but 7z would have worked too.

A quick test using my biased data for one day (but it should be 
representative enough):

$ du -s *
1027260	7z         1004 M, 25.27% saved
1374804	gz          1.4 G,     0% saved
1020692	xz          997 M, 25.75% saved
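
For anyone who wants to redo the arithmetic, here is a small Python sketch that recomputes the percentages from the du figures above. It assumes du reported 1 KiB blocks (its usual default); the name percent_saved is just an illustrative helper, and the last digit may differ from my figures depending on rounding versus truncation.

```python
# Sizes as reported by `du -s` above (assumed to be 1 KiB blocks).
sizes_kib = {"7z": 1027260, "gz": 1374804, "xz": 1020692}


def percent_saved(size, baseline):
    """Return the space saved relative to the baseline, in percent."""
    return 100.0 * (1.0 - size / baseline)


# gzip is the current format, so it serves as the baseline.
for name, size in sorted(sizes_kib.items()):
    saved = percent_saved(size, sizes_kib["gz"])
    print(f"{name}: {size / 1024:.0f} MiB, {saved:.2f}% saved")
```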

The difference between xz and 7z is negligible (<1%). I haven't 
benchmarked anything formally, but 7z was much faster on my system. It 
looks like this is mainly because the software can use several cores 
simultaneously.

> We can compress to xz while the new disks arrive. I read that it is 
> about 24 TB, so, we can revert to gzip in the future.

Is there any particular reason to use gzip? When I use these files, I
mostly uncompress them on the fly from Perl, and there is a module to do
this with xz too (haven't tested it, though). I am sure Python and other
languages can do the same.
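
As a rough illustration of on-the-fly decompression on the Python side: current Python 3 ships an lzma module in the standard library that reads .xz files as a stream. The sketch below is not something I have run against the real stats files; the function name iter_lines and the UTF-8 assumption are mine.

```python
import lzma


def iter_lines(path):
    """Yield text lines from an .xz file without unpacking it to disk."""
    # errors="replace" is a defensive assumption in case some lines
    # are not valid UTF-8.
    with lzma.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Because this is a generator, only a small part of the file is held in memory at a time, so it can feed a filter or a counter directly.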

Even if we have plenty of space, it makes sense to use xz (or another 
format that offers good compression) and to benefit from the size 
reduction, for example if/when these files are backed up or moved around.
Also, I'd like to be able to provide the files for download for those 
people who want local copies [several academic groups have already 
requested them], and the 25% size reduction is a big bonus here too.

But as I wrote earlier, these files are mostly archived on the
toolserver, and I assume that most users don't often dig through the
older ones, so using the strongest compression should not be a problem.

A better file format (e.g. one file per day, with separate data for 24 
hours, and another file with data aggregated per day) is probably what 
is most needed for "real uses" -- as far as I know, this is how Erik 
Zachte handles this data. A database would be best, of course, but
requires much more work...
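
To make the per-day idea a bit more concrete, here is a minimal Python sketch of the merge step. It assumes each hourly file has already been parsed into a mapping from page name to hit count; that representation and the name merge_day are hypothetical, and I do not know the details of how Erik's files are actually laid out.

```python
from collections import defaultdict


def merge_day(hourly):
    """Combine up to 24 hourly {page: count} mappings for one day.

    Returns (per_hour, totals): per_hour maps each page to a list of
    24 hourly counts, and totals maps each page to its daily sum.
    """
    per_hour = defaultdict(lambda: [0] * 24)
    for hour, counts in enumerate(hourly):
        for page, count in counts.items():
            per_hour[page][hour] += count
    totals = {page: sum(hours) for page, hours in per_hour.items()}
    return dict(per_hour), totals
```

The first result would go into the per-day file with separate data for the 24 hours, the second into the aggregated file.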

As always, comments are very welcome.

Frédéric


