Hi Frederic, thanks for your work. Have you tested 7z?
It makes no difference to me. River suggested (and installed) xz, so I
used it, but 7z would have worked too.
A quick test using my biased data for one day (though it should be
representative):
$ du -s *
1027260 7z 1004 M, 25.27% saved
1374804 gz 1.4 G, 0% saved
1020692 xz 997 M, 25.75% saved
The difference between xz and 7z is negligible (<1%). I haven't
benchmarked anything formally, but 7z was much faster on my system; it
looks like this is mainly because it can use several cores.
We can compress to xz while the new disks arrive. I read that the new
disks total about 24 TB, so we could revert to gzip in the future.
Is there any particular reason to use gzip? When I use these files, I
mostly decompress them on the fly from Perl, and there is a module to
do this with xz too (I haven't tested it, though). I am sure Python and
other languages can do the same.
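For what it's worth, here is a minimal sketch of the on-the-fly idea
from Python, using the standard-library lzma module (the file name and
contents are invented for illustration; the Perl module would work
similarly):

```python
# Sketch: stream an xz-compressed dump line by line without
# writing a decompressed copy to disk. Sample data is made up.
import lzma

# Create a small sample .xz file so the sketch is self-contained.
sample = b"en Main_Page 42\nen Foo 7\n"
with lzma.open("sample.xz", "wb") as f:
    f.write(sample)

# Read it back on the fly, in text mode.
with lzma.open("sample.xz", "rt") as f:
    for line in f:
        project, page, count = line.split()
        print(project, page, count)
```

The point is that downstream scripts only see plain lines; which
compressor produced the file becomes an implementation detail.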
Even if we have plenty of space, it makes sense to use xz (or another
format that offers good compression) and benefit from the size
reduction, for example if/when these files are backed up or moved
around.
Also, I'd like to be able to provide the files for download for those
people who want local copies [several academic groups have already
requested them], and the 25% size reduction is a big bonus here too.
But as I wrote earlier, these files are mostly archived on the
toolserver, and I assume that most users don't often dig through the
older ones, so the higher compression should not be a problem.
A better file format (e.g. one file per day, with separate data for
each of the 24 hours, and another file with data aggregated per day) is
probably what is most needed for "real uses" -- as far as I know, this
is how Erik Zachte handles this data. A database would be best, of
course, but that requires much more work...
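To make the per-day aggregation idea concrete, here is a rough Python
sketch. It assumes each hourly chunk holds "project page count" lines,
which is only my guess at the layout; the in-memory lists stand in for
the real hourly files:

```python
# Sketch: sum hourly per-page counts into one daily aggregate.
# The "project page count" line format is an assumption, and the
# lists below stand in for the 24 hourly files of one day.
from collections import Counter

hourly_files = [
    ["en Main_Page 40", "en Foo 2"],  # hour 0
    ["en Main_Page 2", "en Bar 5"],   # hour 1
    # ... hours 2-23 ...
]

daily = Counter()
for hour in hourly_files:
    for line in hour:
        project, page, count = line.split()
        daily[(project, page)] += int(count)

# The daily file would then hold one summed line per page.
for (project, page), count in sorted(daily.items()):
    print(project, page, count)
```

A real implementation would stream the compressed hourly files instead
of holding everything in memory, but the aggregation step is the same.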
As always, comments are very welcome.