emijrp wrote:
Hi Frederic, thanks for your work. Have you tested 7z?
It makes no difference to me. River suggested (and installed) xz, so I used it, but 7z would have worked too.
A quick test using my biased data for one day (but it should be representative enough):
$ du -s *
1027260   7z     (1004 M, 25.27% saved)
1374804   gz     (1.4 G,   0% saved)
1020692   xz     (997 M,  25.75% saved)
The difference between xz and 7z is negligible (<1%). I haven't benchmarked anything formally, but 7z was much faster on my system, apparently mainly because it can use several cores simultaneously.
We can compress to xz while the new disks arrive. I read that it is about 24 TB, so we can revert to gzip in the future.
Is there any particular reason to use gzip? When I use these files, I mostly uncompress them on the fly from Perl, and there is a module to do this with xz too (haven't tested it, though). I am sure Python and other languages can do the same.
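For what it's worth, here is a rough Python sketch of that on-the-fly approach (the file name is just a placeholder, not one of the actual dump names); the standard gzip and lzma modules can both open compressed files transparently, so the same loop works for the old .gz files and for .xz ones:

    import gzip
    import lzma

    # Placeholder path; point this at a real dump file.
    path = "pagecounts-sample.xz"

    # Pick the opener from the extension, so old .gz and new .xz
    # files can be read with the same code.
    opener = lzma.open if path.endswith(".xz") else gzip.open

    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # process each uncompressed line here
            print(line.rstrip("\n"))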
Even if we have plenty of space, it makes sense to use xz (or another format that offers good compression) and to benefit from the size reduction, for example if/when these files are backed up or moved around. Also, I'd like to be able to provide the files for download for those people who want local copies [several academic groups have already requested them], and the 25% size reduction is a big bonus there too.
But as I wrote earlier, these files are mostly archived on the toolserver, and I assume that most users don't dig through the older ones very often, so choosing the strongest compression should not be a problem.
A better file format (e.g. one file per day, with separate data for 24 hours, and another file with data aggregated per day) is probably what is most needed for "real uses" -- as far as I know, this is how Erik Zachte handles this data. A database would be best, of course, but requires much more work...
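Just to illustrate the per-day aggregation idea, here is a rough Python sketch. It assumes xz-compressed hourly files whose lines look like "project page_title view_count bytes"; the file names and the date below are only placeholders, not the real naming scheme:

    import glob
    import lzma
    from collections import defaultdict

    # Hypothetical layout: one compressed file per hour for a given day.
    hourly_files = sorted(glob.glob("pagecounts-20110401-*.xz"))

    daily_totals = defaultdict(int)

    for path in hourly_files:
        with lzma.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                fields = line.split()
                if len(fields) < 3:
                    continue  # skip malformed lines
                project, page, count = fields[0], fields[1], fields[2]
                daily_totals[(project, page)] += int(count)

    # Write the per-day aggregate next to the hourly data.
    with lzma.open("pagecounts-20110401-daily.xz", "wt", encoding="utf-8") as out:
        for (project, page), count in sorted(daily_totals.items()):
            out.write(f"{project} {page} {count}\n")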
As always, comments are very welcome.
Frédéric