Hi all!
I just finished a few very rough, unscientific comparisons of data sizes and Hive
query times between uncompressed and Snappy-compressed webrequest data stored in HDFS.
Check it!
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive
Quick summary: Snappy compresses webrequest JSON data to about 25% of original size. Query
times on small datasets (~1 hour) are doubled, but query times on larger datasets are
only slightly increased.
The Camus Snappy compression (which was merged upstream at LinkedIn this week) is working
great. I'll start using it exclusively for webrequest imports soon.
-Ao