Hi all!

I just finished doing a few very rough, unscientific comparisons of data sizes and Hive query times between uncompressed and Snappy-compressed webrequest data stored in HDFS.

Check it!

https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive

Quick summary: Snappy compresses webrequest JSON data to about 25% of its original size. Query times on small datasets (~1 hour) are doubled, but query times on larger datasets are only slightly increased.

The Camus Snappy compression (which was merged upstream at LinkedIn this week) is working great. I'll start using it exclusively for webrequest imports soon.

-Ao