Snappy compression and Hive JSON data - Analytics

8 Jan 2014

Hi all!

I just finished doing a few very rough, unscientific comparisons of data sizes and Hive
query times between uncompressed and Snappy-compressed webrequest data stored in HDFS.

Check it!
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive

Quick summary: Snappy compresses webrequest JSON data to about 25% of original size. Query
times on small datasets (~1 hour) are doubled, but query times on larger datasets are
only slightly increased.

The Camus Snappy compression (which was merged upstream at LinkedIn this week) is working
great.  I'll start using it exclusively for webrequest imports soon.

-Ao