As we've increased our use of sha1 hashes to identify unique content over
the past year, I occasionally see changesets or discussions about indexing
sha1's in mysql. When indexing a text field, it's generally beneficial to
define the smallest index that still uniquely matches a high percentage of
rows. Search and insert performance both benefit from the space savings.
As a cryptographic hash function, sha1 has a very high degree of
uniformity. We can estimate the percent of partial index look-ups that
will match a unique result just by comparing the size of the table to the
space covered by the index.
sha1 hashes are 160bits, which mediawiki stores in mysql with base36
encoding. base36(2^160) == "twj4yidkw7a8pn4g709kzmfoaol3x8g". Looking at
enwiki.revision.rev_sha1, the smallest current value is
000002xi72hkkhn1nvfdeffgp7e1w3s and the largest,
twj4yi9tgesxysgyi41bz16jdkwroha.
The number of combinations covered by indexing the top bits represented by
the left-most 4 thru 10 characters:
sha1_index(4) = 1395184 (twj4)
sha1_index(5) = 50226658 (twj4y)
sha1_index(6) = 1808159706 (twj4yi)
sha1_index(7) = 65093749429 (twj4yid)
sha1_index(8) = 2343374979464 (twj4yidk)
sha1_index(9) = 84361499260736 (twj4yidkw)
sha1_index(10) = 3037013973386503 (twj4yidkw7)
percentage of unique matches in a table of 2B sha1's:
sha1_index(7) = 96.92%
sha1_index(8) = 99.91%
sha1_index(9) = 99.997%
sha1_index(10) = 99.9999%
percentage of unique matches in a table of 10B sha1's:
sha1_index(8) = 99.573%
sha1_index(9) = 99.988%
sha1_index(10) = 99.9996%
Given current table sizes and growth rates, an 8 character index on a sha1
column should be sufficient for years for many cases (i.e. media files
outside of commons, revisions on projects outside of the top 10), while a
10 character index still provides >99.99% coverage of 100 billion sha1's.
Caveat: The likely but rare worst case for a partial index is that we may
have tables with hundreds of rows containing the same sha1, perhaps
revisions of a page that had a crazy revert war. A lookup for that
specific sha1 will have to do secondary lookups for each match, as would
lookups of any other sha1 that happens to collide within the index space.
If the index is large enough to make the later case quite unlikely, prudent
use of caching can address the first.
<tl;dr> Where an index is desired on a mysql column of base36 encoded sha1
hashes, I recommend ADD INDEX (sha1column(10)). Shorter indexes will be
sufficient in many cases, but this is still provides a >2/3 space savings
while covering a huge (2^51.43) space.
-Asher