I see no problem with this. SHA-1 has such a strong avalanche effect that even the chance of having two similar hashes is pretty low.
*--* *Tyler Romeo* Stevens Institute of Technology, Class of 2015 Major in Computer Science www.whizkidztech.com | tylerromeo@gmail.com
On Tue, Sep 25, 2012 at 1:54 PM, Asher Feldman afeldman@wikimedia.orgwrote:
As we've increased our use of sha1 hashes to identify unique content over the past year, I occasionally see changesets or discussions about indexing sha1's in mysql. When indexing a text field, it's generally beneficial to define the smallest index that still uniquely matches a high percentage of rows. Search and insert performance both benefit from the space savings.
As a cryptographic hash function, sha1 has a very high degree of uniformity. We can estimate the percent of partial index look-ups that will match a unique result just by comparing the size of the table to the space covered by the index.
sha1 hashes are 160bits, which mediawiki stores in mysql with base36 encoding. base36(2^160) == "twj4yidkw7a8pn4g709kzmfoaol3x8g". Looking at enwiki.revision.rev_sha1, the smallest current value is 000002xi72hkkhn1nvfdeffgp7e1w3s and the largest, twj4yi9tgesxysgyi41bz16jdkwroha.
The number of combinations covered by indexing the top bits represented by the left-most 4 thru 10 characters:
sha1_index(4) = 1395184 (twj4) sha1_index(5) = 50226658 (twj4y) sha1_index(6) = 1808159706 (twj4yi) sha1_index(7) = 65093749429 (twj4yid) sha1_index(8) = 2343374979464 (twj4yidk) sha1_index(9) = 84361499260736 (twj4yidkw) sha1_index(10) = 3037013973386503 (twj4yidkw7)
percentage of unique matches in a table of 2B sha1's:
sha1_index(7) = 96.92% sha1_index(8) = 99.91% sha1_index(9) = 99.997% sha1_index(10) = 99.9999%
percentage of unique matches in a table of 10B sha1's:
sha1_index(8) = 99.573% sha1_index(9) = 99.988% sha1_index(10) = 99.9996%
Given current table sizes and growth rates, an 8 character index on a sha1 column should be sufficient for years for many cases (i.e. media files outside of commons, revisions on projects outside of the top 10), while a 10 character index still provides >99.99% coverage of 100 billion sha1's.
Caveat: The likely but rare worst case for a partial index is that we may have tables with hundreds of rows containing the same sha1, perhaps revisions of a page that had a crazy revert war. A lookup for that specific sha1 will have to do secondary lookups for each match, as would lookups of any other sha1 that happens to collide within the index space. If the index is large enough to make the later case quite unlikely, prudent use of caching can address the first.
<tl;dr> Where an index is desired on a mysql column of base36 encoded sha1 hashes, I recommend ADD INDEX (sha1column(10)). Shorter indexes will be sufficient in many cases, but this is still provides a >2/3 space savings while covering a huge (2^51.43) space.
-Asher _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l