As we've increased our use of sha1 hashes to identify unique content over the past year, I occasionally see changesets or discussions about indexing sha1s in mysql. When indexing a text field, it's generally beneficial to define the smallest index that still uniquely matches a high percentage of rows. Both search and insert performance benefit from the space savings.
As a cryptographic hash function, sha1 has a very high degree of uniformity. We can estimate the percentage of partial index look-ups that will match a unique result just by comparing the size of the table to the space covered by the index: with N rows spread uniformly over an index space of S distinct prefixes, roughly N/S of look-ups will collide with another row, so about 1 - N/S will match a unique one.
sha1 hashes are 160 bits, which mediawiki stores in mysql with base36 encoding. base36(2^160) == "twj4yidkw7a8pn4g709kzmfoaol3x8g". Looking at enwiki.revision.rev_sha1, the smallest current value is 000002xi72hkkhn1nvfdeffgp7e1w3s and the largest is twj4yi9tgesxysgyi41bz16jdkwroha.
The number of combinations covered by indexing the top bits represented by the left-most 4 through 10 characters:
sha1_index(4) = 1395184 (twj4)
sha1_index(5) = 50226658 (twj4y)
sha1_index(6) = 1808159706 (twj4yi)
sha1_index(7) = 65093749429 (twj4yid)
sha1_index(8) = 2343374979464 (twj4yidk)
sha1_index(9) = 84361499260736 (twj4yidkw)
sha1_index(10) = 3037013973386503 (twj4yidkw7)
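(For reference, these counts can be reproduced with a short Python sketch; this is an illustration added here, not how the figures were originally produced. A k-character prefix covers the full 2^160 space divided by the combinations of the remaining 31 - k base36 characters.)

    # Values covered by the left-most k characters of a 31-character
    # base36-encoded 160-bit hash.
    for k in range(4, 11):
        space = 2**160 // 36**(31 - k)
        print(f"sha1_index({k}) = {space}")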
percentage of unique matches in a table of 2B sha1s:
sha1_index(7) = 96.92%
sha1_index(8) = 99.91%
sha1_index(9) = 99.997%
sha1_index(10) = 99.9999%
percentage of unique matches in a table of 10B sha1s:
sha1_index(8) = 99.573%
sha1_index(9) = 99.988%
sha1_index(10) = 99.9996%
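(The 1 - N/S estimate itself, in the same illustrative sketch form; its output agrees with the tables above to within rounding.)

    # If n_rows hashes are spread uniformly over `space` prefixes, about
    # n_rows/space of look-ups collide, so 1 - n_rows/space are unique.
    def coverage_pct(n_rows, k):
        space = 2**160 // 36**(31 - k)
        return 100 * (1 - n_rows / space)

    for n_rows in (2 * 10**9, 10 * 10**9):
        for k in (7, 8, 9, 10):
            print(n_rows, k, round(coverage_pct(n_rows, k), 4))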
Given current table sizes and growth rates, an 8 character index on a sha1 column should be sufficient for years in many cases (e.g. media files outside of commons, revisions on projects outside of the top 10), while a 10 character index still provides >99.99% coverage of 100 billion sha1s.
Caveat: the rare but plausible worst case for a partial index is a table with hundreds of rows containing the same sha1, perhaps revisions of a page that had a crazy revert war. A lookup for that specific sha1 will have to do secondary lookups for each match, as would lookups of any other sha1 that happens to collide within the index space. If the index is large enough to make the latter case quite unlikely, prudent use of caching can address the former.
<tl;dr> Where an index is desired on a mysql column of base36 encoded sha1 hashes, I recommend ADD INDEX (sha1column(10)). Shorter indexes will be sufficient in many cases, but this still provides a >2/3 space savings while covering a huge (2^51.43) space.
-Asher
I see no problem with this. SHA-1 has such a strong avalanche effect that even the chance of having two similar hashes is pretty low.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
Isn't it better to store a BIGINT containing part of the (binary) sha1 and use an index on the numeric column?
AJF/WarX
It would be better, but I believe MediaWiki already uses this type of storage. Changing to binary would require a schema change.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
Base36 certainly isn't the most efficient way to store a sha1, but it's what is in use all over mediawiki. I think there was some discussion on this list of the tradeoffs of different methods when revision.rev_sha1 was added, and base36 was picked as a compromise. I don't know why base36 was picked over base62 once it was decided to stick with an ascii alphanumeric encoding, but regardless, there was opposition to binary. Taken on its own, an integer index would be more efficient, but I don't think it makes sense if we continue using base36.
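(For illustration only: a hypothetical Python sketch of the numeric-key approach suggested above, deriving a value sized for MySQL's BIGINT UNSIGNED from the top 64 bits of the raw digest. MediaWiki does not do this today.)

    # Hypothetical: store the top 64 bits of the raw sha1 as an
    # unsigned 64-bit integer and index that numeric column instead.
    import hashlib

    def sha1_bigint_prefix(data: bytes) -> int:
        digest = hashlib.sha1(data).digest()      # 20 raw bytes
        return int.from_bytes(digest[:8], "big")  # fits BIGINT UNSIGNED

    print(sha1_bigint_prefix(b"example revision text"))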
We started using base36 for storing deleted files. The advantage of base36 is that it's shorter than plain hex, yet it can still be stored, without collisions, on a case-insensitive filesystem.
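(A quick illustrative sketch of that size difference, not MediaWiki's actual encoder: a 160-bit sha1 needs 40 hex characters but at most 31 base36 characters.)

    import hashlib

    def to_base36(n: int) -> str:
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        out = ""
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out or "0"

    hex_digest = hashlib.sha1(b"some file contents").hexdigest()
    b36_digest = to_base36(int(hex_digest, 16))
    print(len(hex_digest), len(b36_digest))  # 40 vs at most 31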
Hello, does anyone here know why the parser insists on adding <p> tags to HTML results returned by parser functions?
I'm trying to upgrade the TreeAndMenu extension to work with MediaWiki 1.19 and I can't get the parser to stop adding <p> tags throughout the result returned by the parser-function expansion.
I've set isHTML = true and tried setting noparse to true and many other things, but whatever I do, even removing all whitespace, it still adds <p>'s!
Is it inserting extra linebreaks? I don't really know anything about what you're working with specifically, but general wikitext can be really thrown off by an unexpected linebreak.
I removed all line breaks from the text being returned, but the confusing thing is that the content being returned is marked as isHTML (not wikitext), so the parser should be leaving it alone completely and not checking for line breaks.
I remember that I had similar problems with the parser adding unwanted <p> tags under special circumstances when my extension adds HTML to the output. See https://bugzilla.wikimedia.org/show_bug.cgi?id=19226 ("Parser renders UI messages with up to one newline different from more...").
On 26/09/12 03:54, Asher Feldman wrote:
<tl;dr> Where an index is desired on a mysql column of base36 encoded sha1 hashes, I recommend ADD INDEX (sha1column(10)). [...]
We usually don't put update patches in MediaWiki for index size changes. You can change the indexes on existing Wikimedia wikis at your leisure, and we will change it in tables.sql for the benefit of new wikis.
-- Tim Starling
Done in https://gerrit.wikimedia.org/r/#/c/25218/