As we've increased our use of sha1 hashes to identify unique content over the past year, I occasionally see changesets or discussions about indexing sha1s in mysql. When indexing a text field, it's generally beneficial to define the smallest index that still uniquely matches a high percentage of rows. Both search and insert performance benefit from the space savings.
As a cryptographic hash function, sha1 has a very high degree of uniformity. We can estimate the percentage of partial index look-ups that will match a unique result just by comparing the size of the table to the space covered by the index: with N rows spread uniformly over an index space of S distinct prefixes, roughly N/S of look-ups will collide with another row, so about 1 - N/S will match a unique one.
sha1 hashes are 160 bits, which mediawiki stores in mysql with base36 encoding. base36(2^160) == "twj4yidkw7a8pn4g709kzmfoaol3x8g". Looking at enwiki.revision.rev_sha1, the smallest current value is 000002xi72hkkhn1nvfdeffgp7e1w3s and the largest is twj4yi9tgesxysgyi41bz16jdkwroha.
The number of combinations covered by indexing the top bits represented by the left-most 4 through 10 characters:
sha1_index(4) = 1395184 (twj4)
sha1_index(5) = 50226658 (twj4y)
sha1_index(6) = 1808159706 (twj4yi)
sha1_index(7) = 65093749429 (twj4yid)
sha1_index(8) = 2343374979464 (twj4yidk)
sha1_index(9) = 84361499260736 (twj4yidkw)
sha1_index(10) = 3037013973386503 (twj4yidkw7)
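(For reference, these counts can be reproduced with a short Python sketch; this is an illustration added here, not how the figures were originally produced. A k-character prefix covers the full 2^160 space divided by the combinations of the remaining 31 - k base36 characters.)

    # Values covered by the left-most k characters of a 31-character
    # base36-encoded 160-bit hash.
    for k in range(4, 11):
        space = 2**160 // 36**(31 - k)
        print(f"sha1_index({k}) = {space}")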
percentage of unique matches in a table of 2B sha1s:
sha1_index(7) = 96.92%
sha1_index(8) = 99.91%
sha1_index(9) = 99.997%
sha1_index(10) = 99.9999%
percentage of unique matches in a table of 10B sha1s:
sha1_index(8) = 99.573%
sha1_index(9) = 99.988%
sha1_index(10) = 99.9996%
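(The 1 - N/S estimate itself, in the same illustrative sketch form; its output agrees with the tables above to within rounding.)

    # If n_rows hashes are spread uniformly over `space` prefixes, about
    # n_rows/space of look-ups collide, so 1 - n_rows/space are unique.
    def coverage_pct(n_rows, k):
        space = 2**160 // 36**(31 - k)
        return 100 * (1 - n_rows / space)

    for n_rows in (2 * 10**9, 10 * 10**9):
        for k in (7, 8, 9, 10):
            print(n_rows, k, round(coverage_pct(n_rows, k), 4))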
Given current table sizes and growth rates, an 8 character index on a sha1 column should be sufficient for years in many cases (e.g. media files outside of commons, revisions on projects outside of the top 10), while a 10 character index still provides >99.99% coverage of 100 billion sha1s.
Caveat: the rare but plausible worst case for a partial index is a table with hundreds of rows containing the same sha1, perhaps revisions of a page that had a crazy revert war. A lookup for that specific sha1 will have to do secondary lookups for each match, as would lookups of any other sha1 that happens to collide within the index space. If the index is large enough to make the latter case quite unlikely, prudent use of caching can address the former.
<tl;dr> Where an index is desired on a mysql column of base36 encoded sha1 hashes, I recommend ADD INDEX (sha1column(10)). Shorter indexes will be sufficient in many cases, but this still provides a >2/3 space savings while covering a huge (2^51.43) space.
-Asher
I see no problem with this. SHA-1 has such a strong avalanche effect that even the chance of having two similar hashes is pretty low.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
Isn't it better to store a BIGINT containing part of the (binary) sha1 and use an index on the numeric column?
AJF/WarX
It would be better, but I believe MediaWiki already uses this type of storage. Changing to binary would require a schema change.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
Base36 certainly isn't the most efficient way to store a sha1, but it's what is in use all over mediawiki. I think there was some discussion on this list of the tradeoffs of different methods when revision.rev_sha1 was added, and base36 was picked as a compromise. I don't know why base36 was picked over base62 once it was decided to stick with an ascii alphanumeric encoding, but regardless, there was opposition to binary. Taken on its own, an integer index would be more efficient, but I don't think it makes sense if we continue using base36.
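(For illustration only: a hypothetical Python sketch of the numeric-key approach suggested above, deriving a value sized for MySQL's BIGINT UNSIGNED from the top 64 bits of the raw digest. MediaWiki does not do this today.)

    # Hypothetical: store the top 64 bits of the raw sha1 as an
    # unsigned 64-bit integer and index that numeric column instead.
    import hashlib

    def sha1_bigint_prefix(data: bytes) -> int:
        digest = hashlib.sha1(data).digest()      # 20 raw bytes
        return int.from_bytes(digest[:8], "big")  # fits BIGINT UNSIGNED

    print(sha1_bigint_prefix(b"example revision text"))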
We started using base36 for storing deleted files. The advantage of base36 is that it's shorter than plain hex, yet it can still be stored, without collisions, on a case-insensitive filesystem.
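(A quick illustrative sketch of that size difference, not MediaWiki's actual encoder: a 160-bit sha1 needs 40 hex characters but at most 31 base36 characters.)

    import hashlib

    def to_base36(n: int) -> str:
        digits = "0123456789abcdefghijklmnopqrstuvwxyz"
        out = ""
        while n:
            n, r = divmod(n, 36)
            out = digits[r] + out
        return out or "0"

    hex_digest = hashlib.sha1(b"some file contents").hexdigest()
    b36_digest = to_base36(int(hex_digest, 16))
    print(len(hex_digest), len(b36_digest))  # 40 vs at most 31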
Hello, does anyone here know why the parser insists on adding <p> tags to HTML results returned by parser functions?
I'm trying to upgrade the TreeAndMenu extension to work with MediaWiki 1.19 and I can't get the parser to stop adding <p> tags throughout the result returned by the parser-function expansion.
I've set isHTML = true and tried setting noparse to true and many other things, but whatever I do, even removing all whitespace, it still adds <p>'s!
Is it inserting extra linebreaks? I don't really know anything about what you're working with specifically, but general wikitext can be really thrown off by an unexpected linebreak.
I removed all line breaks from the text being returned, but the confusing thing is that the content being returned is marked as isHTML (not wikitext), so the parser should be leaving it alone completely and not checking for line breaks.
I remember that I had similar problems with the parser adding unwanted <p> tags under special circumstances when my extension adds HTML to the output. See https://bugzilla.wikimedia.org/show_bug.cgi?id=19226 ("Parser renders UI messages with up to one newline different from more...").
On 26/09/12 03:54, Asher Feldman wrote:
<tl;dr> Where an index is desired on a mysql column of base36 encoded sha1 hashes, I recommend ADD INDEX (sha1column(10)). [...]
We usually don't put update patches in MediaWiki for index size changes. You can change the indexes on existing Wikimedia wikis at your leisure, and we will change it in tables.sql for the benefit of new wikis.
-- Tim Starling
Done in https://gerrit.wikimedia.org/r/#/c/25218/