Hi! I am starting this thread because Brion's revision r94541 [1] reverted r94289 [0], stating "core schema change with no discussion". Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash column (either MD5 or SHA-1) in the revision table.
The primary use case of this column will be to assist in detecting reverts; I don't think that data integrity is the primary reason for adding it. The huge advantage of having such a column is that it will no longer be necessary to analyze full dumps to detect reverts: instead, you can look for reverts in the stub dump file by looking for the same hash occurring more than once within a single page. The fact that there is a theoretical chance of a collision is not very important IMHO; it would just mean that in very rare cases in our research we would flag an edit as reverted while it's not.
The two bug reports contain quite long discussions, and this feature has also been discussed internally quite extensively, but oddly enough that discussion hasn't happened yet on the mailing list.
So let's have a discussion!
[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
[3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
Best,
Diederik
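(To make the stub-dump use case concrete, here is a minimal sketch in PHP of identity-revert detection. The function name and the input shape, a chronologically ordered list of rev_id/hash pairs for one page, are purely illustrative; nothing like this exists in MediaWiki itself.)

<?php
// Sketch only: flag identity reverts within one page's history, given its
// revisions in chronological order as array( 'rev_id' => ..., 'sha1' => ... ).
function findIdentityReverts( array $revisions ) {
	$seenAt = array();   // hash => index of the earliest revision with that hash
	$reverts = array();  // rev_id of the reverting edit => list of reverted rev_ids
	foreach ( $revisions as $i => $rev ) {
		$hash = $rev['sha1'];
		if ( isset( $seenAt[$hash] ) ) {
			$undone = array();
			for ( $j = $seenAt[$hash] + 1; $j < $i; $j++ ) {
				$undone[] = $revisions[$j]['rev_id'];
			}
			if ( $undone ) {
				$reverts[$rev['rev_id']] = $undone;
			}
		} else {
			$seenAt[$hash] = $i;
		}
	}
	return $reverts;
}
// e.g. $reverts = findIdentityReverts( $pageRevisions );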
On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere dvanliere@gmail.com wrote:
Hi! I am starting this thread because Brion's revision r94541 [1] reverted r94289 [0], stating "core schema change with no discussion".
Bumping this: What are the remaining open questions regarding this schema change? As Diederik explains, this has a number of potential benefits especially for research/analysis, and I think it makes sense to target it for 1.19 or 1.20 unless there are scary implications that haven't been surfaced yet. :-)
Would it be possible to generate offline hashes for the bulk of our revision corpus via dumps and load that into prod to minimize the time and impact of the backfill?
When using for analysis, will we wish the new columns had partial indexes (first 6 characters?)
Is code written to populate rev_sha1 on each new edit?
Bug 2939 is one relevant bug for this; it could probably use an index.
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using for analysis, will we wish the new columns had partial indexes (first 6 characters?)
On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using for analysis, will we wish the new columns had partial indexes (first 6 characters?)
Bug 2939 is one relevant bug to this, it could probably use an index. [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
I generally suspect that a standard index is going to be a waste for the most urgent uses of this. It will rarely be interesting to search for common hashes between articles. The far more common case will be to search for duplicate hashes within the history of a single article. My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes.
This is the first I've heard of partial indexes (/me researches). I don't know if a partial index is going to be cheap enough that we can use it, and useful enough that we'd want to. Would this be a faster query in a world with a partial index on the first six characters?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'
...or would we have to run a query like this to get the benefit of the index?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1 LIKE '4cdbd8%'
...and would either of these queries be considered too expensive to run without a partial index? How about with a partial index?
Rob
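(An illustrative aside, not anyone's committed design: as far as I understand MySQL, a prefix index such as (rev_page, rev_sha1(6)) is also used for plain equality comparisons on the full value, so the first form of the query should benefit without being rewritten as a LIKE. Through MediaWiki's database layer the lookup might look roughly like this; the rev_sha1 column and the prefix index are still hypothetical at this point in the thread.)

<?php
// Sketch only: exact-match lookup that a hypothetical prefix index on
// (rev_page, rev_sha1(6)) could serve. MySQL can use a prefix index to
// narrow candidates for an equality predicate on the full column value.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
	'revision',
	'rev_id',
	array(
		'rev_page' => 12345,
		'rev_sha1' => '4cdbd80be15fcfff139fb8a95f2ca359520939ee',
	),
	__METHOD__
);
foreach ( $res as $row ) {
	// Each $row->rev_id shares the same content hash on this page.
}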
On 11-09-02 09:33 PM, Rob Lanphier wrote:
My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes. ... Rob
We've got 5 normal indexes on revision:
- A unique int+int
- A binary(14)
- An int+binary(14)
- Another int+binary(14)
- And a varchar(255)+binary(14)
For that bug, a (rev_page, rev_sha1) or (rev_page, rev_timestamp, rev_sha1) index may do.
Hi,
I've suggested generating bulk checksums as well, but both Brion and Ariel see the primary purpose of this field as checking the validity of the dump-generating process, and so they want to generate the checksums straight from external storage.
In a general sense, there are two use cases for this new field:
1) Checking the validity of the XML dump files
2) Identifying reverts
I have started to work on a proposal for deployment; while it is still incomplete, it might be a good starting point for further planning. I have been trying to come up with some back-of-the-envelope calculations about how much time and space this would take, but I don't yet have all the information required to produce reasonable estimates.
You can find the proposal here: http://strategy.wikimedia.org/wiki/Proposal:Implement_and_deploy_checksum_re...
I want to thank Brion and Asher for giving feedback on prior drafts. Please feel free to improve this proposal.
Best, Diederik
PS: not sure if this proposal should be on strategy or mediawiki...
Diederik van Liere wrote:
I've suggested to generate bulk checksums as well but both Brion and Ariel see the primary purpose of this field to check the validity of the dump generating process and so they want to generate the checksums straight from the external storage.
[...]
PS: not sure if this proposal should be on strategy or mediawiki...
I think standard practice nowadays is a subpage of http://www.mediawiki.org/wiki/Requests_for_comment.
MZMcBride
2011/9/4 MZMcBride z@mzmcbride.com
I think standard practice nowadays is a subpage of http://www.mediawiki.org/wiki/Requests_for_comment.
Indeed. Moved: http://mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_o...
– Krinkle
Thanks for moving the page. Diederik
RE: http://www.mediawiki.org/wiki/Requests_for_comment/Database_field_for_checks...
Recently, adding an MD5/SHA-1 column to the revision table (discussing r94289) was discussed.
For some applications, I use the technique of representing the 128 bits of MD5 or other checksums as base-62 character strings instead of hexadecimal (base-16) strings.
When you need a non-binary, i.e. character, representation for displaying, storing or transmitting checksums, you should always consider using base-62 instead of base-16 representations (for MD5 you need 22 bytes instead of 32 bytes).
When you encode the 128 bits of MD5 (for example) in base-16 aka /hexadecimal/, you need CHAR(32). When you use the technique of encoding the 128 bits in /base-62/ with characters from [A-Za-z0-9], you'll need CHAR(22).
See http://de.php.net/manual/en/function.md5.php#83321 for one possible implementation.
I first came across this, for example, in Microsoft's free FCIV program, which creates checksum XML files and uses this shorter representation.
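(For concreteness, one minimal way to get such a base-62 string in PHP, assuming the GMP extension is available; this is only a sketch, not the implementation from the php.net comment linked above.)

<?php
// Sketch: encode a 128-bit MD5 (hex) as base-62 using GMP.
// gmp_strval() accepts bases 2..62, using the digits 0-9, A-Z, a-z.
$text = "example revision text";
$hex = md5( $text );                                // 32 hex characters
$base62 = gmp_strval( gmp_init( $hex, 16 ), 62 );   // at most 22 characters
echo strlen( $hex ) . " vs " . strlen( $base62 ) . "\n";  // 32 vs <= 22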
On Fri, Sep 16, 2011 at 8:15 AM, Thomas Gries mail@tgries.de wrote:
For some applications, I use the technique of representing the 128 bits of MD5 or other checksums as base-62 character strings instead of hexadecimal (base-16) strings.
MediaWiki already uses a similar technique, storing SHA-1 hashes of images in base 36.
Roan
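(For illustration: the base-36 form can be produced with MediaWiki's own arbitrary-precision converter, wfBaseConvert(), rather than PHP's base_convert(). The padding width of 31 used below, enough for any 160-bit SHA-1 value, is an assumption here about how img_sha1 is stored and is not confirmed in this thread.)

<?php
// Sketch: base-36 form of a SHA-1, zero-padded to 31 digits
// (31 = ceil(160 / log2(36)), so every 160-bit value fits).
// wfBaseConvert() is MediaWiki's arbitrary-precision base converter.
$data = file_get_contents( '/path/to/some/file' );   // placeholder path
$sha1Base36 = wfBaseConvert( sha1( $data ), 16, 36, 31 );
echo strlen( $sha1Base36 ) . "\n";   // 31, vs 40 for the hex form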
On 16.09.2011 at 11:24, Roan Kattouw wrote:
For some applications, I use the technique of representing the 128 bits of MD5 or other checksums as base-62 character strings instead of hexadecimal (base-16) strings.
MediaWiki already uses a similar technique, storing SHA-1 hashes of images in base 36.
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
base 16 = CHAR(32) (md5 sum in hexadecimal)
base 36 = CHAR(25)
base 62 = CHAR(22)
Using base 62 (upper- and lower-case ASCII letters and digits) would save 3 bytes over base 36, if my calculator works correctly ;-)
On 9/16/11 9:48 AM, Thomas Gries wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
11M images in commons x 3 bytes = 33MB / ~1MB per file on average = a savings equivalent to 33 more files in Commons
In the time it took you to write this we probably had more than 33 files uploaded.
On 11-09-16 09:48 AM, Thomas Gries wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
base 16 = CHAR(32) (md5 sum in hexadecimal)
base 36 = CHAR(25)
base 62 = CHAR(22)
Using base 62 (upper- and lower-case ASCII letters and digits) would save 3 bytes over base 36, if my calculator works correctly ;-)
If we're picking apart checksum sizes in database storage, why not just go all the way and store the binary data as binary with a BINARY(16)? 16, 36, 62... what happened to base 64 anyway? PHP even has native code for that.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
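(For comparison, the raw-binary and base-64 forms are each a single call in stock PHP; whether BINARY(16)/BINARY(20) columns would play nicely with the rest of the schema and with the XML dumps is a separate question. A small sketch:)

<?php
// Sketch: the same MD5 in three shapes.
$text = "example revision text";
$hex    = md5( $text );              // CHAR(32)
$binary = md5( $text, true );        // BINARY(16), raw bytes
$b64    = base64_encode( $binary );  // CHAR(24), including '==' padding
// For SHA-1 the equivalents are sha1($text), sha1($text, true) -> BINARY(20),
// and base64_encode(...) -> CHAR(28).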
On Fri, Sep 16, 2011 at 9:48 AM, Thomas Gries mail@tgries.de wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
This format was chosen for hashes to be used as filenames for uploaded file storage (currently used only for storing deleted files, I think, but there has long been a plan to switch primary image storage to this as well some day).
For greatest compatibility with all filesystems, we only use characters that are safe (ASCII digits and letters) and don't rely on case distinctions which are not always preserved (Windows and Mac OS X systems default to case-insensitive filesystems).
The reason we're not using hex here is that a more compact representation makes the filenames, and thus any URL references including them in the path, shorter. On img_sha1 I guess we just kept using it for compatibility/similarity with the deleted file archives?
-- brion
On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries mail@tgries.de wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
I don't know, this was way, way before my time. But then, why use base 62 if you can use base 64? Encoders/decoders for that are much more widely available. PHP's base_convert() will go up to 36, I think, so you'd have to write a base 62 encoder/decoder yourself, but PHP has built-in functions to deal with base 64.
base 16 = CHAR(32) (md5 sum in hexadecimal)
base 36 = CHAR(25)
base 62 = CHAR(22)
Minor detail: I think it's more likely we'll use SHA-1 hashes rather than MD5 hashes.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries mail@tgries.de wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
I don't know, this was way, way before my time. But then, why use base 62 if you can use base 64? Encoders/decoders for that are much more widely available. PHP's base_convert() will go up to 36, I think, so you'd have to write a base 62 encoder/decoder yourself, but PHP has built-in functions to deal with base 64.
Brion explained it in a previous mail. And yes, we have our own converter in MediaWiki.
On Sat, Sep 17, 2011 at 8:26 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
Minor detail: I think it's more likely we'll use SHA-1 hashes rather than MD5 hashes.
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with considerable effort), but I wouldn't see why that would matter for the proposed use.
With only about 1 billion revisions in the collective databases, the odds of an accidental collision with either MD5 or SHA-1 are infinitesimal (less than 1 in 10^18 for the weaker MD5).
MD5 is shorter and in my experience about 25% faster to compute.
Personally I've tended to view MD5 as more than good enough in offline analyses.
-Robert Rohde
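(A back-of-the-envelope check of that figure using the standard birthday approximation, nothing more:)

<?php
// Sketch: birthday-bound estimate for n random k-bit hashes,
// P(collision) ~= n*(n-1) / 2^(k+1) when the probability is small.
$n = 1e9;                                        // about one billion revisions
$pairs = $n * ( $n - 1 ) / 2;
$md5  = $pairs / pow( 2, 128 );
$sha1 = $pairs / pow( 2, 160 );
printf( "MD5:  %.2e\nSHA1: %.2e\n", $md5, $sha1 );
// Roughly 1.5e-21 and 3.4e-31, both comfortably below "1 in 10^18".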
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde rarohde@gmail.com wrote:
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with considerable effort)
Considerable effort? I can create an MD5 collision in a few minutes on my home computer. Is there anything even remotely like this for SHA-1?
MD5 is shorter and in my experience about 25% faster to compute.
Personally I've tended to view MD5 as more than good enough in offline analyses.
For offline analyses, there's no need to change the online database tables.
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikimail@inbox.org wrote:
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde rarohde@gmail.com wrote:
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with considerable effort)
Considerable effort? I can create an MD5 collision in a few minutes on my home computer. Is there anything even remotely like this for SHA-1?
If I've been keeping up to date, the collision complexity for MD5 is about 2^21 operations, and runs in a few seconds (not minutes); and for SHA-1 it is down to about 2^52 with current results. The latter represents about 100 CPU-years, which is within the realm of supercomputers. That time will probably continue to come down if people find ways to improve the attacks on SHA-1. (The existing attacks usually require the ability to feed arbitrary binary strings into the hash function. Given that both browsers and MediaWiki will tend to reject binary data placed in an edit window, I'm not sure if any of the existing attacks could be reliably applied to MediaWiki editing.)
If collision attacks really matter we should use SHA-1. However, do any of the proposed use cases care about whether someone might intentionally inject a collision? In the proposed uses I've looked at, it seems irrelevant. The intentional collision will get flagged as a revert and the text leading to that collision would be discarded. How is that a bad thing?
It's a not a big deal, but if I understand prior comments correctly, most of the existing offline infrastructure uses MD5, so I'm wondering if there is a distinct use case for favoring SHA-1.
MD5 is shorter and in my experience about 25% faster to compute.
Personally I've tended to view MD5 as more than good enough in offline analyses.
For offline analyses, there's no need to change the online database tables.
Need? That's debatable, but one of the major motivators is the desire to have hash values in database dumps (both for revert checks and for checksums on correct data import / export). Both of those are "offline" uses, but it is beneficial to have that information precomputed and stored rather than frequently regenerated.
-Robert Rohde
On Saturday 17-09-2011 at 22:55 -0700, Robert Rohde wrote:
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikimail@inbox.org wrote:
<snip>
For offline analyses, there's no need to change the online database tables.
Need? That's debatable, but one of the major motivators is the desire to have hash values in database dumps (both for revert checks and for checksums on correct data import / export). Both of those are "offline" uses, but it is beneficial to have that information precomputed and stored rather than frequently regenerated.
If we don't have it in the online database tables, that defeats the purpose of having the value in there at all, as far as generating the XML dumps is concerned.
Recall that the dumps are generated in two passes; in the first pass we retrieve from the db and record all of the metadata about revisions, and in the second (time-consuming) pass we re-use the text of the revisions from a previous dump file if the text is in there. We want to compare the hash of that text against what the online database says the hash is; if they don't match, we want to fetch the live copy.
I refer folks to bug 23264 [1] as proof that mismatches between the metadata and the text have crept in in the past; changes to MW code in places other than the backup scripts can cause quite subtle breakage.
Ariel
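(A sketch of the second-pass check described above; the helper functions used here are placeholders, not the actual dump-script API, and it assumes the recorded hash is the plain hex SHA-1.)

<?php
// Sketch of the prefetch verification step in the second dump pass.
function textForRevision( $revId, $expectedSha1 ) {
	// Text reused from the previous dump, or null if not available there.
	$text = getPrefetchText( $revId );            // placeholder helper
	if ( $text !== null && sha1( $text ) === $expectedSha1 ) {
		return $text;                             // prefetched copy matches the recorded hash
	}
	// Mismatch, or no prefetched copy: fall back to the live text store.
	return fetchTextFromStore( $revId );          // placeholder helper
}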
On Sun, Sep 18, 2011 at 2:33 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
Recall that the dumps are generated in two passes; in the first pass we retrieve from the db and record all of the metadata about revisions, and in the second (time-consuming) pass we re-use the text of the revisions from a previous dump file if the text is in there. We want to compare the hash of that text against what the online database says the hash is; if they don't match, we want to fetch the live copy.
Well, this is exactly the type of use in which collisions do matter. Do you really want the dump to not record the correct data when some miscreant creates an intentional collision?
On 09/18/2011 08:55 AM, Robert Rohde wrote:
people find ways to improve the attacks on SHA-1. (The existing attacks usually require the ability to feed arbitrary binary strings into the hash function. Given that both browsers and MediaWiki will tend to reject binary data placed in an edit window, I'm not sure if any of the existing attacks could be reliably applied to MediaWiki editing.)
I'm pretty sure MediaWiki will accept any data that's valid UTF-8, modulo canonicalization perhaps. I'm not very familiar with the MD5 and SHA-1 collision attacks, but I wouldn't be surprised if at least some of them could be modified to use, say, only 7-bit ASCII.
If collision attacks really matter we should use SHA-1. However, do any of the proposed use cases care about whether someone might intentionally inject a collision? In the proposed uses I've looked at, it seems irrelevant. The intentional collision will get flagged as a revert and the text leading to that collision would be discarded. How is that a bad thing?
Well, if you could predict the content of a version that someone (say, a bot) was likely to save sometime in the future, and created a different revision with the same hash (say, in the sandbox or in your userspace, so that people wouldn't notice it) in advance...
Depending on just what page was targeted, the consequences could range from a minor annoyance to site-wide JS injection.
Anyway, I wouldn't suggest using either MD5 or SHA-1: both have known attacks, and it's a fundamental rule of cryptography that attacks always get better over time, never worse. Let's _at least_ use SHA-2.
(Actually, I'd suggest designing the format so that we can change hash functions in the future without having to rehash every old revision immediately. For example, we might store a hash computed using SHA-256 as "sha256:d9014c4624844aa..." or something like that.)
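(A sketch of that suggestion, purely illustrative: tag each stored value with the algorithm name so the hash function can be upgraded per revision later.)

<?php
// Sketch: store "algo:value" so the hash function can change over time.
function tagHash( $text, $algo = 'sha256' ) {
	return $algo . ':' . hash( $algo, $text );
}

function verifyTaggedHash( $text, $stored ) {
	list( $algo, $value ) = explode( ':', $stored, 2 );
	return hash( $algo, $text ) === $value;
}

// Usage: $stored = tagHash( $wikitext ); later, verifyTaggedHash( $wikitext, $stored );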
On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier robla@wikimedia.org wrote:
I generally suspect that a standard index is going to be a waste for the most urgent uses of this. It will rarely be interesting to search for common hashes between articles. The far more common case will be to search for duplicate hashes within the history of a single article. My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes.
If it's useful for Wikimedia and the index is expensive to maintain, the index can be kept on only one slave per DB cluster to minimize cost, the way we've always done for RC. If it's not useful for Wikimedia, only third-party researchers, then we could leave it unindexed on the cluster and have an index only on the toolserver or whatnot.
It's not so much that an extra index would be untenably expensive in principle, more that it's not worth the cost if it's not used for anything important. If it's just for statistical analysis and not for the live site, there's no reason to have it on all servers, except maybe administrative simplicity. The toolserver already has a bunch of indexes that Wikimedia doesn't, for exactly this sort of reason.
This is the first I've heard of partial indexes (/me researches). I don't know if a partial index is going to be cheap enough that we can use it, and useful enough that we'd want to. Would this be a faster query in a world with a partial index on the first six characters?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'
I'd think so, yes. MySQL should be smart enough to use partial indexes at least that far -- otherwise there'd be no point in supporting them. I'd think an index on the first several bytes should be almost as effective as one on the whole value, if you just want to do filtering or joining (not sorting). It might be more effective, since more would fit in RAM. However, I don't know for sure offhand.
...and would either of these queries be considered too expensive to run without a partial index? How about with a partial index?
Without a suitable index, running either of these queries would scan the entire revision history of the article in question. That would certainly not be acceptable on the cluster if the query is to be run with any frequency on arbitrary articles. It would be okay if it were just an occasional thing run on a limited number of articles, or only on articles with few revisions in their histories.
All of the above should be interpreted with the understanding that I stand a decent chance of knowing what I'm talking about, but have no say at this point and shouldn't be taken too seriously. :)
On Sat, Sep 3, 2011 at 2:20 AM, Asher Feldman afeldman@wikimedia.org wrote:
Is code written to populate rev_sha1 on each new edit?
I believe that was part of Aaron's code that got reverted, yes.
Offline generation of hashes is definitely possible, but the only reason you'd do it is to minimize the time during which some rows have an empty hash. It's not strictly offline in that the text will still have to be pulled from ES, but it can be done before the field is added.
Roan
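(Roughly what such a backfill might look like, sketched against the Revision and Database classes as I understand them circa 2011; the empty-string default, the batch size and the base-36/31-character encoding are assumptions, and this is not the actual populate script.)

<?php
// Sketch of a batched rev_sha1 backfill; not the real maintenance script.
$dbw = wfGetDB( DB_MASTER );
$batchSize = 200;
$start = 0;
do {
	$res = $dbw->select(
		'revision',
		'rev_id',
		array( 'rev_sha1' => '', "rev_id > $start" ),   // assumes '' as the unpopulated default
		'populateRevSha1Sketch',
		array( 'ORDER BY' => 'rev_id', 'LIMIT' => $batchSize )
	);
	$count = 0;
	foreach ( $res as $row ) {
		$count++;
		$start = (int)$row->rev_id;
		$rev = Revision::newFromId( $row->rev_id );
		if ( !$rev ) {
			continue;   // e.g. broken or missing revision row
		}
		// Base-36 SHA-1, padded to 31 digits (an assumption about the final format).
		$sha1 = wfBaseConvert( sha1( $rev->getText( Revision::RAW ) ), 16, 36, 31 );
		$dbw->update( 'revision',
			array( 'rev_sha1' => $sha1 ),
			array( 'rev_id' => $row->rev_id ),
			'populateRevSha1Sketch' );
	}
	wfWaitForSlaves();   // keep replication lag under control between batches
} while ( $count == $batchSize );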
So... this seems to have snuck back in a month ago: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
Have we resolved the deployment questions on how to actually do the change? Just want to make sure ops has plenty of warning before 1.19 comes down the pipe. (Especially if we have to revert anything back to 1.18 during/after!)
-- brion
On 28/11/11 08:29, Brion Vibber wrote:
So... this seems to have snuck back in a month ago: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
I don't think it really "snuck", Rob has been talking about it for a while, see e.g. comment 27.
Have we resolved the deployment questions on how to actually do the change? Just want to make sure ops has plenty of warning before 1.19 comes down the pipe. (Especially if we have to revert anything back to 1.18 during/after!)
It can be deployed like any column addition to a large table: on the slaves first, then switch masters, then on the old masters. For 1.17 we changed categorylinks (60M rows on enwiki), and that caused no problems. In 1.18 the schema changes were done by ops (Asher), and included flaggedrevs which is 30M rows on dewiki.
The revision table is 320M rows on enwiki, but it doesn't pose any special challenges, as long as there's enough disk space. The snapshot host db26 is the only host which may possibly be in danger of running out of space, but if its snapshots are deleted and the space reallocated to /a then it won't have any trouble.
Like the previous schema changes, this schema change will be done in advance of the software version change. The old version will work with the new schema, and the default value is harmless, so reverting back to 1.18 or restarting the populate script won't be a problem.
-- Tim Starling
I have no idea about the schema changes, but choosing a digest for detection of identity reverts is pretty simple. The really difficult part is to choose a locality-sensitive hash or fingerprint that works for very similar revisions with a lot of content.
I would propose that the digest be stored in the database, and that an LSH or fingerprint be calculated on the fly by the API, unless someone can find a really good way to make and store an LSH or fingerprint that has all the necessary properties.
For all the purposes I know of (and care about), the digest will be used for detection of identity reverts, while the LSH/fingerprint will be used for resynchronization after difficult partial reverts. In addition, it seems likely that fingerprints are necessary for more fine-grained analysis.
It seems like an LSH or fingerprint with the necessary properties has to scale with the amount of content, which makes it difficult to precompute and store a single value.
John
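(To make the fingerprint idea concrete, here is a sketch of one well-known locality-sensitive fingerprint, a 32-bit simhash over whitespace-separated words. Nothing in MediaWiki does this today, and whether its properties are good enough for resynchronization after partial reverts is exactly the open question here.)

<?php
// Sketch: 32-bit simhash of a revision's text. Similar texts tend to produce
// fingerprints with a small Hamming distance; identical texts produce
// identical fingerprints.
function simhash32( $text ) {
	$weights = array_fill( 0, 32, 0 );
	$words = preg_split( '/\s+/', strtolower( $text ), -1, PREG_SPLIT_NO_EMPTY );
	foreach ( $words as $word ) {
		$h = crc32( $word );
		for ( $bit = 0; $bit < 32; $bit++ ) {
			// Each word votes +1 or -1 on every bit position.
			$weights[$bit] += ( ( $h >> $bit ) & 1 ) ? 1 : -1;
		}
	}
	$fingerprint = 0;
	for ( $bit = 0; $bit < 32; $bit++ ) {
		if ( $weights[$bit] > 0 ) {
			$fingerprint |= ( 1 << $bit );
		}
	}
	return $fingerprint;
}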