Hi! I am starting this thread because Brion's revision r94541 [1] reverted r94289 [0], stating "core schema change with no discussion". Bugs 21860 [2] and 25312 [3] advocate for the inclusion of a hash column (either MD5 or SHA-1) in the revision table.
The primary use case of this column will be to assist in detecting reverts; I don't think that data integrity is the primary reason for adding it. The huge advantage of having such a column is that it will no longer be necessary to analyze full dumps to detect reverts: instead, you can look for reverts in the stub dump file by looking for the same hash occurring more than once within a single page. The fact that there is a theoretical chance of a collision is not very important IMHO; it would just mean that in very rare cases in our research we would flag an edit as reverted while it's not.
The two bug reports contain quite long discussions, and this feature has also been discussed internally quite extensively, but oddly enough that discussion hasn't happened yet on the mailing list.
So let's have a discussion!
[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94289
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/94541
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
[3] https://bugzilla.wikimedia.org/show_bug.cgi?id=25312
Best,
Diederik
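(To make the stub-dump use case concrete, here is a minimal sketch in PHP of identity-revert detection. The function name and the input shape, a chronologically ordered list of rev_id/hash pairs for one page, are purely illustrative; nothing like this exists in MediaWiki itself.)

<?php
// Sketch only: flag identity reverts within one page's history, given its
// revisions in chronological order as array( 'rev_id' => ..., 'sha1' => ... ).
function findIdentityReverts( array $revisions ) {
	$seenAt = array();   // hash => index of the earliest revision with that hash
	$reverts = array();  // rev_id of the reverting edit => list of reverted rev_ids
	foreach ( $revisions as $i => $rev ) {
		$hash = $rev['sha1'];
		if ( isset( $seenAt[$hash] ) ) {
			$undone = array();
			for ( $j = $seenAt[$hash] + 1; $j < $i; $j++ ) {
				$undone[] = $revisions[$j]['rev_id'];
			}
			if ( $undone ) {
				$reverts[$rev['rev_id']] = $undone;
			}
		} else {
			$seenAt[$hash] = $i;
		}
	}
	return $reverts;
}
// e.g. $reverts = findIdentityReverts( $pageRevisions );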
On Thu, Aug 18, 2011 at 7:40 AM, Diederik van Liere dvanliere@gmail.com wrote:
Hi! I am starting this thread because Brion's revision r94541 [1] reverted r94289 [0], stating "core schema change with no discussion".
Bumping this: What are the remaining open questions regarding this schema change? As Diederik explains, this has a number of potential benefits especially for research/analysis, and I think it makes sense to target it for 1.19 or 1.20 unless there are scary implications that haven't been surfaced yet. :-)
Would it be possible to generate offline hashes for the bulk of our revision corpus via dumps and load that into prod to minimize the time and impact of the backfill?
When using for analysis, will we wish the new columns had partial indexes (first 6 characters?)
Is code written to populate rev_sha1 on each new edit?
Bug 2939 is one relevant bug for this; it could probably use an index.
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using for analysis, will we wish the new columns had partial indexes (first 6 characters?)
On Fri, Sep 2, 2011 at 5:47 PM, Daniel Friesen lists@nadir-seen-fire.com wrote:
On 11-09-02 05:20 PM, Asher Feldman wrote:
When using for analysis, will we wish the new columns had partial indexes (first 6 characters?)
Bug 2939 is one relevant bug to this, it could probably use an index. [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=2939
I generally suspect that a standard index is going to be a waste for the most urgent uses of this. It will rarely be interesting to search for common hashes between articles. The far more common case will be to search for duplicate hashes within the history of a single article. My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes.
This is the first I've heard of partial indexes (/me researches). I don't know if a partial index is going to be cheap enough that we can use it, and useful enough that we'd want to. Would this be a faster query in a world with a partial index on the first six characters?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'
...or would we have to run a query like this to get the benefit of the index?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1 LIKE '4cdbd8%'
...and would either of these queries be considered too expensive to run without a partial index? How about with a partial index?
Rob
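(An illustrative aside, not anyone's committed design: as far as I understand MySQL, a prefix index such as (rev_page, rev_sha1(6)) is also used for plain equality comparisons on the full value, so the first form of the query should benefit without being rewritten as a LIKE. Through MediaWiki's database layer the lookup might look roughly like this; the rev_sha1 column and the prefix index are still hypothetical at this point in the thread.)

<?php
// Sketch only: exact-match lookup that a hypothetical prefix index on
// (rev_page, rev_sha1(6)) could serve. MySQL can use a prefix index to
// narrow candidates for an equality predicate on the full column value.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
	'revision',
	'rev_id',
	array(
		'rev_page' => 12345,
		'rev_sha1' => '4cdbd80be15fcfff139fb8a95f2ca359520939ee',
	),
	__METHOD__
);
foreach ( $res as $row ) {
	// Each $row->rev_id shares the same content hash on this page.
}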
On 11-09-02 09:33 PM, Rob Lanphier wrote:
My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes. ... Rob
We've got 5 normal indexes on revision:
- A unique int+int
- A binary(14)
- An int+binary(14)
- Another int+binary(14)
- And a varchar(255)+binary(14)
For that bug, a (rev_page, rev_sha1) or (rev_page, rev_timestamp, rev_sha1) index may do.
Hi,
I've suggested generating bulk checksums as well, but both Brion and Ariel see the primary purpose of this field as checking the validity of the dump-generating process, and so they want to generate the checksums straight from external storage.
In a general sense, there are two use cases for this new field:
1) Checking the validity of the XML dump files
2) Identifying reverts
I have started to work on a proposal for deployment; while it is still incomplete, it might be a good starting point for further planning. I have been trying to come up with some back-of-the-envelope calculations about how much time and space this would take, but I don't yet have all the information required to produce reasonable estimates.
You can find the proposal here: http://strategy.wikimedia.org/wiki/Proposal:Implement_and_deploy_checksum_re...
I want to thank Brion and Asher for giving feedback on prior drafts. Please feel free to improve this proposal.
Best, Diederik
PS: not sure if this proposal should be on strategy or mediawiki...
Diederik van Liere wrote:
I've suggested to generate bulk checksums as well but both Brion and Ariel see the primary purpose of this field to check the validity of the dump generating process and so they want to generate the checksums straight from the external storage.
[...]
PS: not sure if this proposal should be on strategy or mediawiki...
I think standard practice nowadays is a subpage of http://www.mediawiki.org/wiki/Requests_for_comment.
MZMcBride
2011/9/4 MZMcBride z@mzmcbride.com
I think standard practice nowadays is a subpage of http://www.mediawiki.org/wiki/Requests_for_comment.
Indeed. Moved: http://mediawiki.org/wiki/Requests_for_comment/Database_field_for_checksum_o...
– Krinkle
Thanks for moving the page. Diederik
RE: http://www.mediawiki.org/wiki/Requests_for_comment/Database_field_for_checks...
Recently, adding an MD5/SHA-1 column to the revision table (discussing r94289) was discussed.
For some applications, I use the technique of representing the 128 bits of MD5 or other checksums as base-62 character strings instead of hexadecimal (base-16) strings.
When you need a non-binary, i.e. character, representation for displaying, storing or transmitting checksums, you should always consider using base-62 instead of base-16 representations (for MD5 you need 22 bytes instead of 32 bytes).
When you encode the 128 bits of MD5 (for example) in base-16 aka /hexadecimal/, you need CHAR(32). When you use the technique of encoding the 128 bits in /base-62/ with characters from [A-Za-z0-9], you'll need CHAR(22).
See http://de.php.net/manual/en/function.md5.php#83321 for one possible implementation.
I first came across this, for example, in Microsoft's free FCIV program, which creates checksum XML files and uses this shorter representation.
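(For concreteness, one minimal way to get such a base-62 string in PHP, assuming the GMP extension is available; this is only a sketch, not the implementation from the php.net comment linked above.)

<?php
// Sketch: encode a 128-bit MD5 (hex) as base-62 using GMP.
// gmp_strval() accepts bases 2..62, using the digits 0-9, A-Z, a-z.
$text = "example revision text";
$hex = md5( $text );                                // 32 hex characters
$base62 = gmp_strval( gmp_init( $hex, 16 ), 62 );   // at most 22 characters
echo strlen( $hex ) . " vs " . strlen( $base62 ) . "\n";  // 32 vs <= 22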
On Fri, Sep 16, 2011 at 8:15 AM, Thomas Gries mail@tgries.de wrote:
For some applications, I use the technique of representing the 128 bits of MD5 or other checksums as base-62 character strings instead of hexadecimal (base-16) strings.
MediaWiki already uses a similar technique, storing SHA-1 hashes of images in base 36.
Roan
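(For illustration: the base-36 form can be produced with MediaWiki's own arbitrary-precision converter, wfBaseConvert(), rather than PHP's base_convert(). The padding width of 31 used below, enough for any 160-bit SHA-1 value, is an assumption here about how img_sha1 is stored and is not confirmed in this thread.)

<?php
// Sketch: base-36 form of a SHA-1, zero-padded to 31 digits
// (31 = ceil(160 / log2(36)), so every 160-bit value fits).
// wfBaseConvert() is MediaWiki's arbitrary-precision base converter.
$data = file_get_contents( '/path/to/some/file' );   // placeholder path
$sha1Base36 = wfBaseConvert( sha1( $data ), 16, 36, 31 );
echo strlen( $sha1Base36 ) . "\n";   // 31, vs 40 for the hex form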
On 16.09.2011 at 11:24, Roan Kattouw wrote:
For some applications, I use the technique of representing the 128 bits of MD5 or other checksums as base-62 character strings instead of hexadecimal (base-16) strings.
MediaWiki already uses a similar technique, storing SHA-1 hashes of images in base 36.
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
base 16 = CHAR(32) (md5 sum in hexadecimal)
base 36 = CHAR(25)
base 62 = CHAR(22)
Using base 62 (upper- and lower-case ASCII letters and digits) would save 3 bytes over base 36, if my calculator works correctly ;-)
On 9/16/11 9:48 AM, Thomas Gries wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
11M images in commons x 3 bytes = 33MB / ~1MB per file on average = a savings equivalent to 33 more files in Commons
In the time it took you to write this we probably had more than 33 files uploaded.
On 11-09-16 09:48 AM, Thomas Gries wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
base 16 = CHAR(32) (md5 sum in hexadecimal)
base 36 = CHAR(25)
base 62 = CHAR(22)
Using base 62 (upper- and lower-case ASCII letters and digits) would save 3 bytes over base 36, if my calculator works correctly ;-)
If we're picking apart checksum sizes in database storage, why not just go all the way and store the binary data as binary with a BINARY(16)? 16, 36, 62... what happened to base 64 anyway? PHP even has native code for that.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
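(For comparison, the raw-binary and base-64 forms are each a single call in stock PHP; whether BINARY(16)/BINARY(20) columns would play nicely with the rest of the schema and with the XML dumps is a separate question. A small sketch:)

<?php
// Sketch: the same MD5 in three shapes.
$text = "example revision text";
$hex    = md5( $text );              // CHAR(32)
$binary = md5( $text, true );        // BINARY(16), raw bytes
$b64    = base64_encode( $binary );  // CHAR(24), including '==' padding
// For SHA-1 the equivalents are sha1($text), sha1($text, true) -> BINARY(20),
// and base64_encode(...) -> CHAR(28).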
On Fri, Sep 16, 2011 at 9:48 AM, Thomas Gries mail@tgries.de wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
This format was chosen for hashes to be used as filenames for uploaded file storage (currently used only for storing deleted files, I think, but there has long been a plan to switch primary image storage to this as well some day).
For greatest compatibility with all filesystems, we only use characters that are safe (ASCII digits and letters) and don't rely on case distinctions which are not always preserved (Windows and Mac OS X systems default to case-insensitive filesystems).
The reason we're not using hex here is that a more compact representation makes the filenames, and thus any URL references including them in the path, shorter. On img_sha1 I guess we just kept using it for compatibility/similarity with the deleted file archives?
-- brion
On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries mail@tgries.de wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
I don't know, this was way, way before my time. But then, why use base 62 if you can use base 64? Encoders/decoders for that are much more widely available. PHP's base_convert() will go up to 36, I think, so you'd have to write a base 62 encoder/decoder yourself, but PHP has built-in functions to deal with base 64.
base 16 = CHAR(32) (md5 sum in hexadecimal)
base 36 = CHAR(25)
base 62 = CHAR(22)
Minor detail: I think it's more likely we'll use SHA-1 hashes rather than MD5 hashes.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
On Fri, Sep 16, 2011 at 6:48 PM, Thomas Gries mail@tgries.de wrote:
Was there a certain reason to choose base 36? Why not recode to base 62 and save 3 bytes per checksum?
I don't know, this was way, way before my time. But then, why use base 62 if you can use base 64? Encoders/decoders for that are much more widely available. PHP's base_convert() will go up to 36, I think, so you'd have to write a base 62 encoder/decoder yourself, but PHP has built-in functions to deal with base 64.
Brion explained it in a previous mail. And yes, we have our own converter in MediaWiki.
On Sat, Sep 17, 2011 at 8:26 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
Minor detail: I think it's more likely we'll use SHA-1 hashes rather than MD5 hashes.
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with considerable effort), but I wouldn't see why that would matter for the proposed use.
With only about 1 billion revisions in the collective databases, the odds of an accidental collision with either MD5 or SHA-1 are infinitesimal (less than 1 in 10^18 for the weaker MD5).
MD5 is shorter and in my experience about 25% faster to compute.
Personally I've tended to view MD5 as more than good enough in offline analyses.
-Robert Rohde
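(A back-of-the-envelope check of that figure using the standard birthday approximation, nothing more:)

<?php
// Sketch: birthday-bound estimate for n random k-bit hashes,
// P(collision) ~= n*(n-1) / 2^(k+1) when the probability is small.
$n = 1e9;                                        // about one billion revisions
$pairs = $n * ( $n - 1 ) / 2;
$md5  = $pairs / pow( 2, 128 );
$sha1 = $pairs / pow( 2, 160 );
printf( "MD5:  %.2e\nSHA1: %.2e\n", $md5, $sha1 );
// Roughly 1.5e-21 and 3.4e-31, both comfortably below "1 in 10^18".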
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde rarohde@gmail.com wrote:
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with considerable effort)
Considerable effort? I can create an MD5 collision in a few minutes on my home computer. Is there anything even remotely like this for SHA-1?
MD5 is shorter and in my experience about 25% faster to compute.
Personally I've tended to view MD5 as more than good enough in offline analyses.
For offline analyses, there's no need to change the online database tables.
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikimail@inbox.org wrote:
On Sat, Sep 17, 2011 at 6:46 PM, Robert Rohde rarohde@gmail.com wrote:
Is there a good reason to prefer SHA-1?
Both have weaknesses allowing one to construct a collision (with considerable effort)
Considerable effort? I can create an MD5 collision in a few minutes on my home computer. Is there anything even remotely like this for SHA-1?
If I've been keeping up to date, the collision complexity for MD5 is about 2^21 operations, and runs in a few seconds (not minutes); and for SHA-1 it is down to about 2^52 with current results. The latter represents about 100 CPU-years, which is within the realm of supercomputers. That time will probably continue to come down if people find ways to improve the attacks on SHA-1. (The existing attacks usually require the ability to feed arbitrary binary strings into the hash function. Given that both browsers and MediaWiki will tend to reject binary data placed in an edit window, I'm not sure if any of the existing attacks could be reliably applied to MediaWiki editing.)
If collision attacks really matter we should use SHA-1. However, do any of the proposed use cases care about whether someone might intentionally inject a collision? In the proposed uses I've looked at, it seems irrelevant. The intentional collision will get flagged as a revert and the text leading to that collision would be discarded. How is that a bad thing?
It's a not a big deal, but if I understand prior comments correctly, most of the existing offline infrastructure uses MD5, so I'm wondering if there is a distinct use case for favoring SHA-1.
MD5 is shorter and in my experience about 25% faster to compute.
Personally I've tended to view MD5 as more than good enough in offline analyses.
For offline analyses, there's no need to change the online database tables.
Need? That's debatable, but one of the major motivators is the desire to have hash values in database dumps (both for revert checks and for checksums on correct data import / export). Both of those are "offline" uses, but it is beneficial to have that information precomputed and stored rather than frequently regenerated.
-Robert Rohde
On Saturday 17-09-2011 at 22:55 -0700, Robert Rohde wrote:
On Sat, Sep 17, 2011 at 4:56 PM, Anthony wikimail@inbox.org wrote:
<snip>
For offline analyses, there's no need to change the online database tables.
Need? That's debatable, but one of the major motivators is the desire to have hash values in database dumps (both for revert checks and for checksums on correct data import / export). Both of those are "offline" uses, but it is beneficial to have that information precomputed and stored rather than frequently regenerated.
If we don't have it in the online database tables, that defeats the purpose of having the value in there at all, as far as generating the XML dumps is concerned.
Recall that the dumps are generated in two passes; in the first pass we retrieve from the db and record all of the metadata about revisions, and in the second (time-consuming) pass we re-use the text of the revisions from a previous dump file if the text is in there. We want to compare the hash of that text against what the online database says the hash is; if they don't match, we want to fetch the live copy.
I refer folks to bug 23264 [1] as proof that mismatches between the metadata and the text have crept in in the past; changes to MW code in places other than the backup scripts can cause quite subtle breakage.
Ariel
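(A sketch of the second-pass check described above; the helper functions used here are placeholders, not the actual dump-script API, and it assumes the recorded hash is the plain hex SHA-1.)

<?php
// Sketch of the prefetch verification step in the second dump pass.
function textForRevision( $revId, $expectedSha1 ) {
	// Text reused from the previous dump, or null if not available there.
	$text = getPrefetchText( $revId );            // placeholder helper
	if ( $text !== null && sha1( $text ) === $expectedSha1 ) {
		return $text;                             // prefetched copy matches the recorded hash
	}
	// Mismatch, or no prefetched copy: fall back to the live text store.
	return fetchTextFromStore( $revId );          // placeholder helper
}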
On Sun, Sep 18, 2011 at 2:33 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
Recall that the dumps are generated in two passes; in the first pass we retrieve from the db and record all of the metadata about revisions, and in the second (time-consuming) pass we re-use the text of the revisions from a previous dump file if the text is in there. We want to compare the hash of that text against what the online database says the hash is; if they don't match, we want to fetch the live copy.
Well, this is exactly the type of use in which collisions do matter. Do you really want the dump to not record the correct data when some miscreant creates an intentional collision?
On 09/18/2011 08:55 AM, Robert Rohde wrote:
people find ways to improve the attacks on SHA-1. (The existing attacks usually require the ability to feed arbitrary binary strings into the hash function. Given that both browsers and MediaWiki will tend to reject binary data placed in an edit window, I'm not sure if any of the existing attacks could be reliably applied to MediaWiki editing.)
I'm pretty sure MediaWiki will accept any data that's valid UTF-8, modulo canonicalization perhaps. I'm not very familiar with the MD5 and SHA-1 collision attacks, but I wouldn't be surprised if at least some of them could be modified to use, say, only 7-bit ASCII.
If collision attacks really matter we should use SHA-1. However, do any of the proposed use cases care about whether someone might intentionally inject a collision? In the proposed uses I've looked at, it seems irrelevant. The intentional collision will get flagged as a revert and the text leading to that collision would be discarded. How is that a bad thing?
Well, if you could predict the content of a version that someone (say, a bot) was likely to save sometime in the future, and created a different revision with the same hash (say, in the sandbox or in your userspace, so that people wouldn't notice it) in advance...
Depending on just what page was targeted, the consequences could range from a minor annoyance to site-wide JS injection.
Anyway, I wouldn't suggest using either MD5 or SHA-1: both have known attacks, and it's a fundamental rule of cryptography that attacks always get better over time, never worse. Let's _at least_ use SHA-2.
(Actually, I'd suggest designing the format so that we can change hash functions in the future without having to rehash every old revision immediately. For example, we might store a hash computed using SHA-256 as "sha256:d9014c4624844aa..." or something like that.)
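(A sketch of that suggestion, purely illustrative: tag each stored value with the algorithm name so the hash function can be upgraded per revision later.)

<?php
// Sketch: store "algo:value" so the hash function can change over time.
function tagHash( $text, $algo = 'sha256' ) {
	return $algo . ':' . hash( $algo, $text );
}

function verifyTaggedHash( $text, $stored ) {
	list( $algo, $value ) = explode( ':', $stored, 2 );
	return hash( $algo, $text ) === $value;
}

// Usage: $stored = tagHash( $wikitext ); later, verifyTaggedHash( $wikitext, $stored );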
On Sat, Sep 3, 2011 at 12:33 AM, Rob Lanphier robla@wikimedia.org wrote:
I generally suspect that a standard index is going to be a waste for the most urgent uses of this. It will rarely be interesting to search for common hashes between articles. The far more common case will be to search for duplicate hashes within the history of a single article. My understanding is that having a normal index on a table the size of our revision table will be far too expensive for db writes.
If it's useful for Wikimedia and the index is expensive to maintain, the index can be kept on only one slave per DB cluster to minimize cost, the way we've always done for RC. If it's not useful for Wikimedia, only third-party researchers, then we could leave it unindexed on the cluster and have an index only on the toolserver or whatnot.
It's not so much that an extra index would be untenably expensive in principle, more that it's not worth the cost if it's not used for anything important. If it's just for statistical analysis and not for the live site, there's no reason to have it on all servers, except maybe administrative simplicity. The toolserver already has a bunch of indexes that Wikimedia doesn't, for exactly this sort of reason.
This is the first I've heard of partial indexes (/me researches). I don't know if a partial index is going to be cheap enough that we can use it, and useful enough that we'd want to. Would this be a faster query in a world with a partial index on the first six characters?
SELECT rev_id FROM revision WHERE rev_page=12345 AND rev_sha1='4cdbd80be15fcfff139fb8a95f2ca359520939ee'
I'd think so, yes. MySQL should be smart enough to use partial indexes at least that far -- otherwise there'd be no point in supporting them. I'd think an index on the first several bytes should be almost as effective as one on the whole value, if you just want to do filtering or joining (not sorting). It might be more effective, since more would fit in RAM. However, I don't know for sure offhand.
...and would either of these queries be considered too expensive to run without a partial index? How about with a partial index?
Without a suitable index, running either of these queries would scan the entire revision history of the article in question. That would certainly not be acceptable on the cluster if the query is to be run with any frequency on arbitrary articles. It would be okay if it were just an occasional thing run on a limited number of articles, or only on articles with few revisions in their histories.
All of the above should be interpreted with the understanding that I stand a decent chance of knowing what I'm talking about, but have no say at this point and shouldn't be taken too seriously. :)
On Sat, Sep 3, 2011 at 2:20 AM, Asher Feldman afeldman@wikimedia.org wrote:
Is code written to populate rev_sha1 on each new edit?
I believe that was part of Aaron's code that got reverted, yes.
Offline generation of hashes is definitely possible, but the only reason you'd do it is to minimize the time during which some rows have an empty hash. It's not strictly offline in that the text will still have to be pulled from ES, but it can be done before the field is added.
Roan
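(Roughly what such a backfill might look like, sketched against the Revision and Database classes as I understand them circa 2011; the empty-string default, the batch size and the base-36/31-character encoding are assumptions, and this is not the actual populate script.)

<?php
// Sketch of a batched rev_sha1 backfill; not the real maintenance script.
$dbw = wfGetDB( DB_MASTER );
$batchSize = 200;
$start = 0;
do {
	$res = $dbw->select(
		'revision',
		'rev_id',
		array( 'rev_sha1' => '', "rev_id > $start" ),   // assumes '' as the unpopulated default
		'populateRevSha1Sketch',
		array( 'ORDER BY' => 'rev_id', 'LIMIT' => $batchSize )
	);
	$count = 0;
	foreach ( $res as $row ) {
		$count++;
		$start = (int)$row->rev_id;
		$rev = Revision::newFromId( $row->rev_id );
		if ( !$rev ) {
			continue;   // e.g. broken or missing revision row
		}
		// Base-36 SHA-1, padded to 31 digits (an assumption about the final format).
		$sha1 = wfBaseConvert( sha1( $rev->getText( Revision::RAW ) ), 16, 36, 31 );
		$dbw->update( 'revision',
			array( 'rev_sha1' => $sha1 ),
			array( 'rev_id' => $row->rev_id ),
			'populateRevSha1Sketch' );
	}
	wfWaitForSlaves();   // keep replication lag under control between batches
} while ( $count == $batchSize );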
So... this seems to have snuck back in a month ago: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
Have we resolved the deployment questions on how to actually do the change? Just want to make sure ops has plenty of warning before 1.19 comes down the pipe. (Especially if we have to revert anything back to 1.18 during/after!)
-- brion
On 28/11/11 08:29, Brion Vibber wrote:
So... this seems to have snuck back in a month ago: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
I don't think it really "snuck", Rob has been talking about it for a while, see e.g. comment 27.
Have we resolved the deployment questions on how to actually do the change? Just want to make sure ops has plenty of warning before 1.19 comes down the pipe. (Especially if we have to revert anything back to 1.18 during/after!)
It can be deployed like any column addition to a large table: on the slaves first, then switch masters, then on the old masters. For 1.17 we changed categorylinks (60M rows on enwiki), and that caused no problems. In 1.18 the schema changes were done by ops (Asher), and included flaggedrevs which is 30M rows on dewiki.
The revision table is 320M rows on enwiki, but it doesn't pose any special challenges, as long as there's enough disk space. The snapshot host db26 is the only host which may possibly be in danger of running out of space, but if its snapshots are deleted and the space reallocated to /a then it won't have any trouble.
Like the previous schema changes, this schema change will be done in advance of the software version change. The old version will work with the new schema, and the default value is harmless, so reverting back to 1.18 or restarting the populate script won't be a problem.
-- Tim Starling
I have no idea about the schema changes, but choosing a digest for detection of identity reverts is pretty simple. The really difficult part is to choose a locality-sensitive hash or fingerprint that works for very similar revisions with a lot of content.
I would propose that the digest be stored in the database, and that an LSH or fingerprint be calculated on the fly by the API, unless someone can find a really good way to make and store an LSH or fingerprint that has all the necessary properties.
For all the purposes I know of (and care about), the digest will be used for detection of identity reverts, while the LSH/fingerprint will be used for resynchronization after difficult partial reverts. In addition, it seems likely that fingerprints are necessary for more fine-grained analysis.
It seems like an LSH or fingerprint with the necessary properties has to scale with the amount of content, which makes it difficult to precompute and store a single value.
John
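(To make the fingerprint idea concrete, here is a sketch of one well-known locality-sensitive fingerprint, a 32-bit simhash over whitespace-separated words. Nothing in MediaWiki does this today, and whether its properties are good enough for resynchronization after partial reverts is exactly the open question here.)

<?php
// Sketch: 32-bit simhash of a revision's text. Similar texts tend to produce
// fingerprints with a small Hamming distance; identical texts produce
// identical fingerprints.
function simhash32( $text ) {
	$weights = array_fill( 0, 32, 0 );
	$words = preg_split( '/\s+/', strtolower( $text ), -1, PREG_SPLIT_NO_EMPTY );
	foreach ( $words as $word ) {
		$h = crc32( $word );
		for ( $bit = 0; $bit < 32; $bit++ ) {
			// Each word votes +1 or -1 on every bit position.
			$weights[$bit] += ( ( $h >> $bit ) & 1 ) ? 1 : -1;
		}
	}
	$fingerprint = 0;
	for ( $bit = 0; $bit < 32; $bit++ ) {
		if ( $weights[$bit] > 0 ) {
			$fingerprint |= ( 1 << $bit );
		}
	}
	return $fingerprint;
}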