brion@svn.wikimedia.org schreef:
Revision: 40736 Author: brion Date: 2008-09-12 00:16:41 +0000 (Fri, 12 Sep 2008)
Log Message:
some first steps towards my horrifically evil code review tool to integrate in the wiki (doesn't work yet :)
Sounds like a very interesting project.
<snip>
- -- Update type: Modify, Add, Delete, Rename
- cp_action enum ('M','A','D','R'),
Does rename even exist in SVN? Correct me if I'm wrong, but I always thought SVN implemented renaming as a combination of copy and delete. Speaking about copy, do you have a way to store and display stuff like (Copied from foo) for added files?
+-- And for our commenting system... +-- To specify follow-up relationships... +CREATE TABLE /*$wgDBprefix*/code_relations (
- cf_repo_id int not null,
- -- -> cr_id
- cf_from int not null,
- -- -> cr_id
- cf_to int not null,
- primary key (cf_repo_id, cf_from, cf_to)
+) /*$wgDBTableOptions*/;
Why didn't you just add cr_prev and cr_next to code_revisions? Also, the cf_ prefix is kind of a weird choice here, since there's no 'f' in code_relations.
+-- Freetext tagging for revisions +CREATE TABLE /*$wgDBprefix*/code_tags (
- ct_repo_id int not null,
- ct_rev_id int not null,
- ct_tag varbinary(255) not null,
- primary key (ct_repo_id,ct_rev_id,ct_tag),
- key (ct_repo_id,ct_tag,ct_rev_id)
+) /*$wgDBTableOptions*/;
What exactly are these tags going to do? Native support for marking as reviewed or reverted would be nice. Also, the only indices on this table are (ct_repo_id, ct_rev_id, ct_tag) and (ct_repo_id, ct_tag, ct_rev_id). Wouldn't (ct_tag, ct_repo_id, ct_rev_id) be useful too (e.g. for finding all revisions in all repos with a certain tag)?
+CREATE TABLE /*$wgDBprefix*/code_comment (
- -- Unique ID of the comment within the system.
- cc_id int auto_increment not null,
- -- Repo and code revision this comment is attached to
- cc_repo_id int not null,
- cc_rev_id int not null,
- -- Wikitext blob of the comment.
- -- FIXME: Consider using standard text store?
- cc_text blob,
Does 'standard' text store mean the text table or the external disk thingy? I would recommend using the same class that stores revision text, so this extension doesn't have to worry about whether external storage is enabled or not.
- -- Timestamps of threaded parent and self to present a
- -- convenient threaded sort order:
- -- "20080130123456"
- -- "20080130123456,20080230123456"
- -- "20080430123456"
- --
- -- Allows 17 levels of nesting before we hit the length limit.
- -- Could redo more compactly to get 31 or 63 levels.
- cc_sortkey varbinary(255),
Nice one. What will happen when the limit is hit, though?
- -- Does this comment confer a review sum?
- -- 0, +1, -1
- cc_review int,
- primary key (cc_id),
- key (cc_repo_id,cc_rev_id,cc_sortkey)
+) /*$wgDBTableOptions*/;
You might wanna add a cr_review field holding the sum of all these cc_review's to the code_rev table, along with an index. With that, the review sum will always be correct even if not all comments are displayed, and it'll be possible to search for e.g. all revisions with a negative review sum.
About the extension as a whole: are new revisions added to the system the second they're made, or are they imported on a regular basis? Also, it would be nice to have some kind of e-mail notification feature (either through the wiki's e-mail system or through a mailing list).
Roan Kattouw (Catrope)
On Fri, Sep 12, 2008 at 8:47 AM, Roan Kattouw roan.kattouw@home.nl wrote:
- -- Timestamps of threaded parent and self to present a
- -- convenient threaded sort order:
- -- "20080130123456"
- -- "20080130123456,20080230123456"
- -- "20080430123456"
- --
- -- Allows 17 levels of nesting before we hit the length limit.
- -- Could redo more compactly to get 31 or 63 levels.
- cc_sortkey varbinary(255),
Nice one. What will happen when the limit is hit, though?
That *is* clever. LiquidThreads should consider something like that for thread sorting (which AFAIK must currently be done in the application). I don't think it's a big issue to limit thread depth to 63 levels, but in principle you could extend it somewhat further, up to the maximum index length.
Aryeh Gregor schreef:
On Fri, Sep 12, 2008 at 8:47 AM, Roan Kattouw roan.kattouw@home.nl wrote:
- -- Timestamps of threaded parent and self to present a
- -- convenient threaded sort order:
- -- "20080130123456"
- -- "20080130123456,20080230123456"
- -- "20080430123456"
- --
- -- Allows 17 levels of nesting before we hit the length limit.
- -- Could redo more compactly to get 31 or 63 levels.
- cc_sortkey varbinary(255),
Nice one. What will happen when the limit is hit, though?
That *is* clever. LiquidThreads should consider something like that for thread sorting (which AFAIK must currently be done in the application). I don't think it's a big issue to limit thread depth to 63 levels, but in principle you could extend it somewhat further, up to the maximum index length.
Watch out for multibyte collations there. In the past, MW has had numerous issues with multibyte collations causing index lengths to almost double up and hit the limit.
Roan Kattouw (Catrope)
On Fri, Sep 12, 2008 at 9:27 AM, Roan Kattouw roan.kattouw@home.nl wrote:
Watch out for multibyte collations there. In the past, MW has had numerous issues with multibyte collations causing index lengths to almost double up and hit the limit.
VARBINARY doesn't have collations. Which is part of why we use it for everything. (Well, that and because Wikipedia runs on 4.0, which doesn't have collations for anything. :) ) A utf8 collation will triple the maximum size of the index, yes.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Roan Kattouw wrote:
- -- Update type: Modify, Add, Delete, Rename
- cp_action enum ('M','A','D','R'),
Does rename even exist in SVN? Correct me if I'm wrong, but I always thought SVN implemented renaming as a combination of copy and delete.
Looks like I was making shit up and guessed wrong -- it's actually "replace".
http://us.php.net/manual/en/function.svn-log.php
Speaking about copy, do you have a way to store and display stuff like (Copied from foo) for added files?
SVN's log and diff output are pretty "dumb" by default for moves and show an entire-file removal followed by an entire-file creation.
One could probably detect these cases and show something nicer.
+-- And for our commenting system... +-- To specify follow-up relationships... +CREATE TABLE /*$wgDBprefix*/code_relations (
- cf_repo_id int not null,
- -- -> cr_id
- cf_from int not null,
- -- -> cr_id
- cf_to int not null,
- primary key (cf_repo_id, cf_from, cf_to)
+) /*$wgDBTableOptions*/;
Why didn't you just add cr_prev and cr_next to code_revisions?
Because that would be totally insufficient. How many of your code changes only relate to *one* revision before them and *one* after them? I know a lot of mine have many, many more relations. Take this example case:
* r100 introduces a bug * r200 makes a broken fix for the bug in r100 * r201 tweaks r200, which fixes r100 * r202 tweaks r200/201, which fixes r100 * r230 reverts r200, r201, r202 (which fix r100) cause it breaks something else * r285 restores the r200/r201/r202 fix for r100, but with additional fixes so it's right
A single "prev" and "next" don't really indicate the relationships to which other revs you want to look at.
Also, the cf_ prefix is kind of a weird choice here, since there's no 'f' in code_relations.
Um, it might have made since at some point. ;)
+-- Freetext tagging for revisions +CREATE TABLE /*$wgDBprefix*/code_tags (
What exactly are these tags going to do? Native support for marking as reviewed or reverted would be nice.
A few possible tags off the top of my head:
reviewed, reverted, sucks, awesome, cleanup, i18n, fixme, backport, branchpoint, suspicious, revertme, funny, blogme, major, feature, regression, bc, compat, wtf
Also, the only indices on this table are (ct_repo_id, ct_rev_id, ct_tag) and (ct_repo_id, ct_tag, ct_rev_id). Wouldn't (ct_tag, ct_repo_id, ct_rev_id) be useful too (e.g. for finding all revisions in all repos with a certain tag)?
Prolly good, yes. :)
- -- Wikitext blob of the comment.
- -- FIXME: Consider using standard text store?
- cc_text blob,
Does 'standard' text store mean the text table or the external disk thingy? I would recommend using the same class that stores revision text, so this extension doesn't have to worry about whether external storage is enabled or not.
Right, that's what encapsulation is for. ;)
These comments, unlike page text, though are generally going to be smallish and won't have multiple revisions -- so no batch gzip/diff benefits.
- -- Timestamps of threaded parent and self to present a
- -- convenient threaded sort order:
- -- "20080130123456"
- -- "20080130123456,20080230123456"
- -- "20080430123456"
- --
- -- Allows 17 levels of nesting before we hit the length limit.
- -- Could redo more compactly to get 31 or 63 levels.
- cc_sortkey varbinary(255),
Nice one. What will happen when the limit is hit, though?
Well, sorting breaks. :) Probably would want to have it just not let you add child replies beyond the arbitrary depth limit (boo-hoo).
You might wanna add a cr_review field holding the sum of all these cc_review's to the code_rev table, along with an index. With that, the review sum will always be correct even if not all comments are displayed, and it'll be possible to search for e.g. all revisions with a negative review sum.
Considered it, haven't bothered yet. ;)
About the extension as a whole: are new revisions added to the system the second they're made, or are they imported on a regular basis? Also, it would be nice to have some kind of e-mail notification feature (either through the wiki's e-mail system or through a mailing list).
There are a few possibilities: * check for new revs on-demand * periodic cron check * use a notification event to trigger a check for updates (like existing IRC and mail notifications)
A post-commit hook in SVN could easily ping the wiki to go grab the latest revisions, for instance.
- -- brion
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Also, the cf_ prefix is kind of a weird choice here, since there's no 'f' in code_relations.
Um, it might have made since at some point. ;)
It was at some point "followup"
- -- brion
Speaking about copy, do you have a way to store and display
stuff like
(Copied from foo) for added files?
SVN's log and diff output are pretty "dumb" by default for moves and show an entire-file removal followed by an entire-file creation.
Seem to remember that SVN tracked them, from when played with talking to SVN servers direct over HTTP.
For added & replaced operations it returned copyfrom-path & copyfrom-revision attributes in the returned XML.
Jared
Brion Vibber schreef:
Why didn't you just add cr_prev and cr_next to code_revisions?
Because that would be totally insufficient. How many of your code changes only relate to *one* revision before them and *one* after them? I know a lot of mine have many, many more relations. Take this example case:
- r100 introduces a bug
- r200 makes a broken fix for the bug in r100
- r201 tweaks r200, which fixes r100
- r202 tweaks r200/201, which fixes r100
- r230 reverts r200, r201, r202 (which fix r100) cause it breaks
something else
- r285 restores the r200/r201/r202 fix for r100, but with additional
fixes so it's right
A single "prev" and "next" don't really indicate the relationships to which other revs you want to look at.
Right, I forgot the fact that you can have complex trees with nodes (revisions) having multiple parents and multiple children.
A few possible tags off the top of my head:
reviewed, reverted, sucks, awesome, cleanup, i18n, fixme, backport, branchpoint, suspicious, revertme, funny, blogme, major, feature, regression, bc, compat, wtf
Basically, these tags can be divided into two categories: tags that link the revision to other revisions, and those that don't. This got me thinking about merging the code_relations and code_tags tables for the first kind of tags.
To clarify using your earlier example: * r100 needs to be fixed (fixme tag) * r200 fixes (fix tag) r100 (parent) and r100 was fixed (fixed tag) in r200 (child) * r201 tweaks (cleanup tag) r200 (parent), and r200 was cleaned up (cleanedup tag) in r201 (child) * r202 tweaks (cleanup tag) r200 and r201 (parents), and r200 and r201 were cleaned up (cleanedup tag) in r202 (child) * r230 reverts (revert tag) r200, r201 and r202 (parents), and r200, r201 and r202 were reverted (reverted tag) in r230 (child) * r285 reverts (revert tag) r230 (parent) and r230 was reverted (reverted tag) in r285 * r285 fixes (fix tag) r100, r200, r201 and r202 (parents), and r100, r200, r201 and r202 were fixed (fixed tag) in r285
This calls for a table that defines not only parent-child relationships, but also the nature of those relationships (using tags). That table could look like:
CREATE TABLE code_relations ( -- Key to repo_id rel_repo_id INT NOT NULL,
-- Key to cr_id rel_from INT NOT NULL,
-- Key to cr_id -- NULL for single tags without a parent-child relationship rel_to INT NULL,
-- Tag that applies to this relationship rel_tag VARBINARY(255) NOT NULL,
PRIMARY KEY(rel_repo_id, rel_from, rel_to, rel_tag), KEY rel_to(rel_repo_id, rel_to, rel_from, rel_tag), KEY rel_repotag(rel_repo_id, rel_tag, rel_from, rel_to), KEY rel_tag(rel_tag, rel_repo_id, rel_from, rel_to) );
(Come to think of it: you might wanna name your indices so they can be FORCE INDEXed if necessary.)
In this case, the rows would look like (assuming repo_id=0): (0, 100, NULL, 'fixme'), (0, 100, 200, 'fixed'), (0, 200, 100, 'fix'), ... (0, 230, 285, 'reverted'), (0, 285, 230, 'revert'), (0, 285, 100, 'fix'), ... (0, 285, 202, 'fix'), (0, 100, 285, 'fixed'), ... (0, 202, 285, 'fixed')
These comments, unlike page text, though are generally going to be smallish and won't have multiple revisions -- so no batch gzip/diff benefits.
True.
Nice one. What will happen when the limit is hit, though?
Well, sorting breaks. :) Probably would want to have it just not let you add child replies beyond the arbitrary depth limit (boo-hoo).
That's what I was thinking. It's not ideal, but if the depth limit is high enough (the 17 you mentioned is probably enough), a limit violation probably won't happen too often. Also, you can't really blame the software for not allowing you to add an 18th-level comment: since you're probably gonna put deeper-nested comments further to the right, such comments would look horribly ugly anyway (I don't even wanna think about what 64 levels would look like), and 17 is already quite a lot of comments for one revision. Such large discussions probably shouldn't take place on CodeReview, but on wikitech-l or BugZilla.
There are a few possibilities:
- check for new revs on-demand
- periodic cron check
- use a notification event to trigger a check for updates (like existing
IRC and mail notifications)
A post-commit hook in SVN could easily ping the wiki to go grab the latest revisions, for instance.
I was personally thinking about the third option, since that's how the existing notifications work.
Roan Kattouw (Catrope)
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Roan Kattouw wrote: [snip]
This calls for a table that defines not only parent-child relationships, but also the nature of those relationships (using tags). That table could look like:
*nod* I was considering a relationship type tag in the relations table, but wasn't going to implement it at first to simplify things.
If you wanna help go right ahead though, it'd be handy. :D
- -- brion
Brion Vibber schreef:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Roan Kattouw wrote: [snip]
This calls for a table that defines not only parent-child relationships, but also the nature of those relationships (using tags). That table could look like:
*nod* I was considering a relationship type tag in the relations table, but wasn't going to implement it at first to simplify things.
If you wanna help go right ahead though, it'd be handy. :D
I've got another project that's nearly finished, so not right now. But I'm sure I'll get involved in this some time. Might as well add an API module :D
Roan Kattouw (Catrope)
On Fri, Sep 12, 2008 at 5:26 PM, Roan Kattouw roan.kattouw@home.nl wrote:
(Come to think of it: you might wanna name your indices so they can be FORCE INDEXed if necessary.)
Indexes are named after their first column, by default. If there's a conflict, MySQL appends _2, _3, etc. (or maybe it starts at _1, I can't recall).
Aryeh Gregor schreef:
On Fri, Sep 12, 2008 at 5:26 PM, Roan Kattouw roan.kattouw@home.nl wrote:
(Come to think of it: you might wanna name your indices so they can be FORCE INDEXed if necessary.)
Indexes are named after their first column, by default. If there's a conflict, MySQL appends _2, _3, etc. (or maybe it starts at _1, I can't recall).
Still, it would be cleaner and clearer to explicitly name them, like we do in tables.sql
Roan Kattouw (Catrope)
wikitech-l@lists.wikimedia.org