Apparently the archives for the list ate everything but the GPG signature, so here it is again as plaintext:
On Dec 10, 2003, at 16:33, Brion Vibber wrote:
There's been talk for some time of reworking the cur/old split into a table of page information and a table of edit revision information. Attached is a proposed (untested!) schema for page and revisions tables and the transformation from the current ones.
The main expected benefit is to simplify code that has to deal with both old and current revisions: recentchanges, contribs, history, and of course saving of pages, renaming, etc. Instead of jumping through hoops to deal with the same kind of data in cur and old, all same-purpose data should stay in the same place.
This will simplify renames by avoiding the need to touch the revisions table; only page (and of course links, searchindex...). Saves won't have to duplicate information from cur to old; just insert the new row to revisions and update the current revision pointer in page.
Additionally, the current revisions of pages will have an ID number consistent with the old revisions, so we can create revision-specific citation URLs. [Though the deletion system will need to be munged to make it possible to preserve these across delete/undelete cycles.]
A few notes: I think we can get rid of inverse_timestamp. It was needed on MySQL 3.x because it didn't support reverse sorting on indexes and was dog slow; but MySQL 4.0 has been 'in production' for a while now and if you're running a large wiki on 3.x it's your own fault. :)
The schema includes a place for a title sort key in page. With appropriate munging (see docs somewhere on unicode.org) this can be used to get language-appropriate sorting of titles in display. That would be a good thing.
There's also a rev_prior key to link to the previous revision. The old table had this back in phase 2 but it wasn't generally used and got dropped; however this is needed to make it feasible to put diff links into user contribs and other places where we don't necessarily have the next revision sitting in a buffer already.
I've dropped the unused old_flags (though maybe we should have another flag field to indicate future compression options -- gzip and diff have been suggested). Also cur_is_new is dropped in favor of rev_prior; you could key on rev_prior=0 to find new versions.
Comments? Ideas? Complaints?
-- brion vibber (brion @ pobox.com) <patch-revisions.sql>
-- Change cur/old into page/revisions
-- page_sortkey will contain a munged version of the title for --- language-specific sorting -- page_current is a foreign key on revisions.rev_id DROP TABLE IF EXISTS page; CREATE TABLE page ( page_id int(8) unsigned NOT NULL auto_increment, page_namespace tinyint(2) unsigned NOT NULL default '0', page_title varchar(255) binary NOT NULL default '', page_sortkey varchar(255) binary NOT NULL default '', page_restrictions tinyblob NOT NULL default '', page_counter bigint(20) unsigned NOT NULL default '0', page_is_redirect tinyint(1) unsigned NOT NULL default '0', page_random real unsigned NOT NULL, page_touched char(14) binary NOT NULL default '', page_current int(8) unsigned NOT NULL default 0, UNIQUE KEY page_id (page_id), UNIQUE KEY namespace_title (page_namespace,page_title), KEY (page_random), );
-- rev_page is a foreign key to page.page_id -- rev_prior will key to another rev_id for the previous revision, to simplify -- making diff links etc DROP TABLE IF EXISTS revisions; CREATE TABLE revisions ( rev_id int(8) unsigned NOT NULL auto_increment, rev_page int(8) unsigned NOT NULL default 0, rev_prior int(8) unsigned NOT NULL default 0, rev_text mediumtext NOT NULL default '', rev_comment tinyblob NOT NULL default '', rev_user int(5) unsigned NOT NULL default '0', rev_user_text varchar(255) binary NOT NULL, rev_timestamp char(14) binary NOT NULL default '', rev_minor_edit tinyint(1) NOT NULL default '0', rev_current tinyint(1) NOT NULL default 0, UNIQUE KEY rev_id (rev_id), ADD INDEX (rev_timestamp), ADD INDEX page_timestamp (rev_page,rev_timestamp), ADD INDEX user_timestamp (rev_user,rev_timestamp), ADD INDEX usertext_timestamp (rev_user_text,rev_timestamp); );
-- Import old revisisons from old to revisions INSERT INTO revisions (rev_id,rev_page,rev_text,rev_comment,rev_user,rev_user_text, rev_timestamp,rev_minor_edit,rev_flags) SELECT old_id,cur_id,old_text,old_comment,old_user,old_user_text, old_timestamp,old_minor_edit,old_flags) FROM old,cur WHERE old_namespace=cur_namespace and old_title=cur_title;
-- Import cur revisions from cur to revisions -- This will create new rev_id revisions keys. INSERT INTO revisions (rev_page,rev_text,rev_comment,rev_user,rev_user_text, rev_timestamp,rev_minor_edit,rev_cur_edit) SELECT cur_id,cur_text,cur_comment,cur_user,cur_user_text, cur_timestamp,cur_minor_edit FROM cur;
-- Import page definitions from cur to page INSERT INTO page (page_id,page_namespace,page_title,page_restrictions, page_counter,page_is_redirect,page_random,page_touched, page_current) SELECT cur_id,cur_namespace,cur_title,cur_restrictions, cur_counter,cur_is_redirect,cur_random,cur_touched, rev_id FROM cur,revisions WHERE cur_id=rev_page and rev_is_current;
-- rev_prior and page_sortkey need to be filled out by a php script?
Brion Vibber wrote:
Comments? Ideas? Complaints?
-- brion vibber (brion @ pobox.com) <patch-revisions.sql>
-- Change cur/old into page/revisions
-- page_sortkey will contain a munged version of the title for --- language-specific sorting -- page_current is a foreign key on revisions.rev_id DROP TABLE IF EXISTS page; CREATE TABLE page ( page_id int(8) unsigned NOT NULL auto_increment, page_namespace tinyint(2) unsigned NOT NULL default '0', page_title varchar(255) binary NOT NULL default '', page_sortkey varchar(255) binary NOT NULL default '', page_restrictions tinyblob NOT NULL default '', page_counter bigint(20) unsigned NOT NULL default '0', page_is_redirect tinyint(1) unsigned NOT NULL default '0', page_random real unsigned NOT NULL, page_touched char(14) binary NOT NULL default '', page_current int(8) unsigned NOT NULL default 0, UNIQUE KEY page_id (page_id), UNIQUE KEY namespace_title (page_namespace,page_title), KEY (page_random), );
-- rev_page is a foreign key to page.page_id -- rev_prior will key to another rev_id for the previous revision, to simplify -- making diff links etc DROP TABLE IF EXISTS revisions; CREATE TABLE revisions ( rev_id int(8) unsigned NOT NULL auto_increment, rev_page int(8) unsigned NOT NULL default 0, rev_prior int(8) unsigned NOT NULL default 0, rev_text mediumtext NOT NULL default '', rev_comment tinyblob NOT NULL default '', rev_user int(5) unsigned NOT NULL default '0', rev_user_text varchar(255) binary NOT NULL, rev_timestamp char(14) binary NOT NULL default '', rev_minor_edit tinyint(1) NOT NULL default '0', rev_current tinyint(1) NOT NULL default 0, UNIQUE KEY rev_id (rev_id), ADD INDEX (rev_timestamp), ADD INDEX page_timestamp (rev_page,rev_timestamp), ADD INDEX user_timestamp (rev_user,rev_timestamp), ADD INDEX usertext_timestamp (rev_user_text,rev_timestamp); );
-- Import old revisisons from old to revisions INSERT INTO revisions (rev_id,rev_page,rev_text,rev_comment,rev_user,rev_user_text, rev_timestamp,rev_minor_edit,rev_flags) SELECT old_id,cur_id,old_text,old_comment,old_user,old_user_text, old_timestamp,old_minor_edit,old_flags) FROM old,cur WHERE old_namespace=cur_namespace and old_title=cur_title;
-- Import cur revisions from cur to revisions -- This will create new rev_id revisions keys. INSERT INTO revisions (rev_page,rev_text,rev_comment,rev_user,rev_user_text, rev_timestamp,rev_minor_edit,rev_cur_edit) SELECT cur_id,cur_text,cur_comment,cur_user,cur_user_text, cur_timestamp,cur_minor_edit FROM cur;
-- Import page definitions from cur to page INSERT INTO page (page_id,page_namespace,page_title,page_restrictions, page_counter,page_is_redirect,page_random,page_touched, page_current) SELECT cur_id,cur_namespace,cur_title,cur_restrictions, cur_counter,cur_is_redirect,cur_random,cur_touched, rev_id FROM cur,revisions WHERE cur_id=rev_page and rev_is_current;
-- rev_prior and page_sortkey need to be filled out by a php script?
I like it, I just have one question, and pardon my ignorance, but why the use of binary fields? are they faster or smaller or less easly corrupted?
In any case, good work Brion, this schema will simplify things and make the overall code a lot cleaner.
Lightning
On Dec 10, 2003, at 22:19, Lightning wrote:
I like it, I just have one question, and pardon my ignorance, but why the use of binary fields? are they faster or smaller or less easly corrupted?
'binary' on char and varchar fields just tells it not to try to match those fields in a case-insensitive fashion (which it would muck up anyway, having a poor understanding of multiple character sets and no knowledge of Unicode).
-- brion vibber (brion @ pobox.com)
"BV" == Brion Vibber brion@pobox.com writes:
BV> Comments? Ideas? Complaints?
Well, my first feeling is that this looks very, very sweet and I think it is probably a big step in the right direction.
My second feeling is that maybe we should brainstorm a bit about some common features of the software, and make sure that we know how they're going to work with this new schema. Some things off the top of my head:
* Given a title, display a page * When displaying a page, show broken links * Display an old version of a page * Display the history of a page * Display the diffs between two version * Save a new page * Save a new version of an existing page * RecentChanges * Newpages * Show user contributions * Full-text search * Go button * Long pages, short pages * Stub detection * Redirecting * What links here
As I'm going through this list, nothing here sticks out at me as being particularly awkward in the design you gave. Which is probably good.
Oh, OK, so, I have one more feeling, which is: these are pretty major tables, so this would be a pretty sweeping change. It'll destabilize the software for a while, with only some marginal benefit to end-users (better performance, probably, and probably some more robust query features).
Anyways, my $0.02.
~ESP
wikitech-l@lists.wikimedia.org