There's been talk for some time of reworking the cur/old split into a table of page information and a table of edit revision information. Attached is a proposed (untested!) schema for page and revisions tables and the transformation from the current ones.
The main expected benefit is to simplify code that has to deal with both old and current revisions: recentchanges, contribs, history, and of course saving of pages, renaming, etc. Instead of jumping through hoops to deal with the same kind of data in cur and old, all same-purpose data should stay in the same place.
This will also simplify renames: there's no need to touch the revisions table, only page (and of course links, searchindex...). Saves won't have to duplicate information from cur to old; just insert the new row into revisions and update the current-revision pointer in page.
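To make the save path concrete, here's a toy sketch in Python/SQLite of what the split might look like. All table and column names (page_latest, rev_prior, etc.) are placeholders of mine, not the attached schema:

```python
import sqlite3

# Toy version of the proposed split: page holds per-title data plus a
# pointer to its current revision; revision holds one row per edit.
# Names here are illustrative placeholders, not the attached schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id        INTEGER PRIMARY KEY,
    page_namespace INTEGER NOT NULL,
    page_title     TEXT NOT NULL,
    page_latest    INTEGER NOT NULL DEFAULT 0   -- current revision pointer
);
CREATE TABLE revision (
    rev_id        INTEGER PRIMARY KEY,
    rev_page      INTEGER NOT NULL REFERENCES page(page_id),
    rev_text      TEXT NOT NULL,
    rev_comment   TEXT,
    rev_timestamp TEXT NOT NULL,
    rev_prior     INTEGER NOT NULL DEFAULT 0    -- previous revision; 0 = new page
);
""")

def save_page(conn, page_id, text, comment, timestamp):
    """Save = insert one revision row, repoint the page. No cur->old copy."""
    (prior,) = conn.execute(
        "SELECT page_latest FROM page WHERE page_id = ?", (page_id,)
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO revision (rev_page, rev_text, rev_comment,"
        " rev_timestamp, rev_prior) VALUES (?, ?, ?, ?, ?)",
        (page_id, text, comment, timestamp, prior),
    )
    conn.execute(
        "UPDATE page SET page_latest = ? WHERE page_id = ?",
        (cur.lastrowid, page_id),
    )
    return cur.lastrowid

conn.execute("INSERT INTO page (page_id, page_namespace, page_title)"
             " VALUES (1, 0, 'Sandbox')")
r1 = save_page(conn, 1, "first draft", "new page", "20031210170600")
r2 = save_page(conn, 1, "second draft", "tweak", "20031210170700")
```

A rename is then a single UPDATE of page_title (plus the link tables); the revision rows never move.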
Additionally, the current revisions of pages will have an ID number consistent with the old revisions, so we can create revision-specific citation URLs. [Though the deletion system will need to be munged to make it possible to preserve these across delete/undelete cycles.]
A few notes: I think we can get rid of inverse_timestamp. It was needed on MySQL 3.x, which didn't support reverse sorting on indexes and was dog slow at descending sorts; but MySQL 4.0 has been 'in production' for a while now, and if you're running a large wiki on 3.x it's your own fault. :)
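For illustration, the descending sort that inverse_timestamp used to fake can simply be asked for directly (sketched in Python/SQLite with made-up names; on MySQL 4.0 the same ORDER BY ... DESC can walk the timestamp index backwards):

```python
import sqlite3

# Made-up names; the point is only the ordering. With reverse index
# scans available, history queries sort on the real timestamp column
# descending instead of keeping a second, inverted column around.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,"
             " rev_timestamp TEXT NOT NULL)")
conn.execute("CREATE INDEX rev_ts ON revision (rev_timestamp)")
conn.executemany("INSERT INTO revision VALUES (?, ?)",
                 [(1, "20031208120000"),
                  (2, "20031209120000"),
                  (3, "20031210120000")])

# Newest first, straight off the index -- no inverse_timestamp needed.
history = [ts for (ts,) in conn.execute(
    "SELECT rev_timestamp FROM revision ORDER BY rev_timestamp DESC")]
```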
The schema includes a place for a title sort key in page. With appropriate munging (see docs somewhere on unicode.org) this can be used to get language-appropriate sorting of titles in display. That would be a good thing.
There's also a rev_prior key to link to the previous revision. The old table had this back in phase 2, but it wasn't generally used and got dropped; however, it's needed to make it feasible to put diff links into user contribs and other places where we don't necessarily already have the next revision sitting in a buffer.
I've dropped the unused old_flags (though maybe we should have another flag field to indicate future compression options -- gzip and diff have been suggested). Also cur_is_new is dropped in favor of rev_prior; you could key on rev_prior=0 to find new versions.
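A sketch of both uses of rev_prior, again in Python/SQLite with placeholder names: diff links for a contribs listing come straight off each row, and rev_prior = 0 takes over cur_is_new's job:

```python
import sqlite3

# Placeholder names, not the actual schema. Each revision row carries
# the id of the revision it replaced (0 if it started the page).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,"
             " rev_user TEXT NOT NULL, rev_prior INTEGER NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO revision VALUES (?, ?, ?)",
                 [(1, "Brion", 0),   # started a page
                  (2, "Erik", 1),
                  (3, "Brion", 2)])

# Contribs can emit diff links (old=rev_prior, new=rev_id) without
# fetching the neighbouring revision at all.
diffs = conn.execute(
    "SELECT rev_id, rev_prior FROM revision"
    " WHERE rev_user = 'Brion' ORDER BY rev_id"
).fetchall()

# cur_is_new replacement: revisions with rev_prior = 0 created a page.
new_pages = conn.execute(
    "SELECT rev_id FROM revision WHERE rev_prior = 0"
).fetchall()
```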
Comments? Ideas? Complaints?
-- brion vibber (brion @ pobox.com)
Brion-
> Comments? Ideas? Complaints?
Possible performance drawbacks when dealing with huge MySQL tables? Limits?
More time needed to fix corruption? (If corruption occurs in CUR, fixing that table should be enough. If it occurs in the big-ass table for all pages, prepare to wait.)
More difficult backup+table export (for those who want just CUR and not OLD), but that can be dealt with.
Regards,
Erik
On Dec 10, 2003, at 17:06, Erik Moeller wrote:
> Brion-
> Comments? Ideas? Complaints?
> Possible performance drawbacks when dealing with huge MySQL tables?
The revisions table would be about the size of the present old table. It would have some more rows, but would no longer have the overhead of duplicate title information. Meanwhile, some functions that deal with page titles but don't care about their contents (whatlinkshere, allpages, orphans, etc) may well be faster by having a smaller page table instead of the text-bloated cur.
> Limits?
The maximum size of an InnoDB table (really of the total InnoDB tablespace) is apparently 4 billion pages, at the default page size of 16 KB. That's 64 terabytes.
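The arithmetic checks out if "4 billion" really means 2^32 (a guess on my part, since that's the natural 32-bit page-number limit):

```python
# 2**32 pages of 16 KiB each -- assuming "4 billion" is really 2**32.
pages = 2 ** 32
page_size = 16 * 1024            # 16 KiB
total_tib = pages * page_size / 2 ** 40
# total_tib is 64.0, i.e. the 64 terabytes quoted above
```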
As the wiki is edited more and more, the relative number of current revisions compared to old revisions will continue to shrink; storage-related limits wouldn't be significantly different than with the present schema.
> More time needed to fix corruption? (If corruption occurs in CUR, fixing that table should be enough. If it occurs in the big-ass table for all pages, prepare to wait.)
And if there's corruption in the big old table in the old schema? Or the smaller page table in the new schema?
> More difficult backup+table export (for those who want just CUR and not OLD), but that can be dealt with.
It's already inconvenient for anyone who wants regular updates; there's a lot of work that can be done on that front. A format that can be sensibly read and organized, rather than one big dump of SQL statements, might be useful.
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org