On 1/18/08, Roan Kattouw <roan.kattouw@home.nl> wrote:
> Two improvements could be made:
> - Finally start using the rev_deleted field rather than the archive
> table. This changes the INSERT SELECT to an UPDATE ... WHERE on the revision table. This only affects N rows rather than 2N, and doesn't require SELECTing any rows.
It would still take much too long for the Sandbox, or anything similarly large. As Domas pointed out to me, the change requires seeking through most of the indexes in their entirety: only a couple (like the primary key, and therefore also the table itself) exhibit any kind of locality with respect to rev_page.
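To make the contrast concrete, the two shapes of query look roughly like this (illustrative MySQL against a trimmed-down schema; the real archive and revision tables have more required columns, and rev_deleted is actually a bitfield, but the structure is the same):

    -- Current scheme: copy every revision of the page into archive,
    -- then remove the originals (N rows written plus N rows deleted).
    INSERT INTO archive (ar_page_id, ar_rev_id, ar_text_id, ar_timestamp)
      SELECT rev_page, rev_id, rev_text_id, rev_timestamp
      FROM revision
      WHERE rev_page = 12345;          -- example page_id
    DELETE FROM revision WHERE rev_page = 12345;

    -- rev_deleted scheme: flag the same rows in place. Only N rows are
    -- touched and nothing is copied, but it is still O(N) work on a
    -- page with N revisions.
    UPDATE revision SET rev_deleted = 1 WHERE rev_page = 12345;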
> - Delete the page table entry immediately (making the page and its
> revisions invisible), and schedule moving/rev_deleting the revisions in the job queue. This will severely reduce the load of a delete request, but will delay the old revisions showing up in the undelete pool (the "undelete N deleted revisions?" link), making it hard/impossible to undelete a page shortly after deleting it. A solution could be to move/rev_delete the first revision immediately (i.e. right after deleting the page table entry) as well, so at least the most recent revision can be undeleted.
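For what it's worth, the immediate part of such a request might look something like this (just a sketch, not the actual code: the job command name is made up, and I'm assuming the page row is simply dropped while the newest revision is flagged right away so it shows up in the undelete pool):

    -- Hide the page itself immediately.
    DELETE FROM page WHERE page_id = 12345;

    -- Flag only the most recent revision now, so it can be undeleted
    -- right away; the rest are left to the job queue.
    UPDATE revision SET rev_deleted = 1
      WHERE rev_page = 12345
      ORDER BY rev_timestamp DESC
      LIMIT 1;

    -- Queue a job to deal with the remaining revisions later
    -- ('deleteRevisions' is a hypothetical job command).
    INSERT INTO job (job_cmd, job_namespace, job_title, job_params)
      VALUES ('deleteRevisions', 0, 'Sandbox', 'page_id=12345');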
Here's an O(1) idea that River once suggested for deleting an entire page, or an entire page minus a few recent revisions. Have an extra field in page, say page_first_good or something of that sort. Use it as follows:
* When a new page is created (whether it's actually not in the database, or was just deleted), set it to the rev_id of the new revision.
* If a page is partially undeleted, change the value of page_first_good to the smallest undeleted rev_id, and mark all subsequent deleted revisions with rev_deleted.
* When a page is deleted, set page_first_good to 2^63-1 or whatever the maximum reasonable rev_id is (and make sure no revisions actually get this high!). Call this marker value PAGE_NONEXISTENT, say.
* When looking for pages that exist, add an extra condition page_first_good != PAGE_NONEXISTENT.
* When looking for non-deleted revisions corresponding to a particular non-deleted page, use the additional join condition rev_id >= page_first_good. (This is an extra range scan, but since it should only be used in join conditions I don't *think* it should cause filesorts.)
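In SQL terms it would look roughly like this (illustrative only; the column name and type are guesses, and the marker value would really be whatever the maximum reasonable rev_id is):

    -- The proposed extra field on page.
    ALTER TABLE page
      ADD COLUMN page_first_good BIGINT UNSIGNED NOT NULL DEFAULT 0;

    -- Deleting a whole page becomes a single-row update:
    UPDATE page SET page_first_good = 9223372036854775807  -- PAGE_NONEXISTENT
      WHERE page_id = 12345;

    -- "Does this page exist?"
    SELECT page_id FROM page
      WHERE page_namespace = 0 AND page_title = 'Sandbox'
        AND page_first_good != 9223372036854775807;

    -- "Which revisions of this page are visible?"
    SELECT rev_id, rev_timestamp
      FROM page
      JOIN revision ON rev_page = page_id AND rev_id >= page_first_good
      WHERE page_id = 12345
      ORDER BY rev_timestamp DESC;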
Now, all of these are hopefully a logarithmic-factor slowdown at worst (you'd have to expand some existing indexes to cover the new field), *except* for partial undeletion: the worst case there is deleting the whole page and then undeleting only the first revision, which would be as bad as the rev_deleted scenario. The result is that the common operations of deleting an entire page, or deleting an entire page except for a few recent revisions, are fast.
Of course, this is a rather intricate plan. The simpler, though less robust, idea would be just to run the deletions as a job, as you say. The problem is that the deletion is then no longer synchronous; and you can't run the jobs on demand, because that would bring us back to the present situation as soon as anyone views the history page. This is quite possibly acceptable.
A slight refinement (not as dramatic as the plan I stole from River) might be to mark the page row deleted and totally inaccessible until the jobs are done, except for undeletion or for those with the right to view deleted revisions. This would have the negative side effect of blocking its recreation for quite possibly some time, if it's a large page, but that may be more desirable than allowing it to be accessible but constantly changing as more rows get marked deleted.
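A batched job along those lines might look roughly like this (again only a sketch: the job command name, batch size, and whatever flag marks the page row as "deletion in progress" are all assumptions):

    -- One queued job per deleted page ('finishPageDeletion' is made up).
    INSERT INTO job (job_cmd, job_namespace, job_title, job_params)
      VALUES ('finishPageDeletion', 0, 'Sandbox', 'page_id=12345');

    -- Each job run flags a bounded batch of revisions, so no single
    -- statement has to lock N rows at once; the job re-queues itself
    -- until no unflagged revisions remain, and only then would the page
    -- row be made visible/recreatable again.
    UPDATE revision SET rev_deleted = 1
      WHERE rev_page = 12345 AND rev_deleted = 0
      LIMIT 500;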
In any case, just switching to rev_deleted should be something of an improvement, even if we keep on with the simplistic synchronous way we do deletions now. But it's still not great. Ultimately, doing *anything* to N arbitrary revisions will probably have to be O(N) in the worst case, no matter how clever you are. The best thing you can do is optimize for the common cases, and prevent the bad cases.