On 1/18/08, Roan Kattouw <roan.kattouw(a)home.nl> wrote:
> Two improvements could be made:
> - Finally start using the rev_deleted field rather than the archive
> table. This changes the INSERT SELECT to an UPDATE WHERE on the
> revision table. This only affects N rows rather than 2N, and doesn't
> require SELECTing any rows.
It would still take much too long for the Sandbox, or anything
similarly large. As Domas pointed out to me, the change requires
seeking through most of the indexes in their entirety: only a couple
(like the primary key, and therefore also the table itself) exhibit
any kind of locality with respect to rev_page.
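To make the contrast concrete, here is a toy sqlite3 sketch of the two approaches (a grossly simplified, illustrative schema; the real revision and archive tables have many more columns):

```python
# Toy stand-ins for MediaWiki's revision and archive tables, showing why
# the rev_deleted approach writes N rows instead of 2N. Illustrative only.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,
                           rev_page INTEGER,
                           rev_deleted INTEGER DEFAULT 0);
    CREATE TABLE archive  (ar_rev_id INTEGER, ar_page INTEGER);
""")

# Current approach: SELECT N rows into archive, then DELETE them -- 2N row
# writes, plus reading every row back out.
db.executemany("INSERT INTO revision (rev_page) VALUES (?)", [(1,)] * 5)
db.execute("INSERT INTO archive SELECT rev_id, rev_page FROM revision"
           " WHERE rev_page = 1")
db.execute("DELETE FROM revision WHERE rev_page = 1")

# rev_deleted approach: a single in-place UPDATE touching N rows, with no
# rows SELECTed out at all.
db.executemany("INSERT INTO revision (rev_page) VALUES (?)", [(2,)] * 5)
cur = db.execute("UPDATE revision SET rev_deleted = 1 WHERE rev_page = 2")
print(cur.rowcount)  # 5
```

The row counts are the same either way, of course; the saving is in halving the writes and skipping the read, not in asymptotics.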
> - Delete the page table entry immediately (making the page and its
> revisions invisible), and schedule moving/rev_deleting the revisions
> in the job queue. This will severely reduce the load of a delete
> request, but will delay the old revisions showing up in the undelete
> pool (the "undelete N deleted revisions?" link), making it
> hard/impossible to undelete a page shortly after deleting it. A
> solution could be to move/rev_delete the first revision immediately
> (i.e. right after deleting the page table entry) as well, so at least
> the most recent revision can be undeleted.
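In toy form, that suggestion looks something like this (all names hypothetical; the real job queue is the job table, not an in-memory deque):

```python
# Sketch of deferred deletion: hide the page row at once, mark the most
# recent revision deleted immediately so it stays undeletable, and leave
# the O(N) pass over the rest to a queued job. Hypothetical structures.
from collections import deque

jobs = deque()
pages = {1: {"exists": True}}  # stand-in for the page table
revisions = {1: [{"rev_id": i, "deleted": False} for i in range(1, 6)]}

def delete_page(page_id):
    pages[page_id]["exists"] = False            # page row hidden right away
    revisions[page_id][-1]["deleted"] = True    # newest revision marked now,
                                                # so it can be undeleted
    jobs.append(("rev_delete", page_id))        # the other N-1 happen later

def run_job():
    kind, page_id = jobs.popleft()
    for rev in revisions[page_id]:              # the slow O(N) part, off-line
        rev["deleted"] = True

delete_page(1)
# The page is already invisible, but only one revision is in the
# undelete pool so far.
print(pages[1]["exists"], sum(r["deleted"] for r in revisions[1]))  # False 1
run_job()
print(sum(r["deleted"] for r in revisions[1]))  # 5
```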
Here's an O(1) idea that River once suggested for deleting an entire
page, or an entire page minus a few recent revisions. Have an extra
field in page, say page_first_good or something of that sort. Use it
as follows:
* When a new page is created (whether it's genuinely absent from the
database or was previously deleted), set page_first_good to the rev_id
of the new revision.
* If a page is partially undeleted, change the value of
page_first_good to the smallest undeleted rev_id, and mark all
subsequent deleted revisions with rev_deleted.
* When a page is deleted, set page_first_good to 2^63-1 or whatever
the maximum reasonable rev_id is (and make sure no revisions actually
get this high!). Call this marker value PAGE_NONEXISTENT, say.
* When looking for pages that exist, add an extra condition
page_first_good != PAGE_NONEXISTENT.
* When looking for non-deleted revisions corresponding to a particular
non-deleted page, use the additional join condition rev_id >=
page_first_good. (This is an extra range scan, but since it should
only be used in join conditions I don't *think* it should cause
filesorts.)
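A toy sketch of the whole scheme, again with a grossly simplified schema (helper names like create_page and visible_revisions are just for illustration, not proposed code):

```python
# Sketch of the page_first_good idea: deletion of an entire page is a
# one-row UPDATE, and visibility is enforced by a join condition.
import sqlite3

PAGE_NONEXISTENT = 2**63 - 1  # marker: no visible revisions at all

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page     (page_id INTEGER PRIMARY KEY,
                           page_first_good INTEGER);
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,
                           rev_page INTEGER);
""")

def create_page(first_rev_id):
    db.execute("INSERT INTO page (page_first_good) VALUES (?)",
               (first_rev_id,))
    return db.execute("SELECT last_insert_rowid()").fetchone()[0]

def delete_page(page_id):
    # O(1): one row touched, no matter how many revisions the page has.
    db.execute("UPDATE page SET page_first_good = ? WHERE page_id = ?",
               (PAGE_NONEXISTENT, page_id))

def visible_revisions(page_id):
    # The extra join condition rev_id >= page_first_good hides everything
    # before the first good revision.
    return [r[0] for r in db.execute(
        """SELECT rev_id FROM revision
           JOIN page ON rev_page = page_id
           WHERE page_id = ? AND rev_id >= page_first_good
           ORDER BY rev_id""", (page_id,))]

pid = create_page(first_rev_id=1)
for _ in range(5):
    db.execute("INSERT INTO revision (rev_page) VALUES (?)", (pid,))
print(visible_revisions(pid))   # [1, 2, 3, 4, 5]

delete_page(pid)
print(visible_revisions(pid))   # []

# Partial undelete of only the last two revisions: lower the watermark.
db.execute("UPDATE page SET page_first_good = 4 WHERE page_id = ?", (pid,))
print(visible_revisions(pid))   # [4, 5]
```

(The partial-undelete step at the end also needs rev_deleted flags for any still-deleted revisions above the watermark; the sketch omits that.)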
Now, all of these should be at worst a logarithmic-factor slowdown
(existing indexes would have to be expanded to cover the new column),
*except* for partial undeletion: the worst case there is deleting a
whole page and then undeleting only its first revision, which is as
bad as the rev_deleted scenario. The result is that the common
operations -- deleting an entire page, or deleting an entire page
except for a few recent revisions -- are fast.
Of course, this is a rather intricate plan. The simpler, though less
robust, idea would be to just run the deletions as a job, as you say.
The problem is that the deletion is no longer synchronous; and you
can't run the jobs on demand, because that would bring us back to the
present situation as soon as anyone viewed the history page. This is
quite possibly acceptable.
A slight refinement (not as dramatic as the plan I stole from River)
might be to mark the page row deleted and completely inaccessible
until the jobs are done, except for undeletion and for those with the
right to view deleted revisions. This would have the negative side
effect of blocking its recreation for quite possibly some time, if
it's a large page, but that may be more desirable than allowing it to
be accessible but constantly changing as more rows get marked deleted.
In any case, just switching to rev_deleted should be something of an
improvement, even if we keep on with the simplistic synchronous way we
do deletions now. But it's still not great. Ultimately, doing
*anything* to N arbitrary revisions will probably have to be O(N) in
the worst case, no matter how clever you are. The best thing you can
do is optimize for the common cases, and prevent the bad cases.