A couple times a year (such as about an hour ago) somebody does something like trying to delete the Wikipedia:Sandbox on en.wikipedia.org, which reaaalllly bogs down the server due to the large number of revisions.
While there are warnings about this, I'm hacking in some limits which will restrict such deletions to keep the system from falling over accidentally.
At the moment I've set the limit at 5000 revisions (as $wgDeleteRevisionsLimit). The error message right now is generic and there's no override group with 'bigdelete' privilege live, but it should be prettified soon.
(Note -- the revision count is currently done with an index estimate, so it could overestimate on some pages.)
-- brion
Sounds good. Who's getting the 'bigdelete' permission, stewards?
-Gurch
On 17/01/2008, Matthew Britton matthew.britton@btinternet.com wrote:
Sounds good. Who's getting the 'bigdelete' permission, stewards?
-Gurch
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
People have suggested bureaucrats - a poor idea, that.
On Jan 16, 2008 8:18 PM, Majorly axel9891@googlemail.com wrote:
On 17/01/2008, Matthew Britton matthew.britton@btinternet.com wrote:
Sounds good. Who's getting the 'bigdelete' permission, stewards?
People have suggested bureaucrats - a poor idea, that.
Don't give the permission to anybody. A better idea is to fix big deletions so that they don't bork the server when invoked, and then remove the 'bigdelete' permission entirely. Flag a page as being "in dispose" while a background process slowly grinds through the deletions. Ideally, this situation won't come up too frequently.
--Andrew Whitworth
Andrew Whitworth wrote:
Don't give the permission to anybody. A better idea is to fix big deletions so that they don't bork the server when invoked, and then remove the 'bigdelete' permission entirely. Flag a page as being "in dispose" while a background process slowly grinds through the deletions. Ideally, this situation won't come up too frequently.
Wouldn't it be possible to change the way deletions work? Currently, when a page is deleted, the following happens:
- The related entry is deleted from the page table (1 row affected)
- All corresponding rows in the revision table are copied (using INSERT SELECT) to the archive table (N rows affected, fetching of an additional N rows required)
- The revision table rows are then deleted (N rows affected)

This adds up to a total of 2N+1 rows being inserted/deleted and another N rows being selected.
Two improvements could be made:
- Finally start using the rev_deleted field rather than the archive table. This changes the INSERT SELECT to an UPDATE WHERE on the revision table. This only affects N rows rather than 2N, and doesn't require SELECTing any rows.
- Delete the page table entry immediately (making the page and its revisions invisible), and schedule moving/rev_deleting the revisions in the job queue. This will severely reduce the load of a delete request, but will delay the old revisions showing up in the undelete pool (the "undelete N deleted revisions?" link), making it hard/impossible to undelete a page shortly after deleting it. A solution could be to move/rev_delete the first revision immediately (i.e. right after deleting the page table entry) as well, so at least the most recent revision can be undeleted.
Roan Kattouw (Catrope)
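[Editor's note: the two deletion paths Roan describes can be sketched side by side. This is a minimal sqlite3 sketch against a toy version of the schema -- the table and column names echo MediaWiki's, but the code is illustrative, not the actual implementation.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER,
                       rev_text TEXT, rev_deleted INTEGER DEFAULT 0);
CREATE TABLE archive (ar_rev_id INTEGER, ar_page_id INTEGER, ar_text TEXT);
""")

# A page with N = 5 revisions.
cur.execute("INSERT INTO page VALUES (1, 'Sandbox')")
cur.executemany("INSERT INTO revision VALUES (?, 1, ?, 0)",
                [(i, "text %d" % i) for i in range(1, 6)])

def delete_via_archive(page_id):
    """Current scheme: copy N rows to archive, then delete N rows (2N+1 writes)."""
    cur.execute("INSERT INTO archive SELECT rev_id, rev_page, rev_text "
                "FROM revision WHERE rev_page = ?", (page_id,))
    cur.execute("DELETE FROM revision WHERE rev_page = ?", (page_id,))
    cur.execute("DELETE FROM page WHERE page_id = ?", (page_id,))

def delete_via_rev_deleted(page_id):
    """Proposed scheme: one UPDATE touching N rows in place, no copy."""
    cur.execute("UPDATE revision SET rev_deleted = 1 WHERE rev_page = ?",
                (page_id,))
    cur.execute("DELETE FROM page WHERE page_id = ?", (page_id,))

delete_via_rev_deleted(1)
cur.execute("SELECT COUNT(*) FROM revision WHERE rev_deleted = 1")
count = cur.fetchone()[0]
print(count)  # all 5 revisions flagged in place, none copied to archive
```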
- Delete the page table entry immediately (making the page and its revisions invisible), and schedule moving/rev_deleting the revisions in the job queue. This will severely reduce the load of a delete request, but will delay the old revisions showing up in the undelete pool (the "undelete N deleted revisions?" link), making it hard/impossible to undelete a page shortly after deleting it. A solution could be to move/rev_delete the first revision immediately (i.e. right after deleting the page table entry) as well, so at least the most recent revision can be undeleted.
If you were really clever you could find the revisions in the job queue and then undelete them by just removing them from the queue - cancelling the deletion rather than undeleting it, in effect.
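[Editor's note: the cancel-instead-of-undelete idea can be sketched with an in-memory queue. Purely illustrative -- job_queue and undelete_page are made-up names, not MediaWiki's actual job system.]

```python
from collections import deque

# Hypothetical job queue: each entry is (page_id, rev_id) awaiting real deletion.
job_queue = deque([(1, rev) for rev in (10, 11, 12)])

def undelete_page(page_id):
    """Undelete by cancelling still-queued deletion jobs for the page.

    Returns how many queued jobs were cancelled; revisions whose jobs
    already ran would still need a real undelete."""
    cancelled = [job for job in job_queue if job[0] == page_id]
    for job in cancelled:
        job_queue.remove(job)
    return len(cancelled)

n = undelete_page(1)
print(n, len(job_queue))  # all three pending deletions cancelled
```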
On 1/18/08, Roan Kattouw roan.kattouw@home.nl wrote:
Two improvements could be made:
- Finally start using the rev_deleted field rather than the archive table. This changes the INSERT SELECT to an UPDATE WHERE on the revision table. This only affects N rows rather than 2N, and doesn't require SELECTing any rows.
It would still take much too long for the Sandbox, or anything similarly large. As Domas pointed out to me, the change requires seeking through most of the indexes in their entirety: only a couple (like the primary key, and therefore also the table itself) exhibit any kind of locality with respect to rev_page.
- Delete the page table entry immediately (making the page and its revisions invisible), and schedule moving/rev_deleting the revisions in the job queue. This will severely reduce the load of a delete request, but will delay the old revisions showing up in the undelete pool (the "undelete N deleted revisions?" link), making it hard/impossible to undelete a page shortly after deleting it. A solution could be to move/rev_delete the first revision immediately (i.e. right after deleting the page table entry) as well, so at least the most recent revision can be undeleted.
Here's an O(1) idea that River once suggested for deleting an entire page, or an entire page minus a few recent revisions. Have an extra field in page, say page_first_good or something of that sort. Use it as follows:
* When a new page is created (whether it's actually not in the database, or was just deleted), set it to the rev_id of the new revision.
* If a page is partially undeleted, change the value of page_first_good to the smallest undeleted rev_id, and mark all subsequent deleted revisions with rev_deleted.
* When a page is deleted, set page_first_good to 2^63-1 or whatever the maximum reasonable rev_id is (and make sure no revisions actually get this high!). Call this marker value PAGE_NONEXISTENT, say.
* When looking for pages that exist, add an extra condition page_first_good != PAGE_NONEXISTENT.
* When looking for non-deleted revisions corresponding to a particular non-deleted page, use the additional join condition rev_id >= page_first_good. (This is an extra range scan, but since it should only be used in join conditions I don't *think* it should cause filesorts.)
Now, all of these are hopefully a logarithmic-factor slowdown at worst, i.e., you'd have to expand existing indexes, *except* for partial undeletion: the worst case would be deleting the whole page and then undeleting only the first revision, which would be as bad as the rev_deleted scenario. The result is that the common operation of deleting an entire page, or deleting an entire page except for a few recent revisions, is fast.
Of course, this is a rather intricate plan. The simpler, but of course less robust, idea would be to just run the deletions as a job, as you say. The problem is, of course, that the deletion is no longer synchronous; and you can't do the jobs on demand, because that would bring us back to the present situation as soon as anyone views the history page. This is quite possibly acceptable.
A slight refinement (not as dramatic as the plan I stole from River) might be to mark the page row deleted and totally inaccessible until the jobs are done, except for undeletion or those with the right to view deleted revisions. This would have the negative side effect of stopping its recreation for quite possibly some time, if it's a large page, but that may be more desirable than allowing it to be accessible but constantly changing as more rows get marked deleted.
In any case, just switching to rev_deleted should be something of an improvement, even if we keep on with the simplistic synchronous way we do deletions now. But it's still not great. Ultimately, doing *anything* to N arbitrary revisions will probably have to be O(N) in the worst case, no matter how clever you are. The best thing you can do is optimize for the common cases, and prevent the bad cases.
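[Editor's note: the page_first_good scheme can be modelled in a few lines. A sketch under the stated assumptions -- only page_first_good and the PAGE_NONEXISTENT marker come from the proposal; the Page class and method names are invented for illustration.]

```python
PAGE_NONEXISTENT = 2**63 - 1  # marker value: no real rev_id may ever reach this

class Page:
    def __init__(self, first_rev_id):
        # New page: page_first_good points at its first revision.
        self.page_first_good = first_rev_id
        self.revisions = [first_rev_id]  # rev_ids, ascending

    def exists(self):
        # Models the condition page_first_good != PAGE_NONEXISTENT.
        return self.page_first_good != PAGE_NONEXISTENT

    def delete(self):
        # O(1): flip one field, touch no revision rows at all.
        self.page_first_good = PAGE_NONEXISTENT

    def undelete_from(self, rev_id):
        # Partial undelete: everything from rev_id onward becomes visible.
        self.page_first_good = rev_id

    def visible_revisions(self):
        # Models the join condition rev_id >= page_first_good.
        return [r for r in self.revisions if r >= self.page_first_good]

p = Page(100)
p.revisions += [101, 102, 103]
p.delete()
assert not p.exists() and p.visible_revisions() == []
p.undelete_from(102)
print(p.visible_revisions())  # [102, 103]
```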
On 1/16/08, Matthew Britton matthew.britton@btinternet.com wrote:
Sounds good. Who's getting the 'bigdelete' permission, stewards?
As he said, no group currently has the privileges. There's probably no point in giving it to anyone: if the page is too big, it's too big, you can't avoid locking the database for minutes. In that case you should either work around it (why would you need to delete so many revisions?) or get someone with database access to do it for you in some fashion.
Would it make sense to identify pages with "too many" revisions and have them archived and protected by the individual wikis? I just wonder what happens if, for example, private data that should be deleted is published on such a page. Nobody is able to delete the revision without slowing down the whole server. How many pages are there that are that large?
On 19/01/2008, Hei Ber hei.ber.wikimedia@googlemail.com wrote:
For now, you can just use oversight.
On 1/17/08, Brion Vibber brion@wikimedia.org wrote:
A couple times a year (such as about an hour ago) somebody does something like trying to delete the Wikipedia:Sandbox on en.wikipedia.org, which reaaalllly bogs down the server due to the large number of revisions.
This might not quite be relevant, but would it perhaps be possible to stop people saving to the sandbox? Since the advent of Preview, I don't really ever see a need to actually commit any changes there. Or perhaps old revisions (10 or more ago) could continually be deleted anyway. The old revisions aren't useful or interesting.
Steve
On Jan 20, 2008 8:18 PM, Steve Bennett stevagewp@gmail.com wrote:
This might not quite be relevant, but would it perhaps be possible to stop people saving to the sandbox? Since the advent of Preview, I don't really ever see a need to actually commit any changes there. Or perhaps old revisions (10 or more ago) could continually be deleted anyway. The old revisions aren't useful or interesting.
There is some merit to this idea. Consider a new type of "temporary page", which simply doesn't store revision histories, or only stores a very limited number of them. Pages like the sandbox, or discussion pages which are archived by copy+paste to other pages don't need to be storing extensive revision histories. Not storing revision histories would make deletion operations trivial, and would also save database space on storing revisions which are simply never viewed.
Since such pages will be relatively rare, and because not storing revisions has implications for licensing, the ability to mark them could be reserved to bureaucrats or others high up the permissions hierarchy.
--Andrew Whitworth
Andrew Whitworth wrote:
This thread then focused on the reasons why you need to be able to save the sandbox. However, the original point was making Sandbox not store the full history, something that I have also thought about before.
Some points though:
- Needs to be a LocalSettings option, not for bureaucrats but for sysadmins.
- You don't want pages to be moved there.
- You lose user contributions (to tell if user X has edited).
Steve Bennett wrote:
Since the advent of Preview, I don't really ever see a need to actually commit any changes there.
There is at least one thing that Preview doesn't do that is done on a save - the display of [edit] links. This can be annoying if you're trying to fix the problem of edit links wandering around due to images being poorly placed. I don't know if there are other differences. I'd suggest holding off the elimination of a save option for Sandbox until these are changed.
Also - Sandbox is not a special page - it is just an arbitrarily-named page. On some wikis, this is given another name (possibly relevant to the users of the wiki as a play area). Unless there is a way to designate the page that Wikipedia calls "Sandbox", the code will force all wikis to use that name. Perhaps we need an alias like mainpage.
Mike
On Jan 21, 2008 3:29 PM, Michael Daly michael_daly@kayakwiki.org wrote:
Steve Bennett wrote:
Since the advent of Preview, I don't really ever see a need to actually commit any changes there.
There is at least one thing that Preview doesn't do that is done on a save - the display of [edit] links.
You also can't test displaytitle (or the {{lowercase}} template) on preview. That only works after saving.
Angela
Also, sandbox (or a Template sandbox) can be used when trying to fix a bug in a template.
On 21/01/2008, Angela beesley@gmail.com wrote:
You also can't test displaytitle (or the {{lowercase}} template) on preview. That only works after saving.
Or anything using subst:.
Only keeping the last 50 (or, better yet, a customisable number of) edits would be fine, I imagine - there could be a special page where you designate certain pages as temporary (I would default it to requiring sysop rights, since it's basically a lesser but automated form of delete).
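[Editor's note: the keep-only-the-last-N behaviour proposed here can be sketched as follows. TEMPORARY_PAGES and save_revision are hypothetical names for illustration, not anything that exists in MediaWiki.]

```python
# Sketch of the "temporary page" idea: on each save, prune the history
# down to the newest N revisions, so deletion never has much to do.
TEMPORARY_PAGES = {"Wikipedia:Sandbox": 50}  # title -> revisions to keep

def save_revision(history, title, new_rev):
    """Append new_rev to history; if title is temporary, drop old revisions."""
    history.append(new_rev)
    keep = TEMPORARY_PAGES.get(title)
    if keep is not None and len(history) > keep:
        del history[:-keep]  # history stays bounded, deletion stays cheap
    return history

h = []
for rev in range(60):
    save_revision(h, "Wikipedia:Sandbox", rev)
print(len(h), h[0])  # 50 10 -- only the newest 50 of 60 saves survive
```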