On 5/7/07, Gregory Maxwell <gmaxwell@gmail.com> wrote:
> This would be very useful for another use case: sometimes Google will pick up a cached copy of a vandalized page. In order to purge the Google cache you need to make the page 404 (which deletion doesn't do), put the page into a robots.txt deny, or include some directive in the page that stops indexing.
> If we provided some directive to do one of the latter two (ideally the last), we could use it temporarily to purge Google-cached copies of vandalism... so it would even be useful for pages that we normally want to keep indexed.
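For reference, the two non-404 mechanisms Greg describes would look roughly like this (the page path here is just a placeholder):

    # robots.txt: keep all crawlers away from the affected page
    User-agent: *
    Disallow: /wiki/Some_vandalized_page

    <!-- per-page directive: drop the page from the index and from the cached copy -->
    <meta name="robots" content="noindex, noarchive">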
For cases of vandalism, just changing indexing-related metadata on the page won't help: Googlebot has to fetch the page anyway to discover the 404 (or the new directive), so you might as well just serve a reverted/corrected page instead. If someone is going to the trouble to flip a "has been vandalized" bit, they could just as well revert the page to a pre-vandalism state.
You can learn more about getting pages deindexed from Google here: http://www.google.com/support/webmasters/bin/topic.py?topic=8459 There is a form for submitting URL removal requests, but a URL removed that way won't be crawled again for six months. It's intended for emergencies where the content needs to disappear completely, like "I accidentally put all my customers' credit card numbers in a world-readable directory!" I don't think it's useful for this discussion.
I think the best solution for vandalism is to set up a system that allows pages to be marked as needing expedited recrawling. This wouldn't be for every updated page -- just those that someone with sufficient access (an admin?) had explicitly marked. (It'd be best if this information were *not* pushed directly to Google, because ideally every search engine would be able to make use of it.) We're currently looking into ways to let webmasters provide this sort of information. I'll get back to you when there's news.
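To make that concrete, one possible shape for it (purely a sketch on my part, not an announcement of anything) would be a small sitemaps.org-style feed listing only the flagged pages, which any search engine could poll; the URL and date below are made up:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- only pages an admin has flagged for expedited recrawl -->
      <url>
        <loc>http://en.wikipedia.org/wiki/Some_reverted_page</loc>
        <lastmod>2007-05-07</lastmod>
        <priority>1.0</priority>
      </url>
    </urlset>

Any engine that already reads sitemaps could treat an entry in a feed like that as a hint to recrawl soon, without Wikipedia having to push anything to Google specifically.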
(Disclaimer: I'm not an official company spokesman.)