On 10/16/10 8:40 PM, Fred Bauder wrote:
The linked blog post laments the lag between the removal of vandalism on
Wikipedia and its removal in Google's indices and cached data.
There is a way to mitigate that problem -- there are protocols to let
Google know about recently changed pages. I'm assuming that we have no
arrangement in place already for them to crawl recent changes for all of
Wikipedia?
In any case, the more interesting goal is not so much to mitigate
vandalism, but to increase the coverage and timeliness of the whole
collection.
Anyway, the standard way to do this is Sitemaps:
http://www.sitemaps.org/
As the name suggests, Sitemaps were originally intended as hints about
site structure, but search engines like Google now use them as a sort of
feed of recently changed pages.
http://www.sitemaps.org/faq.php#faq_submitting_changes
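To make that concrete, here is a rough sketch of what generating such a
file could look like. It is only an illustration: the function name
build_sitemap and the input shape (a list of URL and timestamp pairs) are
my own assumptions, not anything we have today. The namespace and the
urlset/url/loc/lastmod elements come from the sitemaps.org protocol.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(changes):
    """Return a Sitemaps XML string for (url, changed_at) pairs.

    `changes` is a hypothetical input shape: an iterable of
    (absolute URL, timezone-aware datetime) tuples.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, changed_at in changes:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # <lastmod> uses W3C Datetime; normalize to UTC with a "Z" suffix.
        stamp = changed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        ET.SubElement(entry, "lastmod").text = stamp
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    example = [
        ("http://en.wikipedia.org/wiki/Example",
         datetime(2010, 10, 16, 20, 40, tzinfo=timezone.utc)),
    ]
    print(build_sitemap(example))
```

The lastmod value is what lets a crawler treat the file as a change feed
rather than a static map of the site.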
They don't accept something sensible like RSS or XMPP even from other
top 50 websites, unless you happen to be Six Apart or Twitter. Still, we
could ask, since Daniel Kinzler has a working demo of recent changes via
XMPP.
Alternatively, we could either transform the XMPP stream into a
Sitemaps-compatible structure or generate both kinds of output at the
same time. I assume (famous last words) that the really heavy lifting is
already done, since we have a recent changes feature.
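One step such a transformation would probably need is collapsing the
stream down to the latest change per page, since a sitemap lists each URL
once. A minimal sketch, assuming each event arrives as a dict with
"title" and "timestamp" keys (that event shape and the name
latest_changes are hypothetical; I have not checked what the XMPP demo
actually emits):

```python
from datetime import datetime


def latest_changes(events):
    """Map each page title to the time of its most recent change.

    `events` is assumed to be an iterable of dicts such as
    {"title": "Example", "timestamp": "2010-10-16T20:40:00Z"}.
    This shape is an assumption for illustration only.
    """
    latest = {}
    for event in events:
        title = event["title"]
        changed_at = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        # Keep only the newest timestamp seen for each title.
        if title not in latest or changed_at > latest[title]:
            latest[title] = changed_at
    return latest


if __name__ == "__main__":
    stream = [
        {"title": "Example", "timestamp": "2010-10-16T20:40:00Z"},
        {"title": "Other", "timestamp": "2010-10-16T20:41:00Z"},
        {"title": "Example", "timestamp": "2010-10-16T20:45:00Z"},
    ]
    for title, changed_at in latest_changes(stream).items():
        print(title, changed_at.isoformat())
```

The resulting mapping has one entry per page, which is the shape a
Sitemaps file wants.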
I don't know whether I can commit any time to this (I'm still busy
with other stuff for the next two months at least), but I happen to know
a lot about it from an aborted project at another employer, and I have
always wanted to actually use that knowledge.
--
Neil Kandalgaonkar |) <neilk(a)wikimedia.org>