* Neil Kandalgaonkar <neilk(a)wikimedia.org> [Sat, 16 Oct 2010 22:20:50 -0700]:
On 10/16/10 8:40 PM, Fred Bauder wrote:
http://mastersofmedia.hum.uva.nl/2010/10/16/wikipedia-we-have-a-google-refr…
The linked blog post laments the lag between the removal of vandalism on Wikipedia and its removal from Google's index and cached copies.
I am completely disconnected from Wikipedia - I only use MediaWiki for small projects. However, hasn't FlaggedRevs been deployed at Wikipedia for some time already? If so, how can Google (which should index pages as an anonymous user) receive vandalized pages instead of approved revisions? Only registered users should see vandalism by default.
There is a way to mitigate that problem -- there are protocols to let Google know about recently changed pages. I'm assuming that we have no arrangement in place already for them to crawl recent changes for all of Wikipedia?
In any case, the more interesting goal is not so much to mitigate
vandalism, but to increase the coverage and timeliness of the whole
collection.
Anyway, the standard way to do this is Sitemaps:
http://www.sitemaps.org/
As the name suggests, "Sitemaps" were originally intended as hints about site structure, but search engines like Google now use them as a sort of feed of recently changed pages.
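To make the "feed of changed pages" idea concrete, here is a minimal sketch (the URL and timestamp are made up) of building a sitemaps.org-style urlset where the <lastmod> element is what lets a crawler pick out recently changed pages:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    # pages: iterable of (url, last_modified_datetime) pairs.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # <lastmod> is what lets a crawler treat the map as a change feed.
        ET.SubElement(url, "lastmod").text = lastmod.strftime("%Y-%m-%dT%H:%M:%SZ")
    return ET.tostring(urlset, encoding="unicode")
```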
Sitemaps tend to grow huge; however, MediaWiki has a sitemap generator (in /maintenance). I use it on small wikis to improve coverage. Currently it is non-incremental, though, so for such a huge wiki it would take a lot of time to produce the maps.
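For reference, the generator meant above is maintenance/generateSitemap.php; a typical invocation looks roughly like the following (flags from memory -- check --help on your MediaWiki version):

```shell
# Regenerates the full sitemap set from scratch on every run --
# hence the cost on a huge wiki.
php maintenance/generateSitemap.php \
    --fspath sitemap/ \
    --server http://example.org
```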
http://www.sitemaps.org/faq.php#faq_submitting_changes
They don't accept something sensible like RSS or XMPP even from other top-50 websites, unless you happen to be Six Apart or Twitter. Still, we could ask, since Daniel Kinzler has a working demo of recent changes via XMPP.
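What they do accept, per the FAQ linked above, is a plain HTTP GET "ping" telling the engine that a sitemap changed. A sketch of building such a ping URL (the endpoint shown is Google's documented one at the time; treat it as an assumption):

```python
from urllib.parse import urlencode

# Assumed ping endpoint; Bing and others offered equivalents.
GOOGLE_PING = "http://www.google.com/ping"

def build_ping_url(endpoint, sitemap_url):
    # Per sitemaps.org, the sitemap location must be URL-encoded
    # into a single "sitemap" query parameter.
    return endpoint + "?" + urlencode({"sitemap": sitemap_url})
```

An actual notification would then just be an HTTP GET of the returned URL after regenerating the sitemap.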
Yahoo has its own sitemap format, which is an extended version of RSS (MRSS); perhaps Google supports it as well? That would be especially useful for Commons media, and probably for ordinary text pages too.
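For illustration, MRSS is ordinary RSS plus elements in the media: namespace; a hypothetical helper (names and URLs invented) producing one such item for a Commons file might look like:

```python
import xml.etree.ElementTree as ET

MRSS_NS = "http://search.yahoo.com/mrss/"
ET.register_namespace("media", MRSS_NS)

def mrss_item(title, page_url, media_url, mime):
    # One RSS <item> carrying a media:content element, in the
    # style of Yahoo's Media RSS extension.
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = page_url
    ET.SubElement(item, "{%s}content" % MRSS_NS,
                  {"url": media_url, "type": mime})
    return ET.tostring(item, encoding="unicode")
```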
Dmitriy