I've just completed the initial version of a "Combating spam" page on MediaWiki.org, which can be found at http://www.mediawiki.org/wiki/Manual:Combating_spam. I'm hoping that with the right incoming links, it will become an introduction to dealing with the issue for new wiki operators, since questions about anti-spam measures are among the most common we receive.
I'm throwing the URL out here both to gather feedback, and to make it known to the user base at large.
Rob Church
Rob Church wrote:
I'm throwing the URL out here both to gather feedback, and to make it known to the user base at large.
What about the use of "noindex" to prevent the indexing of old versions of pages? I read about this on (a somewhat out-of-date) chongqed.org page, but I don't think I understand it fully. There is also a reference to using robots.txt to prevent this, but that's even less clear (links on the page appear to be broken).
If we zap a page that's been spammed but the search engines still index it, spammers can identify the wiki as one carrying old spam and will re-spam it.
Apparently, they can use the "recent changes" pages too. I find that the search engines are frequently accessing recent changes, but I'm not sure how to stop that.
Mike
On 01/05/07, Michael Daly mikedaly@magma.ca wrote:
What about the use of "noindex" to prevent the indexing of old versions of pages? I read about this on (a somewhat out-of-date) chongqed.org
We do this, as far as I'm aware.
Apparently, they can use the "recent changes" pages too. I find that the search engines are frequently accessing recent changes, but I'm not sure how to stop that.
Special pages should all be emitting appropriate <meta> tags with "noindex,nofollow" set, so search engines *oughtn't* to be indexing or following links from these.
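For reference, the tag in question is the standard robots meta tag; on those pages the emitted markup should look something like this (a sketch of the conventional form, not copied verbatim from the MediaWiki source):

```html
<!-- Emitted in the <head> of special pages and old revisions;
     tells compliant crawlers not to index the page
     or follow any of its links. -->
<meta name="robots" content="noindex,nofollow" />
```

Of course, this only helps against crawlers that honour the robots conventions in the first place.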
Rob Church
On Tuesday 01 May 2007 08:07:13 Rob Church wrote:
On 01/05/07, Michael Daly mikedaly@magma.ca wrote:
What about the use of "noindex" to prevent the indexing of old versions of pages? I read about this on (a somewhat out-of-date) chongqed.org
We do this, as far as I'm aware.
Apparently, they can use the "recent changes" pages too. I find that the search engines are frequently accessing recent changes, but I'm not sure how to stop that.
Special pages should all be emitting appropriate <meta> tags with "noindex,nofollow" set, so search engines *oughtn't* to be indexing or following links from these.
Yeah, but they will still grab them, causing traffic. And then there are the bots that don't obey robots.txt or "noindex, nofollow"...
I came up with this:
User-agent: BecomeBot
User-agent: gonzo
User-agent: NPBot
User-agent: TMCrawler
Disallow: /

User-agent: googlebot
Crawl-delay: 30
Disallow: /wiki/index.php?title=Special:
Disallow: /wiki/index.php?title=Internal:
Disallow: /wiki/index.php?title=MediaWiki:
...

User-agent: *
Crawl-delay: 120
Disallow: /wiki/
...
This forbids MSN and Yahoo the wiki completely, since the three big search engines together caused about 90% of the traffic to my small wiki, crawling through every old page revision (via Special:Recentchanges) etc.
If you have a smaller wiki, teergrubing (tar-pitting) certain user-agents (like "Java", "larbot", "-" etc.) might also make a lot of sense. See http://bloodgate.com/drowns/example for the effect this has :)
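If you'd rather refuse those user-agents outright than tar-pit them, a simple server-side rule works too. A sketch using Apache's mod_rewrite (the agent names here are just the examples from above; adjust the patterns to whatever is hammering your logs):

```apache
# Reject requests from known-abusive or anonymous user-agents
# with 403 Forbidden before they ever reach the wiki.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Java   [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]
```

Unlike robots.txt, this doesn't depend on the bot cooperating, though determined spambots can of course fake their user-agent string.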
All the best,
Tels
--
Signed on Tue May 1 10:23:37 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.
"A witty saying proves nothing."
-- Voltaire
mediawiki-l@lists.wikimedia.org