Small wikis no doubt get the most spam. I run several smallish wikis, and removing spam from them is nearly an everyday occurrence. Take, for instance, the Bomis wiki [1]. Scroll down to the bottom and take a look at how hard it is getting hit (that's UseModWiki).
So the idea is to build an interwiki watchlist into the MediaWiki software, hosted by the Foundation. This requires little modification, since we can use the RSS feed the software already generates for RecentChanges. With a simple opt-in during installation, a message would be sent home so the feed can be retrieved every so often.
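To make that concrete, here is a rough Python sketch of the kind of central aggregator I have in mind, purely for illustration: the list of opted-in wikis, the polling interval, and printing the merged items as the "watchlist" are all made up, and only the RecentChanges feed URL pattern follows what MediaWiki already serves.

    # Rough sketch (not a proposed implementation) of a central aggregator
    # that polls each opted-in wiki's RecentChanges RSS feed and merges the
    # items into one interwiki watchlist.
    import time
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    # Hypothetical list of opted-in wikis; in the proposal this would be
    # built from the "message sent home" during installation.
    OPTED_IN_WIKIS = [
        "https://example-wiki-one.org/w/index.php",
        "https://example-wiki-two.org/w/index.php",
    ]

    def fetch_recent_changes(script_url):
        """Fetch one wiki's RecentChanges RSS feed, return (title, link, date) tuples."""
        url = script_url + "?title=Special:RecentChanges&feed=rss"
        with urlopen(url) as response:
            tree = ET.parse(response)
        items = []
        for item in tree.iter("item"):
            items.append((
                item.findtext("title"),
                item.findtext("link"),
                item.findtext("pubDate"),
            ))
        return items

    def poll_all(interval_seconds=600):
        """Retrieve every feed 'every so often' and print a merged watchlist."""
        while True:
            for wiki in OPTED_IN_WIKIS:
                try:
                    for title, link, date in fetch_recent_changes(wiki):
                        print(f"{date}\t{wiki}\t{title}\t{link}")
                except OSError as err:
                    print(f"could not reach {wiki}: {err}")
            time.sleep(interval_seconds)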
What it would be appropriate to do after that, I'm not sure, but there are a lot of ways it could go.
Just wanted to throw this out there and see if it has merit. I think it is a simple way of identifying a lot of the spam around the web.
/Alterego
[1] http://www.bomis.com/cgi-bin/wiki.pl?RecentChanges
On Thu, 2005-17-03 at 20:12 -0700, Brian wrote:
It's probably worth noting that there's already an excellent shared regexp list here:
http://www.emacswiki.org/cw/BannedContent
It's based on experience of Wiki admins and maintainers as well as some other centralized lists (MoinMoin main list, I believe).
Wikitravel uses this list and I'd be happy to package the software that implements the check as an extension if there's interest.
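For anyone wondering what such a check amounts to, here is a rough Python sketch of the general idea, not the actual Wikitravel extension: pull a shared banned-content list (one regexp per line, with comments) and refuse edits whose text matches any entry. The raw-list URL is an assumption for illustration.

    import re
    from urllib.request import urlopen

    BANNED_CONTENT_URL = "http://www.emacswiki.org/cw/BannedContent?action=raw"  # assumed raw view

    def load_banned_patterns(url=BANNED_CONTENT_URL):
        """Load the shared list, skipping blank lines, comments, and bad regexps."""
        with urlopen(url) as response:
            lines = response.read().decode("utf-8", "replace").splitlines()
        patterns = []
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            try:
                patterns.append(re.compile(line, re.IGNORECASE))
            except re.error:
                continue  # skip malformed entries rather than breaking the check
        return patterns

    def is_spam(edit_text, patterns):
        """Return the first matching pattern, or None if the edit looks clean."""
        for pattern in patterns:
            if pattern.search(edit_text):
                return pattern.pattern
        return None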
~Evan
On Fri, 18 Mar 2005 15:20:34 -0500, Evan Prodromou evan@bad.dynu.ca wrote:
> It's probably worth noting that there's already an excellent shared regexp list here:
For the record, Wikimedia also has a blacklist (which only matches inside URLs) at http://meta.wikimedia.org/wiki/Spam_blacklist
Perhaps this too could be combined with others using an auto-update + whitelist, as described at CommunityWiki (an ingenious system, I must say). I note, though, that cw's list is quite large, and there seems to be some legitimate concern about over-general entries (e.g. is *everything* under .uk.net *really* going to be spam?), so it might need heavy weeding before going "live" for something as large as the Wikimedia group of sites.
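As a rough illustration of the "blacklist that only matches inside URLs, plus a local whitelist" combination (Python, with made-up entries; not how the live Wikimedia list is actually wired up), something like this would check only the links in an edit and let local whitelist entries override the shared blacklist:

    import re

    URL_RE = re.compile(r"https?://[^\s\]<>\"]+", re.IGNORECASE)

    def find_blocked_links(edit_text, blacklist, whitelist):
        """Return URLs in the edit that hit the blacklist and are not whitelisted."""
        blocked = []
        for url in URL_RE.findall(edit_text):
            if any(w.search(url) for w in whitelist):
                continue  # a local exception overrides the shared blacklist
            if any(b.search(url) for b in blacklist):
                blocked.append(url)
        return blocked

    # Example entries (illustrative, not from the real lists):
    blacklist = [re.compile(r"cheap-pills\.example", re.IGNORECASE)]
    whitelist = [re.compile(r"en\.wikipedia\.org", re.IGNORECASE)]
    print(find_blocked_links("see http://cheap-pills.example/buy", blacklist, whitelist))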
Moin,
On Friday 18 March 2005 21:20, Evan Prodromou wrote:
> It's probably worth noting that there's already an excellent shared regexp list here:
Technical note:
...
foo.com
foos?.com
foos.com
...

The list contains many redundant regexps: foos?.com will match exactly (and only) foo.com and foos.com, so there is no need to list those again. Likewise, it would probably be a good idea to combine many of the regexps like:

...
foo.com
bar.com
...

into (foo|bar).com - this would reduce the number of matches to be done, and so speed up the matching process. Similarly:
foo.(com|net|org)
instead of
foo.com
foo.net
foo.org

Of course, combined regexps are harder to edit, but matching 10000 small/short regexps takes far longer than matching 100 regexps with 100 alternations each.
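A quick Python illustration of the point (made-up domains), just to show that the combined pattern gives the same answer in a single pass instead of one match attempt per entry:

    import re

    domains = ["foo.com", "bar.com", "baz.net"]

    # Naive approach: one compiled regexp per entry, applied in a loop.
    separate = [re.compile(re.escape(d), re.IGNORECASE) for d in domains]

    # Combined approach: a single pattern with one alternation per entry.
    combined = re.compile("|".join(re.escape(d) for d in domains), re.IGNORECASE)

    text = "spam link to http://bar.com/pills"
    assert any(p.search(text) for p in separate)
    assert combined.search(text)  # same answer, one pass over the text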
Best wishes,
Tels
--
Signed on Sat Mar 19 10:51:37 2005 with key 0x93B84C15.
Visit my photo gallery at http://bloodgate.com/photos/
PGP key on http://bloodgate.com/tels.asc or per email.
"My other computer is your Windows box." -- Dr. Brad (19034) on 2004-08-13 at /.