Small wikis no doubt get the most spam. I run several smallish wikis, and removing spam from them is nearly an everyday occurrence. Take, for instance, the Bomis wiki [1]. Scroll down to the bottom and take a look at how hard it is getting hit (that's UseModWiki).
So the idea is to build an interwiki watchlist into the MediaWiki software, hosted by the Foundation. This requires little modification, since we can use the RSS feed the software already generates for RecentChanges. With a simple opt-in during installation, a message would be sent home so the feed can be retrieved every so often.
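To make that concrete, here is a rough Python sketch of the kind of central aggregator I have in mind, purely for illustration: the list of opted-in wikis, the polling interval, and printing the merged items as the "watchlist" are all made up, and only the RecentChanges feed URL pattern follows what MediaWiki already serves.

    # Rough sketch (not a proposed implementation) of a central aggregator
    # that polls each opted-in wiki's RecentChanges RSS feed and merges the
    # items into one interwiki watchlist.
    import time
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    # Hypothetical list of opted-in wikis; in the proposal this would be
    # built from the "message sent home" during installation.
    OPTED_IN_WIKIS = [
        "https://example-wiki-one.org/w/index.php",
        "https://example-wiki-two.org/w/index.php",
    ]

    def fetch_recent_changes(script_url):
        """Fetch one wiki's RecentChanges RSS feed, return (title, link, date) tuples."""
        url = script_url + "?title=Special:RecentChanges&feed=rss"
        with urlopen(url) as response:
            tree = ET.parse(response)
        items = []
        for item in tree.iter("item"):
            items.append((
                item.findtext("title"),
                item.findtext("link"),
                item.findtext("pubDate"),
            ))
        return items

    def poll_all(interval_seconds=600):
        """Retrieve every feed 'every so often' and print a merged watchlist."""
        while True:
            for wiki in OPTED_IN_WIKIS:
                try:
                    for title, link, date in fetch_recent_changes(wiki):
                        print(f"{date}\t{wiki}\t{title}\t{link}")
                except OSError as err:
                    print(f"could not reach {wiki}: {err}")
            time.sleep(interval_seconds)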
What it would be appropriate to do after that, I'm not sure, but there are a lot of ways it could go.
Just wanted to throw this out there and see if it has merit. I think it is a simple way of identifying a lot of the spam around the web.
/Alterego
[1] http://www.bomis.com/cgi-bin/wiki.pl?RecentChanges
On Thu, 2005-17-03 at 20:12 -0700, Brian wrote:
It's probably worth noting that there's already an excellent shared regexp list here:
http://www.emacswiki.org/cw/BannedContent
It's based on experience of Wiki admins and maintainers as well as some other centralized lists (MoinMoin main list, I believe).
Wikitravel uses this list and I'd be happy to package the software that implements the check as an extension if there's interest.
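For anyone wondering what such a check amounts to, here is a rough Python sketch of the general idea, not the actual Wikitravel extension: pull a shared banned-content list (one regexp per line, with comments) and refuse edits whose text matches any entry. The raw-list URL is an assumption for illustration.

    import re
    from urllib.request import urlopen

    BANNED_CONTENT_URL = "http://www.emacswiki.org/cw/BannedContent?action=raw"  # assumed raw view

    def load_banned_patterns(url=BANNED_CONTENT_URL):
        """Load the shared list, skipping blank lines, comments, and bad regexps."""
        with urlopen(url) as response:
            lines = response.read().decode("utf-8", "replace").splitlines()
        patterns = []
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            try:
                patterns.append(re.compile(line, re.IGNORECASE))
            except re.error:
                continue  # skip malformed entries rather than breaking the check
        return patterns

    def is_spam(edit_text, patterns):
        """Return the first matching pattern, or None if the edit looks clean."""
        for pattern in patterns:
            if pattern.search(edit_text):
                return pattern.pattern
        return None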
~Evan
On Fri, 18 Mar 2005 15:20:34 -0500, Evan Prodromou evan@bad.dynu.ca wrote:
> It's probably worth noting that there's already an excellent shared regexp list here:
For the record, Wikimedia also has a blacklist (which only matches inside URLs) at http://meta.wikimedia.org/wiki/Spam_blacklist
Perhaps this too could be combined with others using an auto-update + whitelist, as described at CommunityWiki (an ingenious system, I must say). I note, though, that cw's list is quite large, and there seems to be some legitimate concern about over-general entries (e.g. is *everything* under .uk.net *really* going to be spam?), so it might need heavy weeding before going "live" for something as large as the Wikimedia group of sites.
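As a rough illustration of the "blacklist that only matches inside URLs, plus a local whitelist" combination (Python, with made-up entries; not how the live Wikimedia list is actually wired up), something like this would check only the links in an edit and let local whitelist entries override the shared blacklist:

    import re

    URL_RE = re.compile(r"https?://[^\s\]<>\"]+", re.IGNORECASE)

    def find_blocked_links(edit_text, blacklist, whitelist):
        """Return URLs in the edit that hit the blacklist and are not whitelisted."""
        blocked = []
        for url in URL_RE.findall(edit_text):
            if any(w.search(url) for w in whitelist):
                continue  # a local exception overrides the shared blacklist
            if any(b.search(url) for b in blacklist):
                blocked.append(url)
        return blocked

    # Example entries (illustrative, not from the real lists):
    blacklist = [re.compile(r"cheap-pills\.example", re.IGNORECASE)]
    whitelist = [re.compile(r"en\.wikipedia\.org", re.IGNORECASE)]
    print(find_blocked_links("see http://cheap-pills.example/buy", blacklist, whitelist))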
Moin,
On Friday 18 March 2005 21:20, Evan Prodromou wrote:
> It's probably worth noting that there's already an excellent shared regexp list here:
Technical note:
...
foo.com
foos?.com
foos.com
...

The list contains many redundant regexps: foos?.com will match exactly (and only) foo.com and foos.com, so there is no need to list those again. Likewise, it would probably be a good idea to combine many of the regexps like:

...
foo.com
bar.com
...

into (foo|bar).com - this would reduce the number of matches to be done, and so speed up the matching process. Similarly:
foo.(com|net|org)
instead of
foo.com
foo.net
foo.org

Of course, combined regexps are harder to edit, but matching 10000 small/short regexps takes far longer than matching 100 regexps with 100 alternations each.
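A quick Python illustration of the point (made-up domains), just to show that the combined pattern gives the same answer in a single pass instead of one match attempt per entry:

    import re

    domains = ["foo.com", "bar.com", "baz.net"]

    # Naive approach: one compiled regexp per entry, applied in a loop.
    separate = [re.compile(re.escape(d), re.IGNORECASE) for d in domains]

    # Combined approach: a single pattern with one alternation per entry.
    combined = re.compile("|".join(re.escape(d) for d in domains), re.IGNORECASE)

    text = "spam link to http://bar.com/pills"
    assert any(p.search(text) for p in separate)
    assert combined.search(text)  # same answer, one pass over the text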
Best wishes,
Tels
--
Signed on Sat Mar 19 10:51:37 2005 with key 0x93B84C15.
Visit my photo gallery at http://bloodgate.com/photos/
PGP key on http://bloodgate.com/tels.asc or per email.
"My other computer is your Windows box." -- Dr. Brad (19034) on 2004-08-13 at /.