Nevertheless, the vast majority of spam on inactive _Wikipedias_ is from users who aren't logged in.
But only because we don't make them log in, not because it's hard to do so. It's far more of a deterrent to genuine editors than to spam bots.
Unfortunately, that's probably true.
Blocking "that-medicine-that-starts-with-c" will prevent anyone writing about socialism (which was rather a problem for socialism.wikicities.com) :)
See-eye-a-ell-eye-ess is related to _socialism_??
At the risk of ending up in everyone's spam bins, I'll spell it out: "so...Cialis...m". Blocking the word blocks any words that contain it.
Ohh. Duh. But, surely, it would take only a few lines of code to add a feature so that it only blocked the _whole word_?
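For what it's worth, the whole-word idea really is only a few lines. The following is a minimal sketch, not how any existing filter actually works: the function names and the one-word blocklist are made up for illustration, and it relies on the regex word-boundary anchor `\b` so that a blocked word is only matched when it stands alone.

```python
import re

def build_blocklist_pattern(words):
    """Compile a case-insensitive pattern matching any blocked word as a whole word only."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

# Hypothetical blocklist with a single entry, for illustration only.
BLOCKED = build_blocklist_pattern(["cialis"])

def is_blocked(text):
    """Return True if the text contains a blocked word standing on its own."""
    return BLOCKED.search(text) is not None

print(is_blocked("An article about socialism"))  # False: the word is embedded in a longer one
print(is_blocked("Ask a specialist"))            # False: same reason
print(is_blocked("Get free Cialis now"))         # True: whole-word match
```

This would not catch deliberately mangled spellings (asterisks, odd spacing), so it trades some recall for not blocking legitimate words.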
I mean, does anybody get spam e-mails that say "Free socialism! Click here now"... or even "Get free soCialiSm! cl**k here no*" or anything like that? I don't think spammers are sophisticated enough to realise that there are legitimate words that contain spam-filter'd words.
Also, there is the occurrence of _phrases_: "free (name of product or medicine)" is significantly more likely to be spam than even "(name of product or medicine) is a". If you add "get" before the "free", that is even more likely (exponentially?) to be spam. Add a "now" afterwards, and more likely. Add "by" after that, or "for"... For the _second_ one, add the word "nat**al". Then add "h*rb*l"... then "s*p*l*m't", then "for", then "m*le", then that word that you know oh-so-well comes next due to the extreme odds!
Of course, anything that filtered on something as complex as this would be very, very complex programming.
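A full version would indeed be complicated, but a crude version of the phrase idea can be sketched briefly. Everything here is an assumption for illustration: the function name `phrase_score`, the placeholder product name, and especially the weights, which are invented and would need tuning against real spam. It scores a product name higher when "free" comes right before it, higher again when "get" precedes the "free", and higher still when "now" follows.

```python
import re

def phrase_score(text, products):
    """Return the highest suspicion score of any product mention in the text.

    The weights below are arbitrary illustrative values, not measured ones.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    best = 0.0
    for i, tok in enumerate(tokens):
        if tok not in products:
            continue
        score = 1.0  # the product name on its own
        if i >= 1 and tokens[i - 1] == "free":
            score += 2.0  # "free <product>"
            if i >= 2 and tokens[i - 2] == "get":
                score += 2.0  # "get free <product>"
        if i + 1 < len(tokens) and tokens[i + 1] == "now":
            score += 1.0  # "<product> now"
        best = max(best, score)
    return best

products = {"widgetol"}  # hypothetical product name
print(phrase_score("Widgetol is a medicine", products))  # 1.0
print(phrase_score("Get free Widgetol now", products))   # 6.0
```

An edit would then be flagged only above some threshold, so that an article which merely mentions the product scores low while the sales-pitch phrasing scores high.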
Perhaps instead, somebody could adapt a Free numerical rating system for spam e-mails (which gives "likelihoods" that e-mails are spam) -- Google may or may not be willing to help out there given how massive their database must be and their commitment to Goodness on the Internet, but if not there would be another project I'm sure.
From that, some things could be adapted. For example, the "from", "to", and "cc" lines aren't present, and neither is the subject. HTML codes would have to have aliases using WikiCode. Things which might be "automatic kill" for a spam killer would, in many instances, have to be significantly downgraded, at least for the English Wikipedia (for example, the-medicine-that-starts-with-c is a legitimate topic, but in very limited contexts). Talk pages would have to give a certain degree of slack. The greater the length of a page, the more times its title should occur within it, or *related* terms (i.e., links to articles which link back to it). So, to a certain extent, "subject" and "article title" would correspond, although the length-title ratio would be significantly different.
Certain IPs would be greylisted based on the relative frequency of spam from them. In fact, every IP range would be assigned a %age based on existing data. If 90% of the content from an IP range is spam, the system might notice if any subranges or particular IPs had a significantly lower frequency, and if they did, semiwhitelist them (i.e., "good" percentage points). An IP range with 90% of submissions legitimate, on the other hand, would have "good" points. If there were any particular subranges or IPs with a significantly higher percentage of spam, they would be semiblacklisted ("bad" percentage points, or fewer "good" percentage points, depending on the exact frequency).
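The range-versus-subrange comparison could be sketched roughly as follows. This is only an illustration under stated assumptions: the /16 and /24 prefix lengths, the 0.3 margin, the function names, and the sample addresses are all invented, and the input is assumed to be a list of (IP address, was-it-spam) pairs taken from past edits.

```python
import ipaddress
from collections import defaultdict

def spam_rates(edits, prefix):
    """Map each network of the given prefix length to its fraction of spam edits."""
    counts = defaultdict(lambda: [0, 0])  # network -> [spam count, total count]
    for ip, is_spam in edits:
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[net][0] += int(is_spam)
        counts[net][1] += 1
    return {net: spam / total for net, (spam, total) in counts.items()}

def adjustments(edits, wide=16, narrow=24, margin=0.3):
    """Flag subranges whose spam rate differs from their parent range by at least `margin`."""
    wide_rates = spam_rates(edits, wide)
    result = {}
    for net, rate in spam_rates(edits, narrow).items():
        diff = rate - wide_rates[net.supernet(new_prefix=wide)]
        if diff <= -margin:
            result[net] = "semi-whitelist"  # much cleaner than the surrounding range
        elif diff >= margin:
            result[net] = "semi-blacklist"  # much spammier than the surrounding range
    return result

# Invented history: a range that is 90% spam, with one well-behaved subrange.
history = [("10.0.1.%d" % i, True) for i in range(1, 19)]
history += [("10.0.2.5", False), ("10.0.2.6", False)]
print(adjustments(history))
```

A real version would presumably also require a minimum number of edits before trusting a subrange's percentage, since two clean edits are thin evidence.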
I'm going into too much detail here, and obviously it would be a massive undertaking, but given the massive amount of work it would save, it's not the sort of pipe dream that I feel guilty bringing up in front of people who could actually bring it to fruition (I know I couldn't without learning a programming language first -- right now, I have very rusty Qbasic, medium-to-advanced HTML, a bit of UNL, but nothing else, and the latter two aren't exactly programming languages).
If I were to receive an e-mail every two hours with a list of suspicious edits, I could revert them immediately as necessary.
I'd also be more likely to check edits sent to me by email. Perhaps http://meta.wikimedia.org/wiki/EmailNotification could be adapted. Currently, I don't think it will send diffs, and there's no way of filtering for "suspicious" edits.
Ahh, but there are already three-halves party applications (meaning, by Wikipedians, but not software integrated to MeW) which monitor for "suspicious" edits. Nothing complex, but helpful nonetheless in filtering out The Good Edits to give only the bad ones, based on a few very basic observations, as well as feedback.
Mark