Hi all,
As most people who are veterans of this list will know very well by now, I am strongly against the "locking" of inactive Wikipedias.
However, due to a recent increase in vandalism on inactive Wikipedias, I have decided to suggest a few more minor technical sanctions:
1) Not allowing people to add links to external URLs to pages without first logging in. If they tried to, the links would be removed when they clicked "submit", and they would get a warning message.
2) Not allowing people to add #REDIRECT ... to pages without first logging in. If they tried to, they'd get a "preview" window and a warning message.
3) Not allowing people to replace, say, 30kb of text with two words, without first logging in. If they tried to, they'd get a warning message and wouldn't be able to submit it.
4) Not allowing ANYBODY to have one or more external URLs as the sole contents of a page, except perhaps stewards or admins.
5) Not allowing anybody to add words like "motherf*cker", "c*nt", "as*hole", "sh*t", that-medicine-that-starts-with-v, that-medicine-that-starts-with-c (including replacing the "a" with an @ sign), "pha ... ceu ... cals", or strings such as "is gay", "is so gay", etc. While allowing such things may be needed on larger Wikipedias, on _inactive_ Wikipedias it would prevent probably 50% of vandalism.
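To make proposals 1) to 3) concrete, here is a rough sketch of what such a check on not-logged-in edits might look like. It is not MediaWiki code: the function name, the warning labels, and the URL pattern are all made up for illustration, and the 30kb / two-word limits are just the figures suggested above.

```python
import re

# Hypothetical patterns; a real filter would need something more careful.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)


def check_anonymous_edit(old_text, new_text, logged_in,
                         min_old_bytes=30_000, max_new_words=2):
    """Return a list of warning labels for an edit; empty means "let it through"."""
    if logged_in:
        return []  # proposals 1-3 only restrict users who are not logged in

    warnings = []

    # 1) An external URL appears that was not on the page before.
    if set(URL_RE.findall(new_text)) - set(URL_RE.findall(old_text)):
        warnings.append("external-link")

    # 2) The page is being turned into a redirect.
    if REDIRECT_RE.match(new_text) and not REDIRECT_RE.match(old_text):
        warnings.append("redirect")

    # 3) A large page is replaced by a couple of words.
    old_size = len(old_text.encode("utf-8"))
    if old_size >= min_old_bytes and len(new_text.split()) <= max_new_words:
        warnings.append("blanking")

    return warnings
```

For example, `check_anonymous_edit("x" * 30000, "haha poop", logged_in=False)` would report a "blanking" warning, while the same edit from a logged-in user would pass.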
Of course, these sanctions would only be applied to inactive Wikipedias, which are a definite list.
People like myself, Chamdarae, and others, notice when inactive Wikipedias become active, and we would be able to ask once every month or two for the newly active WPs to be removed from that list.
Please note that this would NOT have any effect on ACTIVE Wikipedias, including ones as small as Yiddish or Udmurt, or as large as English and German, or anywhere in between. (thus, any WP with perhaps at least 2-3 edits a week, over 500 articles, would not have such things).
I am curious to see the reactions of others on this.
It solves many of the problems caused by inactive WPs:
1) Vandalism, spamming.
2) The first proposed solution to problem #1 is to lock all inactive Wikis against edits. This would prohibit any development unless somebody requested them to be opened, which most people would be too unknowledgeable or shy to do.
3) The second proposed solution to problem #1 is to _monitor_ inactive Wikis. This solution is currently implemented, but some of the basic vandalism and spamming could be prevented so easily by such a technical solution, which would save loads of work for myself, Chamdarae, Angela, Wouter, Mxn, and whoever else checks inactive Wikipedias.
Thoughts?
-- "Take away their language and you are taking away their souls." -- Stalin
- Not allowing people to add links to external URLs to pages without
first logging in.
So, they log in, which doesn't solve anything. 21c3.ccc.de and is-root.de were hit with more of this wikitikitavi spam today, and it all came from logged in users. All this does is prevent people seeing the spammer's IP, making it harder to block them on other languages or other projects, and making it harder to find all the spam from one user when they change usernames more easily than they change IPs.
- Not allowing people to add #REDIRECT ... to pages without first
logging in. If they tried to, they'd get a "preview" window and a warning message.
Although redirects within a wiki are a little harder for non-admins to revert than normal vandalism, I don't think this problem is common enough to warrant more confusion to unregistered users. The lack of ability to move pages already causes problems. Wikimedia already disables cross-wiki and special page redirects, so I don't think we need more restrictions in this area.
- Not allowing people to replace, say, 30kb of text with two words,
without first logging in. If they tried to, they'd get a warning message and wouldn't be able to submit it.
Wouldn't this make it harder to maintain these small wikis? They have no admins, so deleting a page means blanking it or putting {{delete}} on it for the benefit of some future admin. How can anyone clear up junk and copyvios if they can't easily (ie - without logging in) blank a page? Expecting people to log in on small wikis is more of a demand than expecting them to do so on a large one, since there is less benefit. If all you ever want to do is blank one page there, why make an account? And if you have an account, you probably don't use it often enough to stay logged in.
- Not allowing ANYBODY to have one or more external URLs as the sole
contents of a page, except perhaps stewards or admins
That would be so easy to get round that it would be pointless, and perhaps even damaging. If I see a page full of URLs, it's very quick to recognise it as spam. If we force the spammers to be clever and start mixing content with the spam, it's going to be harder to spot, especially if that content is in a language you don't know.
- Not allowing anybody to add words like "motherf*cker", "c*nt",
"as*hole", "sh*t", that-medicine-that-starts-with-v, that-medicine-that-starts-with-c (including replacing the "a" with an @ sign), "pha ... ceu ... cals", or strings such as "is gay", "is so gay", etc. While allowing such things may be needed on larger Wikipedias, on _inactive_ Wikipedias it would prevent probably 50% of vandalism.
I think there is a setting for this in MediaWiki already, but it's only editable by developers. You need to be very careful about what goes in it though. Blocking "that-medicine-that-starts-with-c" will prevent anyone writing about socialism (which was rather a problem for socialism.wikicities.com) :)
If the wikis are really inactive, wouldn't a system where every edit must be approved, or at least time-delayed, make more sense? Another option is to "retire" a wiki. It becomes non-editable, but has a button that allows anyone to relaunch it very quickly (perhaps with a captcha to prevent spam bots), with no knowledge needed, and no bureaucracy about new language edition policies. This could send a warning to the people monitoring inactive wikis.
It solves many of the problems caused by inactive WPs:
- Vandalism, spamming.
Which of those do you find is the more common problem on inactive/small wikis?
- The second proposed solution to problem #1 is to _monitor_ inactive Wikis
The reliance on unrelated third party tools to monitor these makes it much harder. In my experience, feed readers can't be relied on. The ones I've tried do not catch all edits, and will not scale to more than about 300 wikis. This is something that MediaWiki, or a tool made specifically for wiki monitoring, needs to do for itself.
Angela.
- Not allowing people to add links to external URLs to pages without
first logging in.
So, they log in, which doesn't solve anything. 21c3.ccc.de and is-root.de were hit with more of this wikitikitavi spam today, and it all came from logged in users. All this does is prevent people seeing the spammer's IP, making it harder to block them on other languages or other projects, and making it harder to find all the spam from one user when they change usernames more easily than they change IPs.
Nevertheless, the vast majority of spam on inactive _Wikipedias_ is from unloggedin users. To write a spambot that registers and logs in at multiple inactive Wikipedias is more trouble. It won't solve isolated incidents, of course.
But it would hopefully act as a deterrent, removing so many inactive WPs as sitting ducks.
- Not allowing people to add #REDIRECT ... to pages without first
logging in. If they tried to, they'd get a "preview" window and a warning message.
Although redirects within a wiki are a little harder for non-admins to revert than normal vandalism, I don't think this problem is common enough to warrant more confusion to unregistered users. The lack of ability to move pages already causes problems. Wikimedia already disables cross-wiki and special page redirects, so I don't think we need more restrictions in this area.
I guess that's true. The only real problems have been people who are vandalising rather than spamming, such as redirecting [[Main Page]] to, say, [[Dsapiodfpioipoasdf]]. However, people who do this log in more often, and use the pagemove function instead. So far, those have been Afar god (hit one Wiki), Willy on Wheels (hit 3 wikis so far), and a few other less memorable users (Afar god was actually a bit funny).
- Not allowing people to replace, say, 30kb of text with two words,
without first logging in. If they tried to, they'd get a warning message and wouldn't be able to submit it.
Wouldn't this make it harder to maintain these small wikis? They have no admins, so deleting a page means blanking it or putting {{delete}} on it for the benefit of some future admin. How can anyone clear up junk and copyvios if they can't easily (ie - without logging in) blank a page? Expecting people to log in on small wikis is more of a demand than expecting them to do so on a large one, since there is less benefit. If all you ever want to do is blank one page there, why make an account? And if you have an account, you probably don't use it often enough to stay logged in.
Although that's certainly true... hmm... the only possible solution to the issues you raise would be to whitelist certain IP ranges, which although feasible is probably not something that anybody would agree with for whatever reasons.
- Not allowing ANYBODY to have one or more external URLs as the sole
contents of a page, except perhaps stewards or admins
That would be so easy to get round that it would be pointless, and perhaps even damaging. If I see a page full of URLs, it's very quick to recognise it as spam. If we force the spammers to be clever and start mixing content with the spam, it's going to be harder to spot, especially if that content is in a language you don't know.
True...
- Not allowing anybody to add words like "motherf*cker", "c*nt",
"as*hole", "sh*t", that-medicine-that-starts-with-v, that-medicine-that-starts-with-c (including replacing the "a" with an @ sign), "pha ... ceu ... cals", or strings such as "is gay", "is so gay", etc. While allowing such things may be needed on larger Wikipedias, on _inactive_ Wikipedias it would prevent probably 50% of vandalism.
I think there is a setting for this in MediaWiki already, but it's only editable by developers. You need to be very careful about what goes in it though. Blocking "that-medicine-that-starts-with-c" will prevent anyone writing about socialism (which was rather a problem for socialism.wikicities.com) :)
See-eye-a-ell-eye-ess is related to _socialism_?? Not sure that medicine is widely used in the UK, but here it's used as a medicine for... well... that problem that some middle-aged guys have that would set off some spam filters if I referred to it by name. I have gotten plenty of spam about it, and so have Wikis. The pha ... ceu ... cals, however, needs to be added to that list if it does indeed exist -- lots of spam seems to include that word.
If the wikis are really inactive, wouldn't a system where every edit must be approved, or at least time-delayed, make more sense? Another option is to "retire" a wiki. It becomes non-editable, but has a button that allows anyone to relaunch it very quickly (perhaps with a captcha to prevent spam bots), with no knowledge needed, and no bureaucracy about new language edition policies. This could send a warning to the people monitoring inactive wikis.
It solves many of the problems caused by inactive WPs:
- Vandalism, spamming.
Which of those do you find is the more common problem on inactive/small wikis?
Spamming, by far. Spamming is usually done by bots, vandalism is usually done by (presumably) teenagers with nothing better to do who think it's really funny to replace the contents of the Fijian mainpage with "haha poop is a funny word", or move all the pages on the Xhosa Wikipedia to "(subject) is so gay", or whatever.
- The second proposed solution to problem #1 is to _monitor_ inactive Wikis
The reliance on unrelated third party tools to monitor these makes it much harder. In my experience, feed readers can't be relied on. The ones I've tried do not catch all edits, and will not scale to more than about 300 wikis. This is something that MediaWiki, or a tool made specifically for wiki monitoring, needs to do for itself.
This is unfortunately the truth. It would be nice if there were some feature that listed suspicious edits separately, so that I or you or someone else could revert them ASAP, and less-suspicious edits (such as copyvios or pages in the wrong language) could be reverted within about 2 days.
I used to browse the list of inactive Wikis every few hours, but it became so inconvenient that I now do it perhaps once or twice a day. I really don't feel that it is crucial to revert pages in the wrong language or copyvios ASAP, but I do check my e-mail often enough that if I were to receive an e-mail every two hours with a list of suspicious edits, I could revert them immediately as necessary.
Mark
On 11/4/05, Mark Williamson node.ue@gmail.com wrote:
Nevertheless, the vast majority of spam on inactive _Wikipedias_ is from unloggedin users.
But only because we don't make them log in, not because it's hard to do so. It's far more of a deterrent to genuine editors than to spam bots.
Blocking "that-medicine-that-starts-with-c" will prevent anyone writing about socialism (which was rather a problem for socialism.wikicities.com) :)
See-eye-a-ell-eye-ess is related to _socialism_??
At the risk of ending up in everyone's spam bins, I'll spell it out: "so...Cialis...m". Blocking the word blocks any words that contain it.
if I were to receive an e-mail every two hours with a list of suspicious edits, I could revert them immediately as necessary.
I'd also be more likely to check edits sent to me by email. Perhaps http://meta.wikimedia.org/wiki/EmailNotification could be adapted. Currently, I don't think it will send diffs, and there's no way of filtering for "suspicious" edits.
Angela.
Nevertheless, the vast majority of spam on inactive _Wikipedias_ is from unloggedin users.
But only because we don't make them log in, not because it's hard to do so. It's far more of a deterrent to genuine editors than to spam bots.
Unfortunately, that's probably true.
Blocking "that-medicine-that-starts-with-c" will prevent anyone writing about socialism (which was rather a problem for socialism.wikicities.com) :)
See-eye-a-ell-eye-ess is related to _socialism_??
At the risk of ending up in everyone's spam bins, I'll spell it out: "so...Cialis...m". Blocking the word blocks any words that contain it.
Ohh. Duh. But, surely, it would take only a few lines of code to add a feature so that it only blocked the _whole word_?
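A minimal sketch of the difference, in Python rather than whatever MediaWiki actually uses for its filter (the word list and function names here are invented for the example):

```python
import re

# Hypothetical one-entry blocklist, just to reproduce the "socialism" problem.
BLOCKED_WORDS = ["cialis"]


def substring_hit(text):
    """Naive filter: blocks any text that merely contains a blocked word."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_WORDS)


# \b matches at a word boundary, so "cialis" inside "socialism" is ignored.
WHOLE_WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def whole_word_hit(text):
    """Whole-word filter: only blocks the word when it stands on its own."""
    return WHOLE_WORD_RE.search(text) is not None
```

With this, "An article about socialism" trips `substring_hit` but not `whole_word_hit`, while "Buy Cialis now" trips both. It does nothing about deliberately misspelled variants.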
I mean, does anybody get spam e-mails that say "Free socialism! Click here now"... or even "Get free soCialiSm! cl**k here no*" or anything like that? I don't think spammers are sophisticated enough to realise that there are legitimate words that contain spam-filter'd words.
Also, there is the occurrence of _phrases_: "free (name of product or medicine)" is significantly more likely to be spam than even "(name of product or medicine) is a". If you add "get" before the "free", that is even more likely (exponentially?) to be spam. Add a "now" afterwards, and more likely. Add "by" after that, or "for"... For the _second_ one, add the word "nat**al". Then add "h*rb*l"... then "s*p*l*m't", then "for", then "m*le", then that word that you know oh-so-well comes next due to the extreme odds!
Of course, anything that filtered on something as complex as this would be very, very complex programming.
Perhaps instead, somebody could adapt a Free numerical rating system for spam e-mails (which gives "likelihoods" that e-mails are spam) -- Google may or may not be willing to help out there given how massive their database must be and their commitment to Goodness on the Internet, but if not there would be another project I'm sure.
From that, some things could be adapted. For example, the "from",
"to", and "cc" lines aren't present, and neither is the subject. HTML codes would have to have aliases using WikiCode. Things which might be "automatic kill" for a spam killer would, in many instances, have to be significantly downgraded, at least for the English Wikipedia (for example, the-medicine-that-starts-with-c is a legitimate topic, but in very limited contexts). Talk pages would have to give a certain degree of slack. The greater the length of a page, the more times its title should occur within it, or *related* terms (ie, links to articles which link back to it). So, to a certain extent, "subject" and "article title" would correspond, although the length-title ratio would be significantly different.
Certain IPs would be greylisted based on the relative frequency of spam from them. In fact, every IP range would be assigned a %age based on existing data. If 90% of the content from an IP range is spam, the system might notice if any subranges or particular IPs had a significantly lower frequency, and if they did, semiwhitelist them (ie, "good" percentage points). An IP range with 90% of submissions legitimate, on the other hand, would have "good" points. If there were any particular subranges or IPs with a significantly higher percentage of spam, they would be semiblacklisted ("bad" percentage points, or fewer "good" percentage points, depending on the exact frequency).
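As a rough illustration of the range-versus-individual-IP idea, here is a small Python sketch. Everything in it is an assumption made for the example: the class name, the use of /24 ranges, and the rule that an IP's own history overrides its range once it has at least five recorded edits.

```python
import ipaddress
from collections import defaultdict


class RangeReputation:
    """Tracks the share of spam per IP range and per individual IP (IPv4 sketch)."""

    def __init__(self, prefix_len=24):
        self.prefix_len = prefix_len
        self.range_counts = defaultdict(lambda: [0, 0])  # range -> [spam, total]
        self.ip_counts = defaultdict(lambda: [0, 0])     # ip -> [spam, total]

    def _range_of(self, ip):
        # strict=False lets us pass a host address and get its enclosing network.
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def record(self, ip, is_spam):
        """Store one edit that a human has marked as spam or not spam."""
        for counts in (self.range_counts[self._range_of(ip)], self.ip_counts[ip]):
            counts[0] += int(is_spam)
            counts[1] += 1

    def spam_rate(self, ip, min_ip_samples=5):
        """Spam fraction for this IP; falls back to its range if the IP is little known."""
        spam, total = self.ip_counts.get(ip, (0, 0))
        if total >= min_ip_samples:
            return spam / total  # the IP's own record "semi-whitelists" or "-blacklists" it
        spam, total = self.range_counts.get(self._range_of(ip), (0, 0))
        return spam / total if total else None  # None: no data at all
```

So a well-behaved address inside a mostly spammy range ends up with a low rate of its own, while an unknown address in the same range inherits the range's rate.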
I'm going into too much detail here, and obviously it would be a massive undertaking, but given the massive amount of work it would save, it's not the sort of pipe dream that I feel guilty bringing up in front of people who could actually bring it to fruition (I know I couldn't without learning a programming language first -- right now, I have very rusty Qbasic, medium-to-advanced HTML, a bit of UNL, but nothing else, and the latter two aren't exactly programming languages).
if I were to receive an e-mail every two hours with a list of suspicious edits, I could revert them immediately as necessary.
I'd also be more likely to check edits sent to me by email. Perhaps http://meta.wikimedia.org/wiki/EmailNotification could be adapted. Currently, I don't think it will send diffs, and there's no way of filtering for "suspicious" edits.
Ahh, but there are already three-halves party applications (meaning, by Wikipedians, but not software integrated to MeW) which monitor for "suspicious" edits. Nothing complex, but helpful nonetheless in filtering out The Good Edits to give only the bad ones, based on a few very basic observations, as well as feedback.
Mark
Mark Williamson wrote:
Blocking "that-medicine-that-starts-with-c" will prevent anyone writing about socialism (which was rather a problem for socialism.wikicities.com) :)
See-eye-a-ell-eye-ess is related to _socialism_??
At the risk of ending up in everyone's spam bins, I'll spell it out: "so...Cialis...m". Blocking the word blocks any words that contain it.
Ohh. Duh. But, surely, it would take only a few lines of code to add a feature so that it only blocked the _whole word_?
This kind of spamfiltering doesn't really work. Spammers will write CCialiss, __cialis__, "C1al1s", etc. To properly fight spam one needs a Bayesian spamfilter. If edits get flagged as spam or non-spam, a database can be built up that allows new edits to be compared with them. These will then get a 'spam chance' P_s flag, and we could define a threshold P_t where P_s>P_t prevents an edit from getting through. The regular expression 'c[i1][a@][l1][i1]s' has 14 hits in my hammie.db database for Bayesian Spamfiltering using Spambayes, and I'm sure I've missed some.
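To give an idea of the principle, here is a toy naive Bayes classifier in Python. It is only a sketch: it is not how Spambayes combines its token scores, the class and method names are invented, and the add-one smoothing and 0.9 default threshold are arbitrary choices. `spam_probability` plays the role of P_s and `threshold` the role of P_t.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9@]+")


def tokenize(text):
    """Distinct lower-cased tokens; '@' is kept so that 'c1@lis' stays one token."""
    return set(TOKEN_RE.findall(text.lower()))


class NaiveBayesEditFilter:
    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, is_spam):
        """Add one edit that a human has flagged as spam or non-spam."""
        label = "spam" if is_spam else "ham"
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """P_s: estimated probability that the text is spam."""
        n_spam, n_ham = self.doc_counts["spam"], self.doc_counts["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))  # smoothed prior
        for token in tokenize(text):
            p_spam = (self.token_counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)

    def is_blocked(self, text, threshold=0.9):
        """Block the edit when P_s > P_t."""
        return self.spam_probability(text) > threshold
```

Because misspellings like "c1alis" simply become new tokens that get learned from flagged edits, no hand-written list of variants is needed.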
I mean, does anybody get spam e-mails that say "Free socialism! Click here now"... or even "Get free soCialiSm! cl**k here no*" or anything like that? I don't think spammers are sophisticated enough to realise that there are legitimate words that contain spam-filter'd words.
No, but they do replace letters by characters or introduce spaces in between.
Of course, anything that filtered on something as complex as this would be very, very complex programming.
Not really.
Perhaps instead, somebody could adapt a Free numerical rating system for spam e-mails (which gives "likelihoods" that e-mails are spam) -- Google may or may not be willing to help out there given how massive their database must be and their commitment to Goodness on the Internet, but if not there would be another project I'm sure.
The good thing about Bayesian spamfiltering is that the database is tailored to one's own needs, and the accuracy grows very quickly as the database gets larger.
I'm going into too much detail here, and obviously it would be a massive undertaking, but given the massive amount of work it would save, it's not the sort of pipe dream that I feel guilty bringing up in front of people who could actually bring it to fruition (I know I couldn't without learning a programming language first -- right now, I have very rusty Qbasic, medium-to-advanced HTML, a bit of UNL, but nothing else, and the latter two aren't exactly programming languages).
Let us have a look at the components of Spambayes. Those can certainly be used and adapted to our task. As tokens we can use IP-addresses, numbers indicating the amount of code removed (needs some more thinking), negative points when text is removed/positive when it's added (e.g. *removing* 'cialis' has the opposite effect to *adding* it), etc.
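A possible shape for such tokens, sketched in Python. The token prefixes ("ip:", "added:", "removed:", "size:") and the 1000-character cut-off are made up here; this is not a Spambayes tokenizer, just an illustration of feeding edit metadata and the direction of a change into a classifier.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def edit_tokens(old_text, new_text, ip):
    """Turn one edit into a set of classifier tokens."""
    old_words = Counter(WORD_RE.findall(old_text.lower()))
    new_words = Counter(WORD_RE.findall(new_text.lower()))

    tokens = {f"ip:{ip}"}

    # Counter subtraction keeps only positive counts, so these are the
    # words that became more frequent (added) or less frequent (removed).
    tokens.update(f"added:{w}" for w in (new_words - old_words))
    tokens.update(f"removed:{w}" for w in (old_words - new_words))

    # Coarse bucket for how much text disappeared.
    delta = len(new_text) - len(old_text)
    if delta <= -1000:
        tokens.add("size:large-removal")
    elif delta < 0:
        tokens.add("size:small-removal")
    else:
        tokens.add("size:growth")
    return tokens
```

Since "added:cialis" and "removed:cialis" are different tokens, the filter can learn that adding the word is suspicious while removing it (i.e. cleaning up spam) is not.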
I can help with adopting Spambayes or using Spambayes components for our needs. I am not an expert, but I know some.
Gerrit.
As most people who are veterans of this list will know very well by now, I am strongly against the "locking" of inactive Wikipedias. However, due to a recent increase in vandalism on inactive Wikipedias, I have decided to suggest a few more minor technical sanctions:
No, no, and no again! Spam filtering is not going to solve anything. I would have thought our wiki history has shown this time and again: You don't use technical means to fight vandalism (or spam or whatever)! You use technical means to make it easier for humans to fight vandalism.
The "correct" wiki way, of course, would be to make it easier for everyone to check the smaller wiki projects without having to check hundreds of pages. I've been saying this for years, and I'll say it again: RecentChanges, Watchlists and Newpages need to go inter-language!
E-mail notification comes close to this, because it allows the notifications to reach a central place (the user's inbox) and doesn't require anyone to keep checking Watchlists and Newpages on hundreds of wikis. I would prefer it if both were available so that everyone can choose whatever they find easiest, thereby making spam fighting as efficient as possible.
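For illustration only, the core of an inter-language RecentChanges is just a merge of per-wiki change lists by time. The Python sketch below assumes each wiki's changes have already been fetched somehow and are sorted newest first; the `Change` record and the function name are invented, and nothing here talks to MediaWiki.

```python
import heapq
import itertools
from typing import NamedTuple


class Change(NamedTuple):
    timestamp: str  # ISO 8601 in UTC, so plain string comparison orders by time
    wiki: str
    title: str
    user: str


def merged_recent_changes(feeds, limit=50):
    """Merge per-wiki change lists (each newest first) into one newest-first list."""
    merged = heapq.merge(*feeds.values(), key=lambda c: c.timestamp, reverse=True)
    return list(itertools.islice(merged, limit))
```

Given one list per wiki, `merged_recent_changes(feeds)` returns a single list that a person could scan in one place instead of visiting hundreds of separate RecentChanges pages.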
Timwi
wikipedia-l@lists.wikimedia.org