I am trying to import a few thousand pages that include maps using the Google Maps extension.
The text is in the format as below:
<googlemap lat="36.3015" lon="-76.21976" type="map" zoom="11" width="700" height="600">
36.3214,-76.2134, Click to see [[some text]] page
36.9876,-76.1234, Click to see [[some text]] page
</googlemap>
The message I get is pretty much identical whether I use importDump.php or Special:Import and is:
XML import Failure at line 40, col 42 (byte 1725; " 36.3214,-76"): Attribute redefined
Anyone able to shed any light on this?
Many thanks, Paul
I'm still getting spammed with what appears to be Chinese spam. Can I use the regex filter to block all non-western characters, or specifically Chinese ones?
-adrian
There is a bot that keeps posting random words or alphanumeric sequences to the beginning of pages on my wiki.
Is it possible to warn users that post fewer than a certain number of characters to the beginning of a page?
I don't want to enforce a captcha on every edit.
-Adrian
I was experiencing the same types of "spam," or whatever it is.
I started to enforce a captcha on every edit, which bothers me, but stopped the junk. I hope there's a better solution out there!
Thanks, Ben
---- OpenOffice and open source blog: http://www.solidoffice.com/
Wiki business directory: http://www.wikipages.com/
Benjamin Horst wrote:
I was experiencing the same types of "spam," or whatever it is.
I started to enforce a captcha on every edit, which bothers me, but stopped the junk. I hope there's a better solution out there!
Thanks, Ben
We're having the same problem with our wikis. It seems that this could be solved if there is a switch in MediaWiki that really mandates that changes be made by registered users.
We are planning to implement the other spam measures mentioned on this list.
Thanks for the advice from last week about how to stop DIV spam.
Chuck
Yeah, I don't want to stop anonymous users, but it seems like that might be necessary. Or it would be great if you could captcha new posts from either new users or unfamiliar IPs.
-Adrian
A couple of days ago I was also hit by this, or a similar, spambot.
I know the question of deleting users has been brought up before and nixed for valid db reasons. In this case, however, it's different. Looking at the user table on my wiki, a bot has been registering users for the past 6 months, every few minutes. The format of the usernames is consistent: in the beginning the usernames were 6 characters ([a-zA-Z0-9]) with the first and fourth characters capital letters. For a while in July the usernames changed to eight characters, and then recently to 10 characters, with the first and sixth capital letters.
This started *exactly* six months to the day that the spam started (20070410 to 20071010).
Of the 17000 registered users on my site, over 10000 are spambot registrations. We monitor edits pretty closely (though obviously not registrations, which will change), so of those 10000 spambot users, no more than 20 have actually edited a page, all starting on Oct 10th. (We caught it pretty quickly.)
So it seems as if I should be able to feed my wiki a list of users and have it delete those users, if, as in this case, I am *absolutely* sure none of these users have edited pages.
So then, the question. If there's not a maintenance script somewhere that reads a user_name or user_id list and zaps them, then is there a danger in deleting those user records from the user table if I *know* they've made no edits?
thanks
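A rough sketch of that kind of cleanup, assuming the standard user/revision schema and the old commandLine.inc maintenance bootstrap; the name pattern is only an illustration of the 10-character bot format described above, so adjust it to whatever your bots actually use, and back up the database first:

    <?php
    # Sketch only: delete registered accounts that match a bot-like name
    # pattern AND have never made an edit. Run from the maintenance/ dir.
    require_once( 'commandLine.inc' );

    $botPattern = '/^[A-Z][a-z0-9]{4}[A-Z][a-z0-9]{4}$/';  # hypothetical 10-char format

    $dbw = wfGetDB( DB_MASTER );
    $res = $dbw->select( 'user', array( 'user_id', 'user_name' ) );
    while ( $row = $dbw->fetchObject( $res ) ) {
        if ( !preg_match( $botPattern, $row->user_name ) ) {
            continue;                       # doesn't look bot-generated
        }
        $edits = $dbw->selectField( 'revision', 'COUNT(*)',
            array( 'rev_user' => $row->user_id ) );
        if ( $edits == 0 ) {                # never edited anything
            $dbw->delete( 'user', array( 'user_id' => $row->user_id ) );
            print "Deleted {$row->user_name}\n";
        }
    }

If any of the accounts do have edits, leave them alone (or rename them as described below) so revision attribution stays intact.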
dKosopedia admin wrote:
Of the 17000 registered users on my site, over 10000 are spambot registrations. We monitor edits pretty closely (though obviously not registrations, which will change), so of those 10000 spambot users, no more than 20 have actually edited a page, all starting on Oct 10th. (We caught it prett quickly.)
So it seems as if I should be able to feed my wiki a list of user and have it delete those users, if, as in this case, I am *absolutely* sure none of these users have edited pages.
I know it's not a solution, but since I've seen a few edits on my wiki in such cases, I zapped all users with the FooooBaarr name format. I did a MySQL query to grab all the 10 character user IDs where the user ID and the real name were the same (and where it wasn't the one real user who just happened to get caught with this same general format name). With these IDs, I created a script to run changePassword.php (in the /wiki/maintenance/ directory) for every such user. This replaced all the passwords with one I know so these bots can't sign in. I then used SQL to rename the users to zzFooooBaarr so they sort to the bottom in the User List page.
Still there, but unusable by the spammers and not interfering with regular users.
I put the reCaptcha extension in place to try to defeat the bots from making new user IDs.
Mike
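For anyone who wants to script Mike's two steps (make the password unusable and push the name to the bottom of the user list), a minimal sketch along the same lines, again assuming direct database access from the maintenance directory and a made-up name pattern:

    <?php
    # Sketch: lock out and rename bot accounts instead of deleting them.
    require_once( 'commandLine.inc' );

    $botPattern = '/^[A-Z][a-z]{4}[A-Z][a-z]{4}$/';   # hypothetical FooooBaarr shape

    $dbw = wfGetDB( DB_MASTER );
    $res = $dbw->select( 'user', array( 'user_id', 'user_name', 'user_real_name' ) );
    while ( $row = $dbw->fetchObject( $res ) ) {
        if ( !preg_match( $botPattern, $row->user_name )
            || $row->user_name !== $row->user_real_name ) {
            continue;       # only touch accounts that fit the bot profile
        }
        $dbw->update( 'user',
            array(
                'user_password' => 'locked',               # never a valid hash, so no login
                'user_name'     => 'zz' . $row->user_name  # sorts to the bottom of the user list
            ),
            array( 'user_id' => $row->user_id )
        );
    }

Note that if any of these accounts have made edits, rev_user_text still records the old name, so check before renaming.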
Has anybody found a solution for the gibberish spam short of installing captcha extensions?
Chuck
On 16/10/2007, Chuck chuck@mutualaid.org wrote:
Has anybody found a solution for the gibberish spam short of installing captcha extensions?
Not here. I set anonymous edits to false and installed reCaptcha.
Installing reCaptcha doesn't really constitute a big change to the wiki (it's quite unobtrusive really), but disallowing anonymous edits is a pain. This is especially true of pre-1.11.0 MW versions, where 'view source' doesn't seem to work 'out of the box'.
In theory there should be a simple SQL query to detect these kinds of spam (one nonsense word at the start of a page) - however, it seems better to code a general solution that highlights potential spam for review. It's keeping track of the potentially spammed pages that I find most difficult.
Anyone handy with Bayesian filters? If we could rank edits by 'spaminess' using a Bayesian filter, and be given the option to review the top n most spammy revisions (with feedback training) ... well... that would be great!
Send all your edits to a gmail account and only allow those that get forwarded back?
I think making an IP do a captcha on its first edit only would help. The captcha would keep a record of the most recent IPs in a table and, if an edit hasn't been recorded from them, give them a captcha; otherwise pass. This may mean the IP table would grow large (it may not be fast for large wikis with lots of editing), but purge it after a certain period (15 days etc). A captcha has to be there one way or the other. This is the least irritating. There is definitely no way to check if an edit is spam or not, except for a captcha.
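A sketch of the core check Eric describes, using a hypothetical captcha_seen table (cs_ip, cs_timestamp) and ignoring the question of where exactly to hook it in; a real version would only record the IP once the captcha has actually been passed:

    <?php
    # Sketch: has this IP made an edit (or passed a captcha) recently?
    function needsFirstEditCaptcha( $ip ) {
        $dbw = wfGetDB( DB_MASTER );

        # Purge entries older than 15 days so the table stays small.
        $cutoff = $dbw->timestamp( time() - 15 * 86400 );
        $dbw->delete( 'captcha_seen',
            array( 'cs_timestamp < ' . $dbw->addQuotes( $cutoff ) ) );

        $seen = $dbw->selectField( 'captcha_seen', 'cs_ip', array( 'cs_ip' => $ip ) );
        if ( $seen !== false ) {
            return false;   # already seen this IP - no captcha needed
        }
        $dbw->insert( 'captcha_seen',
            array( 'cs_ip' => $ip, 'cs_timestamp' => $dbw->timestamp() ) );
        return true;        # first edit from this IP - show a captcha
    }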
On 16/10/2007, Eric K ek79501@yahoo.com wrote:
I think making an IP do a captcha on its first edit only would help. The captcha would keep a record of the most recent IPs in a table and, if an edit hasn't been recorded from them, give them a captcha; otherwise pass. This may mean the IP table would grow large (it may not be fast for large wikis with lots of editing), but purge it after a certain period (15 days etc). A captcha has to be there one way or the other. This is the least irritating.
I like the idea.
There is definitely no way to check if an edit is spam or not, except for a captcha.
Not currently. I think a 'review for spam' feature would work very well for most small sites.
On 16/10/2007, Eric K ek79501@yahoo.com wrote:
There is definitely no way to check if an edit is spam or not, except for a captcha.
I have to point out the flaw in that statement, tenuous though it is - a CAPTCHA does *not* constitute an anti-spam acid test; all it does is confirm that, to the best of the test's ability (which might not count for anything), we are dealing with a human being rather than an automated program.
A human could quite well post spam to his/her heart's content, and would be able to pass a CAPTCHA (we hope). The default configuration settings for ConfirmEdit, which CAPTCHA extensions are based upon, allow registered users to skip these tests, so in theory, one could set up a spam bot with a few minutes of initial human assistance, which is why we supplement such things with throttles, "heuristics" (regular expressions aren't that great in terms of configurability, but I cling to the hope that one day we'll have decent spam-edit detection heuristics, even if just for the basics).
Rob Church
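For reference, the skip-for-registered-users behaviour Rob mentions is configurable. Something along these lines in LocalSettings.php (with a ConfirmEdit-based extension such as reCAPTCHA already installed) makes ordinary registered users take the test as well - check the extension's own documentation for the exact variable names in your version:

    # After including the captcha extension:
    $wgGroupPermissions['user']['skipcaptcha']          = false;  # registered users get it too
    $wgGroupPermissions['autoconfirmed']['skipcaptcha'] = false;
    $wgCaptchaTriggers['edit']          = true;   # every edit
    $wgCaptchaTriggers['create']        = true;   # page creation
    $wgCaptchaTriggers['createaccount'] = true;   # new account registration
    $wgCaptchaTriggers['addurl']        = true;   # edits that add new external links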
I agree, you're right; to be more accurate, a captcha only makes certain that a human is editing the page (and, to get more technical, complex bots can solve the captcha). Throttling is also necessary - anything to prevent bots from doing the things they do well.
I want to add the functionality similar to what is available here:
http://en.wikipedia.org/wiki/Template:Anarchism
Any idea how the Show/Hide feature works?
Thanks,
Nikhil
Shah, Nikhil wrote:
I want to add the functionality similar to what is available here:
http://en.wikipedia.org/wiki/Template:Anarchism
Any idea how the Show/Hide feature works?
Thanks,
Nikhil
With JavaScript: http://en.wikipedia.org/wiki/Wikipedia:NavFrame
Thank you Platonides, this is exactly what I was looking for.
The common.css & common.js available on http://en.wikipedia.org/wiki/Wikipedia:NavFrame are very different from the ones I currently have.
I am wondering if the best way to implement this is to manually merge the files? Is there any other shortcut?
This gibberish spam doesn't make much sense, pardon the pun. The spambot isn't inserting any actual links. My wikis are getting spammed with short text strings like "copasnotra" and "romonboel". Based on my limited understanding of spambots, it seems like the bots are making these changes as a prelude to doing something else.
After some further investigation, some interesting clues emerge. This "gibberish spambot" is evidently generating fake user accounts. I deleted hundreds of fake accounts last night from the four wiki databases that we run. Surprisingly, the spambot is doing something that should make it easy to stop: all of its fake user accounts include an email address from the ".ru" domain. The user names are all different, but the spambot only uses a limited number of fake email addresses from the .ru domain. Would it be possible to reject user registrations from a certain email domain?
Another facet of this problem is that this spambot is using proxy ISPs or rotating fake IP addresses. In my experience, this is a common method that spambots use to defeat easy anti-spam measures like server level IP blocking.
Now that I think about it, I may have thwarted the final stage of this bot's activities by implementing that spam hack that stops hidden DIV spam. But our wikis are still getting hit hard by the "gibberish spam". It's unclear whether the hidden DIV spam and the gibberish spam are part of the same spambot's suite of attacks.
Chuck
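On the question of rejecting registrations from a certain email domain, the AbortNewAccount hook can do that at registration time. A minimal sketch for LocalSettings.php - the hook's exact signature differs slightly between MediaWiki versions, the domain list is made up, and blocking a whole provider will of course also block its legitimate users:

    $wgHooks['AbortNewAccount'][] = 'rejectSpamEmailDomains';
    function rejectSpamEmailDomains( $user, &$message ) {
        $blockedDomains = array( 'example.ru' );   # hypothetical - fill in the ones you see
        $email = strtolower( $user->getEmail() );
        foreach ( $blockedDomains as $domain ) {
            $suffix = '@' . $domain;
            if ( substr( $email, -strlen( $suffix ) ) === $suffix ) {
                $message = 'Account creation from this e-mail domain is not allowed.';
                return false;   # abort the registration
            }
        }
        return true;
    }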
--- Chuck chuck@mutualaid.org wrote:
This gibberish spam doesn't make much sense, pardon the pun. The spambot isn't inserting any actual links. My wikis are getting spammed with short text strings like "copasnotra" and "romonboel". Based on my limited understanding of spambots, it seems like the bots are making these changes as a prelude to doing something else.
This is what is happening to me as well, but the inserted words are always at the beginning of the page, which gives me hope of blocking these types of bot edits with a regex.
-adrian
2007@gmask.com wrote:
This is what is happening to me as well, but the inserted words are always at the beginning of the page, which gives me hope of blocking these types of bot edits with a regex.
Right. This is the same bot we're having problems with.
Chuck
So what would the syntax be to match something that begins at the start of the page?
What I'm thinking is to try to match anonymous users who post fewer than a certain number of characters to the beginning of a page.
But it seems like regex is limited to matching the beginning of a line.
-Adrian
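On the "beginning of a line" worry: in PCRE, '^' matches only the very start of the whole subject string unless the /m (multiline) modifier is added, and as far as I can tell $wgSpamRegex is matched against the submitted page text, so a pattern without /m is effectively anchored to the top of the page. A hypothetical example (which would also catch legitimate pages that happen to start with a single lowercase word, so treat it as an illustration only):

    $wgSpamRegex = '/^[a-z]{5,12}\s*\n/';     # no /m: matches only at the start of the text
    # compare: '/^[a-z]{5,12}\s*$/m' would match such a word on *any* line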
How about something similar to this?

    $lines = explode( "\n", $revision->getText() );
    if ( preg_match( $gibberishRegex, $lines[0] ) ) {
        return "bad user";   // first line of the submitted text looks like gibberish
    } else {
        return "ok";
    }
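Wiring that kind of check into a hook might look roughly like this (LocalSettings.php); the 'EditFilter' hook's signature varies a little between versions, $wgGibberishRegex is a made-up name, and setting a non-empty $error should stop the save (see docs/hooks.txt for your release):

    $wgGibberishRegex = '/^[a-z]{5,12}\s*$/';   # hypothetical "one lowercase nonsense word" pattern

    $wgHooks['EditFilter'][] = 'checkGibberishFirstLine';
    function checkGibberishFirstLine( $editor, $text, $section, &$error ) {
        global $wgGibberishRegex, $wgUser;
        $lines = explode( "\n", $text );
        if ( $wgUser->isAnon() && preg_match( $wgGibberishRegex, $lines[0] ) ) {
            $error = 'The first line of your edit looks like random text; please fix it and save again.';
        }
        return true;
    }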
What will you do when the pattern of spam immediately changes?
2007@gmask.com wrote:
This is what is happening to me as well, but the inserted words are always at the beginning of the page, which gives me hope of blocking these types of bot edits with a regex.
I was thinking that this could be checked against a dictionary. If the first "word" inserted is not in the dictionary (for the page's language), require the user to confirm the save. A bot won't confirm.
This would have to be smart enough to skip wikitext (e.g. don't worry about "[[Image:"). Similarly, it would choke on obscure acronyms, but a real person would not likely complain too much.
This could be a hook into the "save" code and would only need to check the first word. However, the bot writer could switch to posting at the end of the article... Possibly a scan of the entire page to reject exceptionally bad spelling might suffice, but that will put off some contributors (and annoy US vs Canadian vs British spellers if the bad-spelling algorithm isn't smart enough to know that honour vs honor isn't that bad).
Mike
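A quick sketch of that first-word dictionary check, using PHP's pspell extension (needs aspell and a dictionary for the wiki's language installed); the function name and the crude wikitext-stripping are only illustrative:

    <?php
    # Does the first plain word of the submitted text look like a real word?
    function firstWordLooksReal( $text, $lang = 'en' ) {
        $text = preg_replace( '/^[\[\{\|\s]+/', '', $text );   # skip leading [[..., {{... markup
        if ( !preg_match( '/^([A-Za-z]+)/', $text, $m ) ) {
            return true;            # no plain word at the start - nothing to judge
        }
        $word = $m[1];
        if ( strlen( $word ) <= 3 || preg_match( '/^[A-Z]+$/', $word ) ) {
            return true;            # give short words and acronyms the benefit of the doubt
        }
        $dict = pspell_new( $lang );
        return (bool)pspell_check( $dict, $word );
    }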
So:
1) We are all seeing the same kind of spam.
2) We need something that looks at the whole edit, and isn't based on some trivial aspect of the particular spam attack (that could easily be changed).
3) We need something that goes beyond an 'are you a human' captcha - because such tests are either too infrequent to be useful or too common to be tenable.
4) What is wrong with a Bayesian (email style) spam filter?
Each edit gets certain attributes set - username and email or IP address, number of good edits from this user, edit frequency of this user, edit diff text, etc. - and then the Bayesian filter flags the edit with a 'level of spamminess'. Depending on configuration spammy edits can be flat out rejected with multiple spams leading to automatic bans. Or potential spam can be queued in a special list of edits for review (the review process being key to learning the patterns of spam). Such a filter could equally be applied to vandalism... Also (while I am at it) sysops will have the option to 'mark edit as spam', providing more data for the training algorithm.
So there is only one problem... Where should we start?
Some Googling for PHP code to nick looks promising...
http://www.phpclasses.org/browse/file/9319.html - Guestbook Example with SpamFilter
http://www.squirrelmail.org/plugin_view.php?id=115 - uses a Bayesian algorithm to determine what you consider to be spam.
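To show the shape of the thing (the real work is in storing the token counts and training them from sysop feedback), a toy naive-Bayes scorer in PHP; all the names here are made up:

    <?php
    # Toy naive-Bayes scorer: given per-token counts learned from edits that
    # sysops marked as spam or ham, estimate P(spam) for a new edit diff.
    function spamminess( $diffText, array $spamCounts, array $hamCounts, $nSpam, $nHam ) {
        preg_match_all( '/[a-z0-9]{3,}/', strtolower( $diffText ), $m );
        $logSpam = log( $nSpam / ( $nSpam + $nHam ) );   # log prior
        $logHam  = log( $nHam  / ( $nSpam + $nHam ) );
        foreach ( array_unique( $m[0] ) as $tok ) {
            $s = isset( $spamCounts[$tok] ) ? $spamCounts[$tok] : 0;
            $h = isset( $hamCounts[$tok] )  ? $hamCounts[$tok]  : 0;
            # Laplace smoothing so unseen tokens don't zero things out
            $logSpam += log( ( $s + 1 ) / ( $nSpam + 2 ) );
            $logHam  += log( ( $h + 1 ) / ( $nHam + 2 ) );
        }
        return 1 / ( 1 + exp( $logHam - $logSpam ) );    # back to a 0..1 probability
    }

Edits scoring above some threshold could then be queued on a review page rather than rejected outright, which is the 'review for spam' idea mentioned earlier.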
Dan Bolser wrote:
- What is wrong with a Bayesian (email style) spam filter?
Look at bogofilter. There is no reason you couldn't pipe all changes through it, creating HAM and SPAM files for sorting and training. I use it for email with very good results.
On Tue, 2007-10-16 at 10:04 -0500, Chuck wrote:
My wikis are getting spammed with short text strings like "copasnotra" and "romonboel". Based on my limited understanding of spambots, it seems like the bots are making these changes as a prelude to doing something else
What they're doing is polluting the database of heuristics by inserting either common or nonsense words. For example, if (prior to this tactic) "spammy" words (Viagra, etc.) made up 80% of the words in the table, they flood the database with common or nonsense words to push the spammy words below the filter's threshold, lowering the quality of the filter enough to let the spammy words back through.
I've seen this used for years while using dspam, but thankfully for us, dspam has kept us 100% spam-free for years. Not a single spam email or other garbage in any user's mailbox going on years, with only very minimal false-positives.
Perhaps a look at their methods, and rolling those in to mediawiki's anti-spammy comment approach might be worthwhile?
Hello,
If you can attach a screenshot of "what appears to be chinese spam", I can help you identify whether it is Chinese and, if so, what it says.
Nelson
"2007@gmask.com" 2007@gmask.com Sent by: To mediawiki-l-bounc mediawiki-l@lists.wikimedia.org es@lists.wikimedi cc a.org Subject [Mediawiki-l] chinese spam 10/11/2007 12:09 PM
Please respond to 2007@gmask.com; Please respond to MediaWiki announcements and site admin list <mediawiki-l@list s.wikimedia.org>
I'm still getting spammed with what appears to be chinese spam. Can I use the regex filter to block all non western characters, or specifically chinese ones?
-adrian
_______________________________________________ MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/mediawiki-l
NAL> Can I use the regex filter to block all non western characters
Yes. [^[:ascii:]] matches any non-ASCII character.
Hmm... that doesn't seem to block the simplified chinese characters.
ASCII spam fighter for Chinese sites:
    $wgSpamRegex = '/^[[:ascii:]]*$/';  # Must have at least one non-ASCII character to post.
Chinese spam fighter for ASCII sites:
    $wgSpamRegex = '/[^[:ascii:]]/';    # Must be pure ASCII to post.
Hmm... that doesn't seem to block the simplified chinese characters.
Maybe they used "&#...;" HTML entities? Then add a regexp to catch entities, or whatever else they are using, too.
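If the bot is posting the Chinese text as numeric character references rather than raw UTF-8, something like this (untested, for an ASCII-only wiki) should catch both forms - though it will also block legitimate entities such as dashes, so only use it on a wiki that really is pure ASCII:

    $wgSpamRegex = '/[^\x00-\x7F]|&#x?[0-9a-f]+;/i';   # raw non-ASCII bytes, or &#21704; / &#x54c8; style entities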
Paul Coghlan wrote:
I am trying to import a few thousand pages that include maps using the Google Maps extension.
The text is in the format as below:
<googlemap lat="36.3015" lon="-76.21976" type="map" zoom="11" width="700" height="600">
36.3214,-76.2134, Click to see [[some text]] page
36.9876,-76.1234, Click to see [[some text]] page
</googlemap>
The message I get is pretty much identical whether I use importDump.php or Special:Import and is:
XML import Failure at line 40, col 42 (byte 1725; " 36.3214,-76"): Attribute redefined
Anyone able to shed any light on this?
Can you confirm that your XML is properly formatted? Remember the page text is *text*, so '<', '>', etc. should appear as '&lt;', '&gt;', etc. in your .xml file.
-- brion vibber (brion @ wikimedia.org)