I don't understand the syntax for constructing regular expressions yet, can anybody help please?
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
So I reverted to this:
$wgSpamRegex="/(phente|xanax|valium|geocities|casino|myflooring.org|display:none|-ambien|overflow:[ \t\n]*auto)/";
Cheers
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Andy Roberts wrote:
I don't understand the syntax for constructing regular expressions yet, can anybody help please?
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
So I reverted to this:
$wgSpamRegex="/(phente|xanax|valium|geocities|casino|myflooring.org|display:none|-ambien|overflow:[ \t\n]*auto)/";
Cheers
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
You can search the Internet for help about Regular Expressions.
- -- #define Name RotemLiss #define Mail mailSTRUDELrotemlissDOTcom #define Site www.rotemliss.com
#define KeyFingerPrint 4AFD 8579 A449 4267 BED9 38E5 6EF8 5B1F EBDE 7AC0
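To illustrate Rotem's point, here is a quick sketch in Python, whose re module behaves like PHP's preg_match for these simple patterns (the trailing .org is written as \.org, i.e. an escaped literal dot):

```python
import re

# In "1*.org", "1*" matches zero or more "1"s -- including none at all --
# and the unescaped "." matches any character, so the pattern degenerates
# to "any character followed by org" and catches every .org link.
assert re.search(r"1*.org", "http://example.org")   # matches with zero 1s
assert re.search(r"1*.org", "http://1site.org")     # also matches

# The suggested 1[a-z.]*\.org requires at least one literal "1":
assert re.search(r"1[a-z.]*\.org", "http://1site.org")
assert not re.search(r"1[a-z.]*\.org", "http://example.org")
```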
On 7/18/06, Rotem Liss mail@rotemliss.com wrote:
Andy Roberts wrote:
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
This should get it, I think.
$wgSpamRegex="/1.*\.org/";
Dan
On 7/18/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
On 7/18/06, Rotem Liss mail@rotemliss.com wrote:
Andy Roberts wrote:
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
This should get it, I think.
$wgSpamRegex="/1.*\.org/";
Which is interpreted as
/1 - the literal characters "/" then "1"
.* - a run of zero or more arbitrary characters
\. - the literal character "." (the backslash is needed to treat . as a literal rather than its normal meaning of any character)
org/ - the literal characters "org"
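A quick sanity check of that pattern in Python (whose re engine agrees with PCRE here) shows it matches a "1" anywhere before a ".org", not just at the start of the domain:

```python
import re

pattern = re.compile(r"1.*\.org")

assert pattern.search("http://1site.org")            # the intended spam case
assert pattern.search("see item 1 at example.org")   # over-match: any "1" before ".org"
assert not pattern.search("http://example.org")      # no "1" anywhere, no match
```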
On 18/07/06, Rick DeNatale rick.denatale@gmail.com wrote:
On 7/18/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
On 7/18/06, Rotem Liss mail@rotemliss.com wrote:
Andy Roberts wrote:
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
This should get it, I think.
$wgSpamRegex="/1.*\.org/";
Which is interpreted as
/1 - the literal characters "/" then "1"
.* - a run of zero or more arbitrary characters
\. - the literal character "." (the backslash is needed to treat . as a literal rather than its normal meaning of any character)
org/ - the literal characters "org"
Thanks for replies.
In the end I modified Rotem's suggestion. I tested it using http://ioctl.org/jan/test/regexp.htm and found that it matched with the number "1" anywhere in the top level domain, not just at the start; hence:
|\.1[a-z.]*\.org|
seems to cover it.
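Assuming the final pattern is \.1[a-z.]*\.org (an escaped literal dot, then the 1), a quick Python check of the three interesting cases:

```python
import re

# Requires a literal "." immediately before the "1", so the "1" must
# start a domain label (e.g. www.1site.org) rather than appear mid-word.
pattern = re.compile(r"\.1[a-z.]*\.org")

assert pattern.search("http://www.1badsite.org")    # label starts with 1: blocked
assert not pattern.search("http://www.site1x.org")  # 1 mid-label: allowed
assert not pattern.search("http://example.org")     # plain .org: allowed
```

One gap worth noting: a bare http://1site.org has a "/" rather than a "." before the 1, so this pattern would not catch it.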
On 7/18/06, Andy Roberts aroberts@gmail.com wrote:
In the end I modified Rotem's suggestion. I tested it using http://ioctl.org/jan/test/regexp.htm and found that it matched with the number "1" anywhere in the top level domain, not just at the start; hence:
|\.1[a-z.]*\.org|
seems to cover it.
Oops... forgot about that part... $wgSpamRegex="/\.1.*\.org/";
Are you worried about sites with multiple numbers at the beginning 11site.org? Or, are they all a single 1 followed by text?
Dan
On 18/07/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
On 7/18/06, Andy Roberts aroberts@gmail.com wrote:
In the end I modified Rotem's suggestion. I tested it using http://ioctl.org/jan/test/regexp.htm and found that it matched with the number "1" anywhere in the top level domain, not just at the start; hence:
|\.1[a-z.]*\.org|
seems to cover it.
Oops... forgot about that part... $wgSpamRegex="/\.1.*\.org/";
Are you worried about sites with multiple numbers at the beginning 11site.org? Or, are they all a single 1 followed by text?
Just single 1's for now.
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Andy Roberts wrote:
On 18/07/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
Just single 1's for now.
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
If you want to block big numbers without commas, you can use \d{7,}: this should block every number above 999,999 which contains no commas, although I haven't checked it. You can tweak the minimal number of digits to block by editing the number (\d{6,} will block all numbers above 99,999, and \d{8,} will block all numbers above 9,999,999, etc.).
- -- #define Name RotemLiss #define Mail mail-AT-rotemliss-DOT-com #define Site www.rotemliss.com
#define KeyFingerPrint 4AFD 8579 A449 4267 BED9 38E5 6EF8 5B1F EBDE 7AC0
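Rotem's digit-run idea, sketched in Python (re standing in for PHP's PCRE):

```python
import re

big_number = re.compile(r"\d{7,}")   # seven or more consecutive digits

assert big_number.search("1234567")         # 7 digits in a row: blocked
assert not big_number.search("123456")      # only 6 digits: allowed
assert not big_number.search("1,234,567")   # commas break the digit run: allowed
```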
On Tue, 18 Jul 2006, Rotem Liss wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Andy Roberts wrote:
On 18/07/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
Just single 1's for now.
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
If you want to block big numbers without commas, you can use \d{7,}: this should block every number above 999,999 which contains no commas, although I haven't checked it. You can tweak the minimal number of digits to block by editing the number (\d{6,} will block all numbers above 99,999, and \d{8,} will block all numbers above 9,999,999, etc.).
I came up with this pattern:
(?<!\S)\d{12,}(?!\S)
which means "at least 12 digits which are not preceded or followed by non-whitespace characters". The idea is to allow URLs which contain long sequences of digits.
Martin
-- Martin Jambon, PhD http://martin.jambon.free.fr
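Martin's pattern can be verified the same way; Python's re supports the same (fixed-width) lookbehind syntax:

```python
import re

# At least 12 digits with no non-whitespace character touching either end,
# so digit runs embedded in URLs or longer tokens are left alone.
spam_digits = re.compile(r"(?<!\S)\d{12,}(?!\S)")

assert spam_digits.search("some text 300142760257 more text")
assert not spam_digits.search("http://example.org/id/300142760257")
assert not spam_digits.search("12345678901")   # only 11 digits
```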
On 18/07/06, Martin Jambon martin_jambon@emailuser.net wrote:
On Tue, 18 Jul 2006, Rotem Liss wrote:
Andy Roberts wrote:
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
If you want to block big numbers without commas, you can use: \d{7,} ? this should block every number above 999,999 which contains no commas, although I haven't checked it. You can tweak the minimal number of digits to block by editing the number (\d{6,} will block all the numbers above 99,999, and \d{8,} will block all the numbers above 9,999,999, etc.).
I came up with this pattern:
(?<!\S)\d{12,}(?!\S)
which means "at least 12 digits which are not preceded or followed by non-whitespace characters". The idea is to allow URLs which contain long sequences of digits.
Thanks Martin.
I like the way you were able to translate your RegEx and rationale into logical English.
mediawiki-l@lists.wikimedia.org