I don't understand the syntax for constructing regular expressions yet, can anybody help please?
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
So I reverted to this:
$wgSpamRegex="/(phente|xanax|valium|geocities|casino|myflooring.org|display:none|-ambien|overflow:[ \t\n]*auto)/";
Cheers
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Andy Roberts wrote:
I don't understand the syntax for constructing regular expressions yet, can anybody help please?
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
So I reverted to this:
$wgSpamRegex="/(phente|xanax|valium|geocities|casino|myflooring.org|display:none|-ambien|overflow:[ \t\n]*auto)/";
Cheers
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
You can search the Internet for help about Regular Expressions.
- -- #define Name RotemLiss #define Mail mailSTRUDELrotemlissDOTcom #define Site www.rotemliss.com
#define KeyFingerPrint 4AFD 8579 A449 4267 BED9 38E5 6EF8 5B1F EBDE 7AC0
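To illustrate Rotem's point, here is a quick sketch in Python, whose re module behaves like PHP's preg_match for these simple patterns (the trailing .org is written as \.org, i.e. an escaped literal dot):

```python
import re

# In "1*.org", "1*" matches zero or more "1"s -- including none at all --
# and the unescaped "." matches any character, so the pattern degenerates
# to "any character followed by org" and catches every .org link.
assert re.search(r"1*.org", "http://example.org")   # matches with zero 1s
assert re.search(r"1*.org", "http://1site.org")     # also matches

# The suggested 1[a-z.]*\.org requires at least one literal "1":
assert re.search(r"1[a-z.]*\.org", "http://1site.org")
assert not re.search(r"1[a-z.]*\.org", "http://example.org")
```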
On 7/18/06, Rotem Liss mail@rotemliss.com wrote:
Andy Roberts wrote:
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
This should get it, I think.
$wgSpamRegex="/1.*\.org/";
Dan
On 7/18/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
On 7/18/06, Rotem Liss mail@rotemliss.com wrote:
Andy Roberts wrote:
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
This should get it, I think.
$wgSpamRegex="/1.*\.org/";
Which is interpreted as
/1 - the literal characters "/" then "1"
.* - a run of zero or more arbitrary characters
\. - the literal character "." (the backslash is needed to treat . as a literal rather than its normal meaning of any character)
org/ - the literal characters "org"
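A quick sanity check of that pattern in Python (whose re engine agrees with PCRE here) shows it matches a "1" anywhere before a ".org", not just at the start of the domain:

```python
import re

pattern = re.compile(r"1.*\.org")

assert pattern.search("http://1site.org")            # the intended spam case
assert pattern.search("see item 1 at example.org")   # over-match: any "1" before ".org"
assert not pattern.search("http://example.org")      # no "1" anywhere, no match
```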
On 18/07/06, Rick DeNatale rick.denatale@gmail.com wrote:
On 7/18/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
On 7/18/06, Rotem Liss mail@rotemliss.com wrote:
Andy Roberts wrote:
I wanted to make one which catches all external links to sites which begin with the number 1 and also end with .org, as this is a spam pattern I have been having repeated trouble with recently. But using |1*.org| has the unintended effect of blocking all .orgs.
I think |1[a-z.]*\.org| should work, but I haven't tested it. The phrase "1*" actually means "zero or more occurrences of the character 1".
This should get it, I think.
$wgSpamRegex="/1.*\.org/";
Which is interpreted as
/1 - the literal characters "/" then "1"
.* - a run of zero or more arbitrary characters
\. - the literal character "." (the backslash is needed to treat . as a literal rather than its normal meaning of any character)
org/ - the literal characters "org"
Thanks for replies.
In the end I modified Rotem's suggestion. I tested it using http://ioctl.org/jan/test/regexp.htm and found that it matched with the number "1" anywhere in the top level domain, not just at the start; hence:
|\.1[a-z.]*\.org|
seems to cover it.
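Assuming the final pattern is \.1[a-z.]*\.org (an escaped literal dot, then the 1), a quick Python check of the three interesting cases:

```python
import re

# Requires a literal "." immediately before the "1", so the "1" must
# start a domain label (e.g. www.1site.org) rather than appear mid-word.
pattern = re.compile(r"\.1[a-z.]*\.org")

assert pattern.search("http://www.1badsite.org")    # label starts with 1: blocked
assert not pattern.search("http://www.site1x.org")  # 1 mid-label: allowed
assert not pattern.search("http://example.org")     # plain .org: allowed
```

One gap worth noting: a bare http://1site.org has a "/" rather than a "." before the 1, so this pattern would not catch it.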
On 7/18/06, Andy Roberts aroberts@gmail.com wrote:
In the end I modified Rotem's suggestion. I tested it using http://ioctl.org/jan/test/regexp.htm and found that it matched with the number "1" anywhere in the top level domain, not just at the start; hence:
|\.1[a-z.]*\.org|
seems to cover it.
Oops... forgot about that part... $wgSpamRegex="/\.1.*\.org/";
Are you worried about sites with multiple numbers at the beginning 11site.org? Or, are they all a single 1 followed by text?
Dan
On 18/07/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
On 7/18/06, Andy Roberts aroberts@gmail.com wrote:
In the end I modified Rotem's suggestion. I tested it using http://ioctl.org/jan/test/regexp.htm and found that it matched with the number "1" anywhere in the top level domain, not just at the start; hence:
|\.1[a-z.]*\.org|
seems to cover it.
Oops... forgot about that part... $wgSpamRegex="/\.1.*\.org/";
Are you worried about sites with multiple numbers at the beginning 11site.org? Or, are they all a single 1 followed by text?
Just single 1's for now.
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Andy Roberts wrote:
On 18/07/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
Just single 1's for now.
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
If you want to block big numbers without commas, you can use \d{7,}: this should block every number above 999,999 which contains no commas, although I haven't checked it. You can tweak the minimal number of digits to block by editing the number (\d{6,} will block all numbers above 99,999, and \d{8,} will block all numbers above 9,999,999, etc.).
- -- #define Name RotemLiss #define Mail mail-AT-rotemliss-DOT-com #define Site www.rotemliss.com
#define KeyFingerPrint 4AFD 8579 A449 4267 BED9 38E5 6EF8 5B1F EBDE 7AC0
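Rotem's digit-run idea, sketched in Python (re standing in for PHP's PCRE):

```python
import re

big_number = re.compile(r"\d{7,}")   # seven or more consecutive digits

assert big_number.search("1234567")         # 7 digits in a row: blocked
assert not big_number.search("123456")      # only 6 digits: allowed
assert not big_number.search("1,234,567")   # commas break the digit run: allowed
```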
On Tue, 18 Jul 2006, Rotem Liss wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Andy Roberts wrote:
On 18/07/06, Dan Davis hokie99cpe+wiki@gmail.com wrote:
Just single 1's for now.
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
If you want to block big numbers without commas, you can use \d{7,}: this should block every number above 999,999 which contains no commas, although I haven't checked it. You can tweak the minimal number of digits to block by editing the number (\d{6,} will block all numbers above 99,999, and \d{8,} will block all numbers above 9,999,999, etc.).
I came up with this pattern:
(?<!\S)\d{12,}(?!\S)
which means "at least 12 digits which are not preceded or followed by non-whitespace characters". The idea is to allow URLs which contain long sequences of digits.
Martin
-- Martin Jambon, PhD http://martin.jambon.free.fr
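Martin's pattern can be verified the same way; Python's re supports the same (fixed-width) lookbehind syntax:

```python
import re

# At least 12 digits with no non-whitespace character touching either end,
# so digit runs embedded in URLs or longer tokens are left alone.
spam_digits = re.compile(r"(?<!\S)\d{12,}(?!\S)")

assert spam_digits.search("some text 300142760257 more text")
assert not spam_digits.search("http://example.org/id/300142760257")
assert not spam_digits.search("12345678901")   # only 11 digits
```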
On 18/07/06, Martin Jambon martin_jambon@emailuser.net wrote:
On Tue, 18 Jul 2006, Rotem Liss wrote:
Andy Roberts wrote:
Here's another pattern though -
I suffer roughly weekly from a bot which doesn't add any links; it just edits several existing pages and adds a line consisting of about a dozen random digits, e.g.:
300142760257
But different every time. That kind of behaviour, combined with changing IP numbers and delays of a few minutes between edits, seems to be theoretically impossible to defend against, as well as pointless.
I'm loath to force login to edit, because the number of genuine contributions does drop a little when I resort to that.
If you want to block big numbers without commas, you can use: \d{7,} ? this should block every number above 999,999 which contains no commas, although I haven't checked it. You can tweak the minimal number of digits to block by editing the number (\d{6,} will block all the numbers above 99,999, and \d{8,} will block all the numbers above 9,999,999, etc.).
I came up with this pattern:
(?<!\S)\d{12,}(?!\S)
which means "at least 12 digits which are not preceded or followed by non-whitespace characters". The idea is to allow URLs which contain long sequences of digits.
Thanks Martin.
I like the way you were able to translate your RegEx and rationale into logical English.
mediawiki-l@lists.wikimedia.org