We've been having quite a few complaints about false positives from the AntiSpoof extension -- an extension which attempts to prevent registration of names which are confusingly similar to names already registered. Brion responded to these complaints with "get a sysop to make the account for you", but I don't think that's a very good solution. So I've been working on the AntiSpoof extension today, attempting to make it a bit more relaxed.
The most fundamental problem is the problem of merging sets. Say we want to treat visually similar characters as part of a set, and we also want to treat letters which are the same except for their case as part of a set. So, for example, say we have the following pairs:
Η (capital eta) = H (latin)
Η (capital eta) = η (lowercase eta)
η (lowercase eta) = n (latin)
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
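To make the failure mode concrete, here is a toy sketch in Python (not AntiSpoof's actual code) that merges the three pairs above with a union-find and ends up conflating latin n with latin H:

# Toy illustration of transitive set merging; the real pair table is much
# larger, but these three entries are enough to reproduce the conflation.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

pairs = [('Η', 'H'),   # capital eta = latin H
         ('Η', 'η'),   # capital eta = lowercase eta
         ('η', 'n')]   # lowercase eta = latin n
for a, b in pairs:
    union(a, b)

assert find('n') == find('H')  # n and H land in the same set: a false positive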
The problem is that merging sets is fairly fundamental to the way AntiSpoof works -- i.e. by calculating a canonical representation of the username, storing it and indexing it. So it's not going to change any time soon unless we get really clever. But there are some things we can do to minimise its effects.
The first and most obvious thing to do was to remove the transliteration pairs. These are pairs of characters where one member of the pair is a common phonetic transliteration of the other, e.g. cyrillic en "Н" = latin N. This was the cause of most of the spurious conflations between latin characters. This has now been done.
There are now three remaining categories of conflated character pairs: case folding, visual similarity and chinese traditional/simplified conversion.
The second thing to do is to minimise cross-script pairs. Since cross-script usernames are disallowed, cross-script pairs are mostly redundant. You could make a case to leave some of them in, for example some latin usernames can be spoofed entirely using cyrillic characters. And some communities may have a special need for allowing a certain pair of scripts in a username (e.g. latin and hiragana). It's best if we can just keep the pairs which are visually very similar, and consciously avoid including cross-script pairs which will cause false conflations within scripts.
I've done some work on this, but I think it's time to hand over the job to the community, if the community wants it. I've created a page with a big list of pairs, at:
http://www.mediawiki.org/wiki/AntiSpoof/Equivalence_sets
You can edit this page. I will update the live copy on request.
Really clever ideas on how to avoid merging sets while maintaining good performance would be appreciated.
Another misfeature in AntiSpoof which was causing false positives was the fact that it merged sequences of repeated characters. For example, Yuma was considered to be equal to Uma, because Y=U (from a transliteration pair), and UUma = Uma. I've removed this behaviour.
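A toy reconstruction of that interaction in Python, assuming just upper-casing plus the old Y=U pair (the extension's real code differs):

import re

TRANSLIT = {'Y': 'U'}  # stand-in for the removed Y=U transliteration pair

def old_canonical(name):
    mapped = ''.join(TRANSLIT.get(c, c) for c in name.upper())
    return re.sub(r'(.)\1+', r'\1', mapped)  # collapse repeats: 'UUMA' -> 'UMA'

assert old_canonical('Yuma') == old_canonical('Uma')  # the spurious collision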
I should really get a blog...
-- Tim Starling
On 12/11/06, Tim Starling tstarling@wikimedia.org wrote:
I should really get a blog...
wikitech.livejournal.com
Even lj user=brionv posts there ... sometimes ...
- d.
Tim Starling wrote:
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
The problem is that merging sets is fairly fundamental to the way AntiSpoof works....
Clearly a more flexible/sophisticated approach, rather than calling all these characters "equivalent", would be to assign some quantitative visual difference between them, and when traversing a chain such as n -> eta -> Eta -> H, to sum the numbers (or something) rather than considering the equivalences to be a purely transitive relationship.
But obviously that's much more expensive than computing, storing, and indexing a single canonical representation for each string.
A hybrid approach I've contemplated (but not implemented, so I can't prove it works) is to use the canonical representations to generate expansive sets of candidate collisions, but then to do a more sophisticated (perhaps distance-based) comparison of just those candidates, to weed out the false positives.
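For concreteness, a minimal Python sketch of that hybrid; canonicalize() and visual_distance() here are invented stand-ins, not AntiSpoof functions:

from collections import defaultdict

def canonicalize(name):
    return name.upper()  # stand-in; the real canonical form folds far more

def visual_distance(a, b):
    # crude stand-in: count differing positions plus any length difference
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

index = defaultdict(list)  # canonical form -> registered usernames

def try_register(name, threshold=1):
    candidates = index[canonicalize(name)]    # pass 1: cheap indexed lookup
    for other in candidates:                  # pass 2: careful comparison
        if visual_distance(name, other) <= threshold:
            return False                      # confusable with an existing name
    index[canonicalize(name)].append(name)
    return True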
Anyone interested in this issue should consult Unicode Technical Standard #39, "Unicode Security Mechanisms", at http://www.unicode.org/reports/tr39/. In particular, its discussion of "confusables" is basically the same issue we're talking about here. See also the Unicode data file "confusables.txt".
On 11/12/06, Steve Summit scs@eskimo.com wrote: [snip]
A hybrid approach I've contemplated (but not implemented, so I can't prove it works) is to use the canonical representations to generate expansive sets of candidate collisions, but then to do a more sophisticated (perhaps distance-based) comparison of just those candidates, to weed out the false positives.
[snip]
Woops. /me reminds self to read thread before replying.
Yes, this is an interesting idea. If anyone codes what's proposed, it would be useful to extend it to support multiple compression functions: for example, in addition to the similar-character metric, a comparison based on double metaphone:
dmetaphone('Sterling') == dmetaphone('Starling') // Indexed lookup
levenshtein('Tim Starling','Tim Sterling') == 1  // Second pass
(I have no clue if PHP has handy standard library functions for dmetaphone and levenshtein distance... I'm using the ones in PostgreSQL.)
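For prototyping outside PHP and PostgreSQL, edit distance is short enough to write by hand; a standard dynamic-programming version in Python:

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, kept to one row of state."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

assert levenshtein('Tim Starling', 'Tim Sterling') == 1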
On 11/12/06, Gregory Maxwell gmaxwell@gmail.com wrote:
(I have no clue if PHP has handy standard library functions for dmetaphone and levenshtein distance... I'm using the ones in PostgreSQL.)
Levenshtein has a library function: http://us2.php.net/manual/en/function.levenshtein.php
DoubleMetaPhone has at least one PHP implementation, which appears to be maybe free enough for us to use (and I'd guess the author would license it to be free enough if it's not): http://swoodbridge.com/DoubleMetaPhone/
Steve Summit wrote:
Tim Starling wrote:
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
The problem is that merging sets is fairly fundamental to the way AntiSpoof works....
Clearly a more flexible/sophisticated approach, rather than calling all these characters "equivalent", would be to assign some quantitative visual difference between them, and when traversing a chain such as n -> eta -> Eta -> H, to sum the numbers (or something) rather than considering the equivalences to be a purely transitive relationship.
But obviously that's much more expensive than computing, storing, and indexing a single canonical representation for each string.
A hybrid approach I've contemplated (but not implemented, so I can't prove it works) is to use the canonical representations to generate expansive sets of candidate collisions, but then to do a more sophisticated (perhaps distance-based) comparison of just those candidates, to weed out the false positives.
I have already discussed something exactly like the above in E-mail.
As you have suggested above, the idea was to use the big dumb equivalence set table as a first hack to spot possible spoof candidates, and then to apply more sophisticated processing using, among other things, the UTR#39 confusables.txt tables on up to N of the spoof candidates, falling back to the dumb algorithm if the number of candidates exceeds N, where N is perhaps 20. (This limit is needed to avoid denial of service attacks via the antispoof algorithm.)
Indeed, if this is implemented, the canonicalization function could be made even more of a catch-all, allowing it to catch even more nasties than the existing code, since the second, more sophisticated, pass would then be able to clean up the larger number of false positives that would be generated by a more aggressive first pass.
I'd be happy to code this up in Python, for translation into PHP.
Anyone interested in this issue should consult Unicode Technical Standard #39, "Unicode Security Mechanisms", at http://www.unicode.org/reports/tr39/. In particular, its discussion of "confusables" is basically the same issue we're talking about here. See also the Unicode data file "confusables.txt".
I'm actively working on this label-spoofing problem for another project, so I'm well aware of UTR #39. As Tim has observed, the current equivalence sets are the transitive closure of the equivalence relations in UTR #39's confusables.txt file (plus some extra nasties), the Unicode uppercasing relationships, and the relationships created by discarding combining marks to uncover the base character. The script-mixing constraints are also taken directly from UTR#39.
I've also got some suggestions that could be added to tighten up the existing integration into MediaWiki, by dealing with a couple of edge cases that are currently less than optimal.
-- Neil
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote: [snip]
The problem is that merging sets is fairly fundamental to the way AntiSpoof works -- i.e. by calculating a canonical representation of the username, storing it and indexing it.
[snip]
Two pass:
Use the current high compression function to locate candidate matches nice and quickly from a non-unique index.
Then take the real potential match names and compare them directly using a more intelligent comparison. (i.e. 'n'!='H').
The compression function could be made more lossy so that it will identify a large but not unreasonable number of potentials.
We could even assign points to various kinds of matches and deny past a threshold. This would also make it easier to support bi/trigram triggers such as cI ~= d ... which perhaps get more interesting when we consider the entire UTF-8 charset.
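A rough Python sketch of the points idea, with invented weights and pair tables; it is framed as a cost, so the fewer points it takes to turn one name into the other, the more confusable they are. Multi-character triggers like cI ~= d would need a real sequence alignment rather than this per-position walk:

VISUAL_PAIRS = {frozenset('lI'), frozenset('O0'), frozenset(('Η', 'H'))}

def confusion_cost(a, b):
    """Total points to turn a into b; None means not a plausible match."""
    if len(a) != len(b):
        return None  # a fuller version would handle insertions/deletions too
    cost = 0
    for x, y in zip(a, b):
        if x == y or x.lower() == y.lower():
            continue                          # identical or case-only: free
        if frozenset((x, y)) in VISUAL_PAIRS:
            cost += 1                         # visually similar: cheap
        else:
            return None                       # unrelated characters
    return cost

def deny_registration(a, b, threshold=2):
    cost = confusion_cost(a, b)
    return cost is not None and cost <= threshold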
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote:
We've been having quite a few complaints about false positives from the AntiSpoof extension -- an extension which attempts to prevent registration
Sorry to post again, another thought on this:
It would probably be useful to exclude from the comparison users with no non-deleted edits. Who cares if someone spoofs a deleted user?
A quick DELETE FROM antispoofwhatevertable WHERE NOT EXISTS (SELECT 1 FROM revision WHERE rev_user_text=whatevertable.user limit 1); or the like would accomplish that and cut down on the false positives.
Gregory Maxwell wrote:
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote:
We've been having quite a few complaints about false positives from the AntiSpoof extension -- an extension which attempts to prevent registration
Sorry to post again, another thought on this:
It would probably be useful to exclude from the comparison users with no non-deleted edits. Who cares if someone spoofs a deleted user?
We can expect that, say, [[User:Jimbo Wales]] won't actually have any edits on the majority of our wikis. Does that mean there's no need to check for spoofs of that username?
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Gregory Maxwell wrote:
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote:
We've been having quite a few complaints about false positives from the AntiSpoof extension -- an extension which attempts to prevent registration
Sorry to post again, another thought on this:
It would probably be useful to exclude from the comparison users with no non-deleted edits. Who cares if someone spoofs a deleted user?
We can expect that, say, [[User:Jimbo Wales]] won't actually have any edits on the majority of our wikis. Does that mean there's no need to check for spoofs of that username?
Yet Another Reason why SUL is needed yesterday...
Alphax (Wikipedia email) wrote:
Brion Vibber wrote:
Gregory Maxwell wrote:
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote:
We've been having quite a few complaints about false positives from the AntiSpoof extension -- an extension which attempts to prevent registration
Sorry to post again, another thought on this:
It would probably be useful to exclude from the comparison users with no non-deleted edits. Who cares if someone spoofs a deleted user?
We can expect that, say, [[User:Jimbo Wales]] won't actually have any edits on the majority of our wikis. Does that mean there's no need to check for spoofs of that username?
Yet Another Reason why SUL is needed yesterday...
I'll be putting up the merging UI for localization & testing tomorrow.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
We can expect that, say, [[User:Jimbo Wales]] won't actually have any edits on the majority of our wikis. Does that mean there's no need to check for spoofs of that username?
Add an exception list. You can't expect a simple heuristic to get it right in every case. [[User:Michael]] once created LiveJournal accounts impersonating me and a number of other English Wikipedia sysops. Do you think any anti-spoof technology on LJ's site would have prevented this?
We have lots of users with no edits. Among them, famous people are surely in the minority.
-- Tim Starling
Tim Starling wrote:
We've been having quite a few complaints about false positives from the AntiSpoof extension -- an extension which attempts to prevent registration of names which are confusingly similar to names already registered. Brion responded to these complaints with "get a sysop to make the account for you", but I don't think that's a very good solution. So I've been working on the AntiSpoof extension today, attempting to make it a bit more relaxed.
The most fundamental problem is the problem of merging sets. Say we want to treat visually similar characters as part of a set, and we also want to treat letters which are the same except for their case as part of a set. So, for example, say we have the following pairs:
Η (capital eta) = H (latin)
Η (capital eta) = η (lowercase eta)
η (lowercase eta) = n (latin)
If we merge all these pairs into a set, following the relations, we obtain the result that latin n is the same as latin H. This is incorrect, and is the cause of most of the bizarre false positives that we see with AntiSpoof.
The problem is that merging sets is fairly fundamental to the way AntiSpoof works -- i.e. by calculating a canonical representation of the username, storing it and indexing it. So it's not going to change any time soon unless we get really clever. But there are some things we can do to minimise its effects.
The first and most obvious thing to do was to remove the transliteration pairs. These are pairs of characters where one member of the pair is a common phonetic transliteration of the other, e.g. cyrillic en "Н" = latin N. This was the cause of most of the spurious conflations between latin characters. This has now been done.
There are now three remaining categories of conflated character pairs: case folding, visual similarity and chinese traditional/simplified conversion.
The second thing to do is to minimise cross-script pairs. Since cross-script usernames are disallowed, cross-script pairs are mostly redundant. You could make a case to leave some of them in, for example some latin usernames can be spoofed entirely using cyrillic characters. And some communities may have a special need for allowing a certain pair of scripts in a username (e.g. latin and hiragana). It's best if we can just keep the pairs which are visually very similar, and consciously avoid including cross-script pairs which will cause false conflations within scripts.
I've done some work on this, but I think it's time to hand over the job to the community, if the community wants it. I've created a page with a big list of pairs, at:
http://www.mediawiki.org/wiki/AntiSpoof/Equivalence_sets
You can edit this page. I will update the live copy on request.
Really clever ideas on how to avoid merging sets while maintaining good performance would be appreciated.
Another misfeature in AntiSpoof which was causing false positives was the fact that it merged sequences of repeated characters. For example, Yuma was considered to be equal to Uma, because Y=U (from a transliteration pair), and UUma = Uma. I've removed this behaviour.
I should really get a blog...
-- Tim Starling
Hi Tim;
I've already thought of this (see my recent E-mail on the Wikitech list -- for some reason, I can't find the lengthy E-mail I thought I'd sent earlier that I refer to there).
Fortunately, not much "real cleverness" is needed.
The basic idea is the one suggested by multiple posters on the list:
* an aggressive canonicalization process (which must still have the transitivity requirement above)
* looking up candidates with matching canonical forms (up to some limit, perhaps 20, to stop denial-of-service attacks)
* if #(candidates) > limit, treat as a spoof, to fail-safe
* then a second pass to do the checking _much_ more carefully, without any need for transitivity or over-compression
I'd be happy to E-mail you an implementation in Python of the very simple but more careful second-pass code, as a function are_confusable_strings() that takes two Python strings as input, and returns a boolean value. This can then be called from the PHP pass.
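Neil's code isn't in the thread, so the following Python sketch is only a guess at the shape of the interface he describes, with a three-entry stand-in for a table that would really be generated from confusables.txt:

import unicodedata

# Illustrative entries only; a real table comes from UTS #39 confusables.txt.
CONFUSABLE_PROTOTYPE = {'Η': 'H', 'η': 'n', 'Н': 'H'}

def skeleton(s):
    """Rough analogue of the UTS #39 'skeleton' transform."""
    s = unicodedata.normalize('NFD', s)
    s = ''.join(c for c in s if not unicodedata.combining(c))  # strip marks
    return ''.join(CONFUSABLE_PROTOTYPE.get(c, c) for c in s)

def are_confusable_strings(a, b):
    """The interface Neil describes: two strings in, boolean out."""
    return skeleton(a) == skeleton(b)

def check_new_name(name, candidates, limit=20):
    """Second pass over the first-pass candidates, failing safe past the limit."""
    if len(candidates) > limit:
        return False  # too many candidates: treat as a spoof, to fail-safe
    return not any(are_confusable_strings(name, c) for c in candidates)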
If we do this, we should be able to make the first pass even more aggressive than it is currently, to catch more possible spoof candidates, whilst still eliminating false positives in the second pass, thus improving both the false-positive and false-negative rates to a fraction of their current levels.
We should _not_ remove the cross-script pairs from the list, as there are still whole-script confusables, e.g. "caxap", "soccer" -- surprisingly, 3% of English dictionary words have matching Cyrillic spoofs, and 1% have Greek spoofs -- however, the second pass should completely eliminate any problems caused by the transitivity in the first pass.
-- Neil
Neil Harris wrote:
Hi Tim;
I've already thought of this (see my recent E-mail on the Wikitech list -- for some reason, I can't find the lengthy E-mail I thought I'd sent earlier that I refer to there).
Fortunately, not much "real cleverness" is needed.
The basic idea is the one suggested by multiple posters on the list:
- an aggressive canonicalization process (which must still have the transitivity requirement above)
- looking up candidates with matching canonical forms (up to some limit, perhaps 20, to stop denial-of-service attacks)
- if #(candidates) > limit, treat as a spoof, to fail-safe
- then a second pass to do the checking _much_ more carefully, without any need for transitivity or over-compression
I'd be happy to E-mail you an implementation in Python of the very simple but more careful second-pass code, as a function are_confusable_strings() that takes two Python strings as input, and returns a boolean value. This can then be called from the PHP pass.
Sure, email away.
If we do this, we should be able to make the first pass even more aggressive than it is currently, to catch more possible spoof candidates, whilst still eliminating false positives in the second pass, thus improving both the false-positive and false-negative rates to a fraction of their current levels.
Generally speaking, you can't tell whether a given pair of names is an attempted spoof just by comparing the strings. You need to know the motivation of the person who created it. On the one hand we have users who want to find the minimal variation of their given name or Internet nickname that isn't already taken, and on the other hand, we have trolls who want to find the minimal variation of an existing username that isn't disallowed by the software. Both users wish to evade the software restrictions, but one of them has a motivation that we will tolerate, and one of them does not.
As Gregory suggested, one useful heuristic would be to look at the number of edits of the target user. Another one that I proposed on IRC yesterday is a length heuristic -- i.e. collisions of short usernames are more likely to be accidental than collisions of long ones.
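A sketch of how those heuristics might combine; the threshold and the exception list (suggested earlier in the thread) are invented for illustration:

EXCEPTIONS = {'Jimbo Wales'}  # names protected even with zero local edits

def collision_is_tolerable(existing_name, existing_edit_count):
    if existing_name in EXCEPTIONS:
        return False                    # always protect listed names
    if existing_edit_count == 0:
        return True                     # nobody there to impersonate locally
    return len(existing_name) <= 6      # short names likely collide by accident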
We should _not_ remove the cross-script pairs from the list, as there are still whole-script confusables, e.g. "caxap", "soccer" -- surprisingly, 3% of English dictionary words have matching Cyrillic spoofs, and 1% have Greek spoofs -- however, the second pass should completely eliminate any problems caused by the transitivity in the first pass.
We have to remove some of the cross-script pairs until the software is changed, to fix the spurious within-script conflations. I'm not going to make everyone suffer while we have our leisurely chat about possible long-term fixes.
There is a need for judgement, regardless of the software in use. Trolls will go on trolling regardless of what anti-spoofing restrictions we have in place. Our aim should be to minimise their impact, and heuristic systems with a high false positive rate do quite the opposite.
-- Tim Starling
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote: [snip]
There is a need for judgement, regardless of the software in use. Trolls will go on trolling regardless of what anti-spoofing restrictions we have in place. Our aim should be to minimise their impact, and heuristic systems with a high false positive rate do quite the opposite.
This note brings to mind an interesting homework assignment for the list...
Can we think of a good way to implement "interactive intervention" in MediaWiki which neither adds weird backend requirements (works with the non-persistence of PHP) nor odd client requirements (no Java or the like)?
The idea is that we have hundreds of people in IRC... many people RC patrolling. There are *many* sorts of activities which software can mark as suspect but which require judgement. Is there a reasonable way for us to get that judgement in real-time?
Gregory Maxwell wrote:
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote: [snip]
There is a need for judgement, regardless of the software in use. Trolls will go on trolling regardless of what anti-spoofing restrictions we have in place. Our aim should be to minimise their impact, and heuristic systems with a high false positive rate do quite the opposite.
This note brings to mind an interesting homework assignment for the list...
Can we think of a good way to implement "interactive intervention" in MediaWiki which neither adds weird backend requirements (works with the non-persistence of PHP) nor odd client requirements (no Java or the like)?
The idea is that we have hundreds of people in IRC... many people RC patrolling. There are *many* sorts of activities which software can mark as suspect but which require judgement. Is there a reasonable way for us to get that judgement in real-time?
But that's easy...
* User tries to create an account
* Software responds, "The username you chose is very similar to the username of an existing user. In order to ensure that you are not trying to impersonate someone else, an administrator will have to approve your username manually. Approval is usually processed within <average timeframe>. How do you wish to proceed?" [ Request approval ] [ Try a different username ]
* User clicks "Request approval". Software responds, "Your request for approval has been sent off to the administrators. You will receive an e-mail as soon as approval has been granted or rejected."
* Either an e-mail is sent to a mailing list, or a wiki page is updated, or (my preferred way) a special dedicated feature in MediaWiki is invoked, which alerts volunteers to the awaiting approval.
* An administrator accepts or rejects the request. If it is accepted, the normal welcome e-mail with the confirmation link is sent to the user. Otherwise, an e-mail informs the user of the rejection.
Timwi
"Timwi" timwi@gmx.net wrote in message news:ejd0fd$pug$1@sea.gmane.org...
Gregory Maxwell wrote:
On 11/12/06, Tim Starling tstarling@wikimedia.org wrote: [snip]
There is a need for judgement, regardless of the software in use. Trolls will go on trolling regardless of what anti-spoofing restrictions we have in place. Our aim should be to minimise their impact, and heuristic systems with a high false positive rate do quite the opposite.
This note brings to mind an interesting homework assignment for the list...
Can we think of a good way to implement "interactive intervention" in MediaWiki which neither adds weird backend requirements (works with the non-persistence of PHP) nor odd client requirements (no Java or the like)?
The idea is that we have hundreds of people in IRC... many people RC patrolling. There are *many* sorts of activities which software can mark as suspect but which require judgement. Is there a reasonable way for us to get that judgement in real-time?
But that's easy...
- User tries to create an account
- Software responds, "The username you chose is very similar to the username of an existing user. In order to ensure that you are not trying to impersonate someone else, an administrator will have to approve your username manually. Approval is usually processed within <average timeframe>. How do you wish to proceed?" [ Request approval ] [ Try a different username ]
- User clicks "Request approval". Software responds, "Your request for approval has been sent off to the administrators. You will receive an e-mail as soon as approval has been granted or rejected."
- Either an e-mail is sent to a mailing list, or a wiki page is updated, or (my preferred way) a special dedicated feature in MediaWiki is invoked, which alerts volunteers to the awaiting approval.
- An administrator accepts or rejects the request. If it is accepted, the normal welcome e-mail with the confirmation link is sent to the user. Otherwise, an e-mail informs the user of the rejection.
Timwi
This requires that the user supplies an e-mail address - not currently a requirement, so far as I know...
-- - Mark Clements (HappyDog)
Mark Clements wrote:
"Timwi" timwi@gmx.net wrote in message news:ejd0fd$pug$1@sea.gmane.org...
Gregory Maxwell wrote:
Can we think of a good way to implement "interactive intervention" in MediaWiki which neither adds weird backend requirements (works with the non-persistence of PHP) nor odd client requirements (no Java or the like)?
The idea is that we have hundreds of people in IRC... many people RC patrolling. There are *many* sorts of activities which software can mark as suspect but which require judgement. Is there a reasonable way for us to get that judgement in real-time?
But that's easy... [long suggestion snipped]
This requires that the user supplies an e-mail address - not currently a requirement, so far as I know...
Do you have any better ideas?
Note that it is still perfectly possible to register without an e-mail address if you don't trigger the anti-spoof system, which I would hope would be the vast majority of cases. If someone is determined to have a certain username because it's their Internet handle (and not because they're trying to impersonate someone), they normally wouldn't mind supplying at least a temporary e-mail address.
Timwi
Is this a good time to mention that there's a very nice OpenID extension to MediaWiki?
See in action here: http://openid.net/wiki/
On Nov 15, 2006, at 12:17, Timwi wrote:
Mark Clements wrote:
"Timwi" timwi@gmx.net wrote in message news:ejd0fd$pug$1@sea.gmane.org...
Gregory Maxwell wrote:
Can we think of a good way to implement "interactive intervention" in MediaWiki which neither adds weird backend requirements (works with the non-persistence of PHP) nor odd client requirements (no Java or the like)?
The idea is that we have hundreds of people in IRC... many people RC patrolling. There are *many* sorts of activities which software can mark as suspect but which require judgement. Is there a reasonable way for us to get that judgement in real-time?
But that's easy... [long suggestion snipped]
This requires that the user supplies an e-mail address - not currently a requirement, so far as I know...
Do you have any better ideas?
Note that it is still perfectly possible to register without an e-mail address if you don't trigger the anti-spoof system, which I would hope would be the vast majority of cases. If someone is determined to have a certain username because it's their Internet handle (and not because they're trying to impersonate someone), they normally wouldn't mind supplying at least a temporary e-mail address.
Timwi
On Wed, 2006-15-11 at 14:00 -0800, Johannes Ernst wrote:
Is this a good time to mention that there's a very nice OpenID extension to MediaWiki?
It's _always_ a good time to compliment my code. Thanks!
If you need a better explanatory URL, the main homepage is here:
http://www.mediawiki.org/wiki/OpenID_Extension
See in action here: http://openid.net/wiki/
It's also in production on Wikitravel.
http://wikitravel.org/en/Wikitravel:Single_sign-on_help
I don't think that OpenID is a realistic fix for username spoofing (since there are at least half-a-dozen free OpenID identity services), but it's always nice to mention.
Brion has said a few times that OpenID will be rolled out on Wikimedia sites once the Single User Login is in place.
-Evan
On 11/15/06, Timwi timwi@gmx.net wrote:
This requires that the user supplies an e-mail address - not currently a requirement, so far as I know...
Do you have any better ideas?
Note that it is still perfectly possible to register without an e-mail address if you don't trigger the anti-spoof system, which I would hope would be the vast majority of cases. If someone is determined to have a certain username because it's their Internet handle (and not because they're trying to impersonate someone), they normally wouldn't mind supplying at least a temporary e-mail address.
The whole point of my 'real time challenge' was to avoid email. Email is store and forward: i.e. not real-time. :)
I was hoping for something along the lines of sticking them on an auto refreshing page (or some kinda ajax spinner) while a notice was directed to IRC.
I believe we have the human resources on at least some of our projects to do some amount of 'real time' work. The real-time work might be things like:
# Account approval
# Edit approval (i.e. detect penis penis penis and hold their edit until someone approves it, or perhaps a new sort of protection...)
# Upload approval (users' first upload can be inspected while they wait)
A successful system would integrate well with MediaWiki (I consider this the hard part), be easy for our (impatient) users, and be able to degrade gracefully. (Track the backlog and if it grows too much, start directing people to the old non-real-time methods rather than leaving people waiting forever.)
Gregory Maxwell wrote: <snip>
# Edit approval (i.e. detect penis penis penis and hold their edit until someone approves it, or perhaps a new sort of protection...)
You seem to have an unhealthy obsession with that...
# Upload approval (users first upload can be inspected while they wait)
Could be useful on Commons. Also useful would be a "mark this image as being sane" function, with a queue, that would prevent it from being displayed to non-sysops until reviewed. End of image vandalism forever! End of horribly blatant copyvios too...
Of course, the hard part is getting it written and integrated. Hrm.
On 16/11/06, Alphax (Wikipedia email) alphasigmax@gmail.com wrote:
Could be useful on Commons. Also useful would be a "mark this image as being sane" function, with a queue, that would prevent it from being displayed to non-sysops until reviewed. End of image vandalism forever! End of horribly blatant copyvios too...
It's a good thing Commons has lots of sysops spare!
- d.
On Mon, Nov 13, 2006 at 03:15:01PM +1100, Tim Starling wrote:
Generally speaking, you can't tell whether a given pair of names is an attempted spoof just by comparing the strings. You need to know the motivation of the person who created it.
I think there's a function for that in the Python 2.6 libraries...
Cheers, -- jra