Spam filters for wikidata.org

Daniel Kinzler

4 Dec 2012

10:52 a.m.

Hi!

Once wikidata.org allows for entry of arbitrary properties, we will need some protection against spam. However, there is a nasty little problem with making SpamBlacklist, AntiBot, AbuseFilter etc work with Wikidata content:

Wikibase implements editing directly via the API, without using EditPage. But the spam filters usually hook into EditPage, typically using the EditFilter or EditFilterMerged hooks, or EditFilterMergedContent.

Wikibase has a utility class called EditEntity which implements many things otherwise done by the EditPage: token checks, conflict detection and resolution, permission checks, etc. We could just trigger EditFilterMergedContent there, and also EditFilterMerged and EditFilter, though we would have to fake the "text" for these.

There is one problem with this though: These hooks take as their first parameter an EditPage object, and the handler functions defined in the various extensions make use of this. Often, just to get the context, like page title, etc - but often enough also for non-trivial things, like calling EditPage::spamPage() or even EditPage::spamPageWithContent().

How can we handle this? I see several possibilities:

1) change the definition of the hook so it just has a ContextSource as its first parameter, and fix all extensions that use the hook. However, it is unclear how functionality like EditPage::spamPageWithContent() can then be implemented. EditPage::spamPage() could be moved to a utility class, or into OutputPage.

2) emulate an EditPage object, using a proxy/stub/dummy object. This would need a bit of coding, and it's prone to get out of sync with the real EditPage. But things like spamPageWithContent() could be implemented nicely, in a content model specific manner.

3) we could instantiate a dummy EditPage, and pass that to the hooks. But EditPage doesn't support non-text content, and even if we force it, we are likely to end up with an edit field full of JSON, if we are not very careful.

4) just add another hook, similar to EditFilterMergedContent, but more generic, and call it in EditEntity (and perhaps also in EditPage!). If we want a spam filter extension to work with non-text content, it will have to implement that new hook.
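
Option 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: the `HookRegistry` class stands in for MediaWiki's hook mechanism, and the hook name `ContentEditFilter`, its parameters, and the two `saveFrom...` functions are hypothetical, not existing core or Wikibase code.

```php
<?php
// Stand-in for MediaWiki's hook registry ($wgHooks plus wfRunHooks()).
final class HookRegistry {
    /** @var array<string, callable[]> */
    private array $handlers = [];

    public function register(string $hook, callable $handler): void {
        $this->handlers[$hook][] = $handler;
    }

    /** Runs every handler; returns the first error key, or null if all pass. */
    public function run(string $hook, array $args): ?string {
        foreach ($this->handlers[$hook] ?? [] as $handler) {
            $error = $handler(...$args);
            if ($error !== null) {
                return $error;
            }
        }
        return null;
    }
}

$hooks = new HookRegistry();

// A spam filter registers once; it receives title, content model and
// serialized content, and never sees an EditPage object.
$hooks->register(
    'ContentEditFilter', // hypothetical hook name
    function (string $title, string $model, string $serialized): ?string {
        return stripos($serialized, 'spam.example') !== false ? 'spam-blocked' : null;
    }
);

// Both editing paths call the same hook before saving.
function saveFromEditPage(HookRegistry $hooks, string $title, string $wikitext): ?string {
    return $hooks->run('ContentEditFilter', [$title, 'wikitext', $wikitext]);
}

function saveFromEditEntity(HookRegistry $hooks, string $title, array $entity): ?string {
    return $hooks->run('ContentEditFilter', [$title, 'wikibase-item', json_encode($entity)]);
}

$wikitextResult = saveFromEditPage($hooks, 'Sandbox', 'Hello world');
$entityResult = saveFromEditEntity($hooks, 'Q42', ['label' => ['en' => 'visit spam.example now']]);
```

In this sketch the filter only depends on the hook's parameter list, so it works for text and non-text content alike, at the price of seeing the entity only in serialized form.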

What's the best option, do you think?

There's another closely related problem, btw: showing captchas. How can that be implemented at all for API based, atomic edits? Would the API return a special error, which includes a link to the captcha image as a challenge? And then require the captcha's solution via some special arguments to the module call? How can an extension control this? How is this done for the API's action=edit at present?

thanks, daniel

Brad Jorsch

4 Dec

3:38 p.m.

On Tue, Dec 4, 2012 at 4:52 AM, Daniel Kinzler daniel@brightbyte.de wrote:

...

There's another closely related problem, btw: showing captchas. How can that be implemented at all for API based, atomic edits? Would the API return a special error, which includes a link to the captcha image as a challenge? And then require the captcha's solution via some special arguments to the module call? How can an extension control this? How is this done for the API's action=edit at present?

The ConfirmEdit extension hooks APIGetAllowedParams and APIGetParamDescription to add its info to the help output, and APIEditBeforeSave to check the captcha and/or add the captcha items to the response.
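
The resulting challenge-and-retry flow might look roughly like the sketch below. It is a self-contained simulation, not ConfirmEdit code: the `apiEdit` function and the in-memory captcha store are hypothetical, and the `captchaid` / `captchaword` parameter names and the `captcha` member of the response should be treated as assumptions about the response format.

```php
<?php
// Hypothetical in-memory captcha store: challenge id => expected answer.
$captchas = ['c1' => '7'];

/**
 * Stand-in for an edit API module guarded by a captcha check.
 * Returns the result array an API module would serialize.
 */
function apiEdit(array $params, array $captchas): array {
    $id = $params['captchaid'] ?? null;
    $word = $params['captchaword'] ?? null;

    // A valid solution was supplied with the request: let the edit through.
    if ($id !== null && isset($captchas[$id]) && $captchas[$id] === $word) {
        return ['result' => 'Success'];
    }

    // No valid solution: fail the edit and attach a challenge to the response.
    return [
        'result' => 'Failure',
        'captcha' => ['type' => 'simple', 'id' => 'c1', 'question' => '3+4'],
    ];
}

// First attempt carries no captcha parameters, so it gets a challenge back.
$first = apiEdit(['text' => 'new claim'], $captchas);

// The client repeats the same request, adding the challenge id and the answer.
$second = apiEdit(
    ['text' => 'new claim', 'captchaid' => $first['captcha']['id'], 'captchaword' => '7'],
    $captchas
);
```

The edit stays atomic in this model: the first request is rejected as a whole, and the second request is a complete edit that additionally carries the solution.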

Matthew Flaschen

6:20 p.m.

On 12/04/2012 04:52 AM, Daniel Kinzler wrote:

...

just add another hook, similar to EditFilterMergedContent, but more generic, and call it in EditEntity (and perhaps also in EditPage!). If we want a spam filter extension to work with non-text content, it will have to implement that new hook.

I think that makes sense. The spam filters will work best if they are aware of how wikidata works, and have access to the full JSON information of the change.

Matt Flaschen

Daniel Kinzler

5 Dec

12:34 p.m.

On 04.12.2012 18:20, Matthew Flaschen wrote:

...

On 12/04/2012 04:52 AM, Daniel Kinzler wrote:

...

just add another hook, similar to EditFilterMergedContent, but more generic, and call it in EditEntity (and perhaps also in EditPage!). If we want a spam filter extension to work with non-text content, it will have to implement that new hook.

I think that makes sense. The spam filters will work best if they are aware of how wikidata works, and have access to the full JSON information of the change.

You really want the spam filter extensions to have internal knowledge of Wikibase? That seems like a nasty cross-dependency, and goes directly against the idea of modularization and separation of concerns...

We are running into the "glue code problem" here. We need code that knows about the spam filters and about wikibase. Should it be in the spam filter, in Wikibase, or in a separate, third extension? That would be cleanest, but a hassle to maintain... Which way would you prefer?

-- daniel

Chris Steipp

6:28 p.m.

On Wed, Dec 5, 2012 at 3:34 AM, Daniel Kinzler daniel@brightbyte.de wrote:

...

You really want the spam filter extensions to have internal knowledge of Wikibase? That seems like a nasty cross-dependency, and goes directly against the idea of modularization and separation of concerns...

We are running into the "glue code problem" here. We need code that knows about the spam filters and about wikibase. Should it be in the spam filter, in Wikibase, or in a separate, third extension? That would be cleanest, but a hassle to maintain... Which way would you prefer?

I think Daniel has correctly stated the problem.

My perspective:

One of the directions of the Admin Tools project is to combine some of the various tools into AbuseFilter, so I think it's safe to assume that AbuseFilter will be around and maintained for some time, and Wikidata could easily use the hooks it provides to do a lot of the work providing the interface. That being said, expanding AbuseFilter to work on non-article data has already been requested a few times, so I think we can make AbuseFilter much easier for Wikidata, and AFT to plug into.

Maybe to start with, we can find out which AbuseFilter functionality is common to AFT and Wikibase, and try to build most of the overlapping pieces into AbuseFilter. Then each can also use the AbuseFilter hooks to complete the functionality?

Matthew Flaschen

10:11 p.m.

On 12/05/2012 12:28 PM, Chris Steipp wrote:

...

On Wed, Dec 5, 2012 at 3:34 AM, Daniel Kinzler daniel@brightbyte.de wrote:

...
You really want the spam filter extensions to have internal knowledge of Wikibase? That seems like a nasty cross-dependency, and goes directly against the idea of modularization and separation of concerns...

We are running into the "glue code problem" here. We need code that knows about the spam filters and about wikibase. Should it be in the spam filter, in Wikibase, or in a separate, third extension? That would be cleanest, but a hassle to maintain... Which way would you prefer?

I think Daniel has correctly stated the problem.

My perspective:

One of the directions of the Admin Tools project is to combine some of the various tools into AbuseFilter, so I think it's safe to assume that AbuseFilter will be around and maintained for some time, and Wikidata could easily use the hooks it provides to do a lot of the work providing the interface.

It makes sense for AbuseFilter and Wikidata to work in conjunction. But it seems Wikidata should provide a hook that AbuseFilter calls.

What if someone wants to make a spam filter that works differently than AbuseFilter? For example, it uses its own programmatic rules rather than ones that can be expressed in the Special:AbuseFilter language.

If Wikidata exposes an API, AbuseFilter and other extensions can use it.

Matt Flaschen

Chris Steipp

11:54 p.m.

On Wed, Dec 5, 2012 at 1:11 PM, Matthew Flaschen mflaschen@wikimedia.org wrote:

...

It makes sense for AbuseFilter and Wikidata to work in conjunction. But it seems Wikidata should provide a hook that AbuseFilter calls.

I think we agree on this point, although I'll clarify and say I think AbuseFilter should be calling wfRunHooks, and Wikibase should provide the functions. I think more 3rd-party wikis will run AbuseFilter than Wikibase, but that could be my prejudice based on what I work on.

...

What if someone wants to make a spam filter that works differently than AbuseFilter? For example, it uses its own programmatic rules rather than ones that can be expressed in the Special:AbuseFilter language.

You are correct, AbuseFilter doesn't currently have hooks to let an extension run its own logic, but that wouldn't be too difficult to implement. Maybe run a new hook from AbuseFilter::checkConditions? Although I would be interested to know what kind of rules you have in mind, since it's certainly possible that we would want to implement it as an AbuseFilter operation.

Matthew Flaschen

6 Dec

12:53 a.m.

On 12/05/2012 05:54 PM, Chris Steipp wrote:

...

On Wed, Dec 5, 2012 at 1:11 PM, Matthew Flaschen mflaschen@wikimedia.org wrote:

...
It makes sense for AbuseFilter and Wikidata to work in conjunction. But it seems Wikidata should provide a hook that AbuseFilter calls.

I think we agree on this point, although I'll clarify and say I think AbuseFilter should be calling wfRunHooks, and Wikibase should provide the functions.

No, we disagree on this.

Wikibase should call wfRunHooks. This is analogous to the way it is now for regular wikitext.

For example, AbuseFilter has:

$wgHooks['EditFilterMerged'][] = 'AbuseFilterHooks::onEditFilterMerged';

Then, core MediaWiki calls:

if ( !wfRunHooks( 'EditFilterMerged', array( $this, $this->textbox1, &$this->hookError, $this->summary ) ) ) {

The same general idea should apply for Wikibase. The only difference is that the core functionality of data editing is in Wikibase.

Thus, Wikibase should call wfRunHooks for this.

...

...
What if someone wants to make a spam filter that works differently than AbuseFilter? For example, it uses its own programmatic rules rather than ones that can be expressed in the Special:AbuseFilter language.

You are correct, AbuseFilter doesn't currently have hooks to let an extension run its own logic, but that wouldn't be too difficult to implement.

I don't think it necessarily needs one. A spam filter with a different approach (which may not have a rule UI at all) can register its own hooks, just as AbuseFilter does.

...

Although I would be interested to know what kind of rules you have in mind, since it's certainly possible that we would want to implement it as an AbuseFilter operation.

I don't have an immediate practical suggestion. But I do know that modern spam filters use a variety of approaches, including Bayesian filtering.

Matt Flaschen

Chris Steipp

1:55 a.m.

On Wed, Dec 5, 2012 at 3:53 PM, Matthew Flaschen mflaschen@wikimedia.org wrote:

...

No, we disagree on this.

I was afraid that might be the case, so I'm glad we clarified.

...

The same general idea should apply for Wikibase. The only difference is that the core functionality of data editing is in Wikibase.

Correct, and I would say that Wikibase should be calling the same hooks that core does, so that AbuseFilter can be used to filter all incoming data. If Wikibase wants to define another hook, and can present the data in a generic way (like Daniel did for content handler) we can probably add it into AbuseFilter. But if the processing is specific to Wikibase (you pass an Entity into the hook, for example), then AbuseFilter shouldn't be hooking into something like that, since it would basically make Wikibase a dependency, and I do think that more independent wikis are likely to have AbuseFilter installed without Wikibase than with it.

...

I don't think it necessarily needs one. A spam filter with a different approach (which may not have a rule UI at all) can register its own hooks, just as AbuseFilter does.

I can definitely appreciate that, but that is also why we currently have so many extensions for spam / bot handling, using the existing hooks. I would hate to see yet another spam extension that does really great spam detection, but has a dependency on Wikibase.

But that's just my preference.

Matthew Flaschen

2:53 a.m.

On 12/05/2012 07:55 PM, Chris Steipp wrote:

...

If Wikibase wants to define another hook, and can present the data in a generic way (like Daniel did for content handler) we can probably add it into AbuseFilter.

It should be presented in a suitable way (not obscure Wikibase internal structures), that still includes the necessary information.

...

But if the processing is specific to Wikibase (you pass an Entity into the hook, for example), then AbuseFilter shouldn't be hooking into something like that, since it would basically make Wikibase a dependency, and I do think that more independent wikis are likely to have AbuseFilter installed without Wikibase than with it.

AbuseFilter would not depend on Wikibase if AbuseFilter only hooks into it.

It's fine for you to register a hook that is never called:

$wgHooks[ 'WikibaseEditFilterMerged' ][] = 'AbuseFilter::onWikibaseEditFilterMerged';

will not cause an error if Wikibase is not installed. onWikibaseEditFilterMerged would then transform the data and call internal AbuseFilter functions/methods.
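
A small self-contained simulation of this point, as a sketch only: `runHooks` is a simplified stand-in for wfRunHooks, the closure replaces the `AbuseFilter::onWikibaseEditFilterMerged` method named above, and `filterText` with its single rule is a hypothetical placeholder for the filter's existing text-based logic.

```php
<?php
// Stand-in for MediaWiki's global hook table.
$wgHooks = [];

// Simplified stand-in for wfRunHooks(): false means a handler rejected the edit.
function runHooks(array $registry, string $name, array $args): bool {
    foreach ($registry[$name] ?? [] as $handler) {
        if (call_user_func_array($handler, $args) === false) {
            return false;
        }
    }
    return true;
}

// Existing, text-based filter logic inside the spam filter (hypothetical rule).
function filterText(string $text): bool {
    return stripos($text, 'spam.example') === false;
}

$calls = 0;
// Registered unconditionally; harmless if nothing ever runs this hook.
$wgHooks['WikibaseEditFilterMerged'][] = function (array $change) use (&$calls): bool {
    $calls++;
    // Glue: turn the structured change into text for the existing filter.
    return filterText(json_encode($change));
};

// Wiki without Wikibase: the hook is never run, so the handler is never called.
$callsWithoutWikibase = $calls;

// Wiki with Wikibase: the extension runs the hook when an entity is saved.
$accepted = runHooks($wgHooks, 'WikibaseEditFilterMerged', [['label' => 'Berlin']]);
$rejected = runHooks($wgHooks, 'WikibaseEditFilterMerged', [['label' => 'see spam.example']]);
```

Registration is just an entry in an array, so the filter needs no check for whether Wikibase is installed; only the handler body has to know the shape of the data it is given.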

...

...
I don't think it necessarily needs one. A spam filter with a different approach (which may not have a rule UI at all) can register its own hooks, just as AbuseFilter does.

I can definitely appreciate that, but that is also why we currently have so many extensions for spam / bot handling, using the existing hooks. I would hate to see yet another spam extension that does really great spam detection, but has a dependency on Wikibase.

I think inevitably different people are going to address the spam challenge differently. By using hooks, though, that great extension does not need a hard dependency on Wikibase.

Matt Flaschen

Daniel Kinzler

11:18 a.m.

On 06.12.2012 01:55, Chris Steipp wrote:

...

...
The same general idea should apply for Wikibase. The only difference is that the core functionality of data editing is in Wikibase.

Correct, and I would say that Wikibase should be calling the same hooks that core does, so that AbuseFilter can be used to filter all incoming data.

That would be great, but as I pointed out in my original mail, not really possible: the existing hooks guarantee an EditPage as a parameter. There is no EditPage when editing Wikibase content, and I can see no sensible way to create one for this purpose.

...

If Wikibase wants to define another hook, and can present the data in a generic way (like Daniel did for content handler) we can probably add it into AbuseFilter.

We can present (some of) the data as plain text, but that removes a lot of information that could be used for spam detection. Maybe AbuseFilter is flexible enough to be able to handle more aspects using "variables". But that would require Wikibase to know about AbuseFilter, and specifically cater to it (or the other way around).
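
To make the trade-off concrete, here is a rough sketch of both views of one change: a set of named variables and a lossy plain-text rendering. The JSON shape and the `flattenChange` helper are invented for illustration and are not the actual Wikibase change format or an AbuseFilter interface.

```php
<?php
/**
 * Flattens a nested change structure into dotted-path variables,
 * e.g. ['labels' => ['en' => 'x']] becomes ['labels.en' => 'x'].
 */
function flattenChange(array $data, string $prefix = ''): array {
    $vars = [];
    foreach ($data as $key => $value) {
        $path = $prefix === '' ? (string)$key : $prefix . '.' . $key;
        if (is_array($value)) {
            // Recurse into nested structures; paths are unique, so union is safe.
            $vars += flattenChange($value, $path);
        } else {
            $vars[$path] = $value;
        }
    }
    return $vars;
}

// Hypothetical JSON shape; not the actual Wikibase change format.
$json = '{"entity":"q42","labels":{"en":"Douglas Adams"},'
    . '"claims":[{"property":"p31","value":"http://spam.example/"}]}';

// Variable view: each value keeps its position in the structure.
$vars = flattenChange(json_decode($json, true));

// Plain-text view for text-only filters: values joined by newlines,
// which loses the information about where each value came from.
$text = implode("\n", array_map('strval', $vars));
```

A text-only filter can still catch the URL in `$text`, but only the variable view lets a rule distinguish a URL in a claim value from the same string in a label.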

...

But if the processing is specific to Wikibase (you pass an Entity into the hook, for example), then AbuseFilter shouldn't be hooking into something like that, since it would basically make Wikibase a dependency, and I do think that more independent wikis are likely to have AbuseFilter installed without Wikibase than with it.

No, that is not a dependency in the strong sense; you could easily run one without the other. But it does imply knowledge. So, should Wikibase have knowledge of, and contain code specific to, AbuseFilter, or the other way around?

Honestly, I don't like either very much.

...

...
I don't think it necessarily needs one. A spam filter with a different approach (which may not have a rule UI at all) can register its own hooks, just as AbuseFilter does.

But then Wikibase needs to know about each of them, and implement hook handlers for each. Or am I misunderstanding you?

So... we are still facing the Glue Code Dilemma.

-- daniel

Matthew Flaschen

5 Dec

10:06 p.m.

On 12/05/2012 06:34 AM, Daniel Kinzler wrote:

...

...
I think that makes sense. The spam filters will work best if they are aware of how wikidata works, and have access to the full JSON information of the change.

You really want the spam filter extensions to have internal knowledge of Wikibase? That seems like a nasty cross-dependency, and goes directly against the idea of modularization and separation of concerns...

I agree it should not have internal implementation knowledge. I meant how it works in a different sense.

More specifically, what if Wikidata exposed a JSON object representing an external version of each change (essentially a data API).

It could allow hooks to register for this (I think this is similar to the EditEntity idea).

Matt Flaschen

Daniel Kinzler

6 Dec

11:22 a.m.

On 05.12.2012 22:06, Matthew Flaschen wrote:

...

More specifically, what if Wikidata exposed a JSON object representing an external version of each change (essentially a data API).

This already exists, that's more or less how changes get pushed to client wikis.

...

It could allow hooks to register for this (I think this is similar to the EditEntity idea).

Pretty much the same, actually, yes. Wikibase defines a hook and provides the data structure. Then, AbuseFilter would need knowledge about Wikibase's data model(s).

-- daniel

Matthew Flaschen

9:48 p.m.

On 12/06/2012 05:22 AM, Daniel Kinzler wrote:

...

On 05.12.2012 22:06, Matthew Flaschen wrote:

...
More specifically, what if Wikidata exposed a JSON object representing an external version of each change (essentially a data API).

This already exists, that's more or less how changes get pushed to client wikis.

...
It could allow hooks to register for this (I think this is similar to the EditEntity idea).

Pretty much the same, actually, yes. Wikibase defines a hook and provides the data structure. Then, AbuseFilter would need knowledge about Wikibase's data model(s).

Right, but as you said that doesn't introduce a strong/hard dependency. I think this is the best solution.

Matt Flaschen

wikitech-l@lists.wikimedia.org