Hi,
I am Anubhav Agarwal, a B.Tech 4th-year student at IIT Roorkee. I wish to apply for GSoC 2013, and I am considering a Bayesian spam filter as my project. I have drafted the idea on my talk page (http://www.mediawiki.org/wiki/User:Anubhav_iitr).
I request you to go through this and give your suggestions on it.
Hoping for good feedback.
Regards, Anubhav
Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee
Hi Anubhav,
On 04/07/2013 06:05 PM, anubhav agarwal wrote:
Hi,
I am Anubhav Agarwal, a B.Tech 4th-year student at IIT Roorkee. I wish to apply for GSoC 2013, and I am considering a Bayesian spam filter as my project. I have drafted the idea on my talk page (http://www.mediawiki.org/wiki/User:Anubhav_iitr).
I have done a first reality check with Chris Steipp, who oversees the area of security and also spam prevention. Your idea is interesting and it seems to be feasible. This is a very good first step!
It would require adding a hook to MediaWiki core, but this could be a small, acceptable change. The rest could be developed as an extension of the ConfirmEdit extension.
It might have a performance penalty in a site like English Wikipedia with plenty of concurrent edits, but for starters it could be potentially useful to the 99% of MediaWiki instances that have a significantly smaller number of daily edits and especially a very small number of editors and tools able / happy to deal with spam.
As a next step, please
1. Create a subpage for your proposal e.g. http://www.mediawiki.org/wiki/User:Anubhav_iitr/Bayesan_spam_filter
2. File an enhancement request at https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions under "Extensions requests" explaining your proposal and linking to the related wiki page.
3. Reply to this thread sharing the link to the bug report so anybody interested can watch it.
I request you to go through this and give your suggestions on it.
Yes, but you will get more feedback if you are diligent in answering the feedback you receive:
http://www.mediawiki.org/wiki/User_talk:Anubhav_iitr :)
On 09/04/13 18:20, Quim Gil wrote:
Hi Anubhav,
I have done a first reality check with Chris Steipp, who oversees the area of security and also spam prevention. Your idea is interesting and it seems to be feasible. This is a very good first step!
It would require adding a hook to MediaWiki core, but this could be a small, acceptable change.
I agree. Adding a hook is no problem.
The rest could be developed as an extension of the ConfirmEdit extension.
I'm not sure on adding it to ConfirmEdit. I would develop it as an independent extension, which could then hook into ConfirmEdit or AbuseFilter.
Anubhav wrote:
Tasks
Create a tool for wiki users to report spam: a simple way to train a Bayesian DB. This should be accessible to any user with permission to "undo" or "rollback" those changes or to delete the new page/file. Understand the metadata (IP, links, user) I can extract from the data (perhaps harnessing other services like blacklists).
I think it would be more interesting if it could be trained automatically. Perhaps by automatically learning rollbacks as "wrong". Maybe there could be a checkbox to "train as spam" when doing a revert, but I would avoid anything complex like "Go to Special:TrainSpam and enter the revision number to mark as spam".
Good luck!
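For readers following the thread, the training idea discussed above can be illustrated with a minimal sketch. It assumes a plain multinomial naive Bayes model with add-one smoothing over word tokens; the class name, method names, and tokenizer are hypothetical and are not taken from the proposal or from any MediaWiki extension. A "train as spam" checkbox on a revert would map to a call like `train(text, is_spam=True)`.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Tiny multinomial naive Bayes classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        # Per-class token counts and per-class document counts (True = spam, False = ham).
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    @staticmethod
    def tokenize(text):
        # Lowercased word tokens; metadata such as links or IPs is ignored in this sketch.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, is_spam):
        # Record one labelled edit, e.g. from a "train as spam" checkbox on a revert.
        self.doc_counts[is_spam] += 1
        self.word_counts[is_spam].update(self.tokenize(text))

    def spam_probability(self, text):
        total_docs = sum(self.doc_counts.values())
        if total_docs == 0:
            return 0.5  # no training data yet, so no opinion
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        tokens = self.tokenize(text)
        log_scores = {}
        for label in (True, False):
            # Smoothed class prior.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                # Add-one smoothing, with one extra slot for unseen tokens.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalize the two log scores into a probability of spam.
        top = max(log_scores.values())
        spam = math.exp(log_scores[True] - top)
        ham = math.exp(log_scores[False] - top)
        return spam / (spam + ham)
```

With a handful of labelled edits, `spam_probability` returns a value above 0.5 for text dominated by tokens seen mostly in spam and below 0.5 for text resembling accepted edits; where to put the decision threshold is left open here.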
On 2013-04-12 7:33 PM, "Platonides" Platonides@gmail.com wrote:
On 09/04/13 18:20, Quim Gil wrote:
Hi Anubhav,
I have done a first reality check with Chris Steipp, who oversees the area of security and also spam prevention. Your idea is interesting and it seems to be feasible. This is a very good first step!
It would require adding a hook to MediaWiki core, but this could be a small, acceptable change.
I agree. Adding a hook is no problem.
Well, a hook is obviously no problem, but I'm not sure why a new one would be needed. Surely if AbuseFilter has all the hooks it needs, so would this.
Qgill wrote:
It might have a performance penalty in a site like English Wikipedia with plenty of concurrent edits, but for starters it could be potentially useful to the 99% of MediaWiki instances that have a significantly smaller number of daily edits and especially a very small number of editors and tools able / happy to deal with spam.
Hmm. I was playing with NLP-ish automated new-page patrol recently. One thing that crossed my mind was that, if it becomes too expensive, one could run the classifier in the job queue (and hence on one or more dedicated servers) and tag changes shortly after the fact.
Last of all, I would suggest you also read up on others who have applied machine learning to vandalism detection, in particular User:Cluebot_NG (http://en.wikipedia.org/wiki/User:Cluebot_NG). There is also a list of academic papers on the subject at http://en.wikipedia.org/w/index.php?title=User:Emijrp/Anti-vandalism_bot_cen... said, an extension like you are proposing does not have to be as good as the rather complex state of the art in order to be useful. Any effective system would probably be quite useful).
-bawolff
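The "classify in the job queue and tag after the fact" idea above can be sketched as follows. This is an illustration only: it uses Python's standard `queue` and `threading` modules rather than MediaWiki's actual job queue, and `classify`, `run_worker`, `process_edits`, and the `possible-spam` tag are hypothetical names with a trivial stand-in classifier.

```python
import queue
import threading


def classify(text):
    # Hypothetical stand-in for the real classifier: flags one obvious phrase.
    return "cheap pills" in text.lower()


def run_worker(edits, tags):
    """Drain the queue, tagging suspicious edits after they have been saved."""
    while True:
        item = edits.get()
        if item is None:  # sentinel: no more work
            edits.task_done()
            break
        rev_id, text = item
        if classify(text):
            tags[rev_id] = "possible-spam"
        edits.task_done()


def process_edits(saved_edits):
    """Enqueue already-saved edits and return the tags the worker assigned."""
    edits = queue.Queue()
    tags = {}
    worker = threading.Thread(target=run_worker, args=(edits, tags))
    worker.start()
    for edit in saved_edits:
        # The save path only enqueues; it never waits on the classifier.
        edits.put(edit)
    edits.put(None)
    worker.join()
    return tags
```

The point of the design is that saving an edit costs only an enqueue, while the expensive classification runs elsewhere and merely tags the revision, so nothing is blocked or reverted automatically.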
On Sat, Apr 13, 2013 at 2:42 AM, Brian Wolff bawolff@gmail.com wrote:
Qgill wrote:
It might have a performance penalty in a site like English Wikipedia with plenty of concurrent edits, but for starters it could be potentially useful to the 99% of MediaWiki instances that have a significantly smaller number of daily edits and especially a very small number of editors and tools able / happy to deal with spam.
Hmm. I was playing with NLP-ish automated new-page patrol recently. One thing that crossed my mind was that, if it becomes too expensive, one could run the classifier in the job queue (and hence on one or more dedicated servers) and tag changes shortly after the fact.
We have Parsoid running separately, don't we? Perhaps the same approach could work here as well.
-bawolff _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-- Regards, Павел Селіцкас/Pavel Selitskas Wizardist @ Wikimedia projects
Hi Platonides,
On Sat, Apr 13, 2013 at 4:04 AM, Platonides Platonides@gmail.com wrote:
On 09/04/13 18:20, Quim Gil wrote:
Hi Anubhav,
I have done a first reality check with Chris Steipp, who oversees the area of security and also spam prevention. Your idea is interesting and it seems to be feasible. This is a very good first step!
It would require adding a hook to MediaWiki core, but this could be a small, acceptable change.
I agree. Adding a hook is no problem.
The rest could be developed as an extension of the ConfirmEdit extension.
I'm not sure on adding it to ConfirmEdit. I would develop it as an independent extension, which could then hook into ConfirmEdit or AbuseFilter.
Anubhav wrote:
Tasks
Create a tool for wiki users to report spam: a simple way to train a Bayesian DB. This should be accessible to any user with permission to "undo" or "rollback" those changes or to delete the new page/file. Understand the metadata (IP, links, user) I can extract from the data (perhaps harnessing other services like blacklists).
I think it would be more interesting if it could be trained automatically. Perhaps by automatically learning rollbacks as "wrong". Maybe there could be a checkbox to "train as spam" when doing a revert, but I would avoid anything complex like "Go to Special:TrainSpam and enter the revision number to mark as spam".
I don't think we could take rollbacks into account for automated learning. Someone who rolls back an edit has not necessarily done so because it was spam.
A "Train as spam" checkbox is a good idea, though. I was thinking of a "report spam" button next to the "edit" button in the top-right corner of a section.
Good luck!
On 14/04/13 15:41, anubhav agarwal wrote:
I don't think we could take rollbacks into account for automated learning. Someone who rolls back an edit has not necessarily done so because it was spam.
Getting the right data to train from is hard, since a wiki is so flexible. The good points of rollback are that a) it's easy to detect, b) it's restricted (a random user can't use it), and c) on some wikis policy restricts its use to “clearly bad edits”.
So you _should_ be training with "unwanted edits". But there will be false positives.
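One way to picture the trade-off described here is a small labelling rule that turns revision history into training labels. This is a sketch under assumptions: the `Revision` fields, the `training_label` function, and the `trust_rollback` switch are hypothetical, and treating every non-reverted edit as acceptable is itself an assumption that would need checking on a real wiki.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Revision:
    rev_id: int
    text: str
    rolled_back: bool = False     # reverted with the restricted rollback tool
    marked_as_spam: bool = False  # reverter ticked a "train as spam" checkbox


def training_label(rev: Revision, trust_rollback: bool = False) -> Optional[bool]:
    """Return True (spam), False (not spam), or None (do not train on this edit)."""
    if rev.marked_as_spam:
        # An explicit human label is the strongest signal.
        return True
    if rev.rolled_back:
        # Rollback only says "unwanted". Count it as spam only where policy
        # restricts rollback to clearly bad edits; otherwise skip the edit
        # rather than risk a false positive in the training data.
        return True if trust_rollback else None
    # Assumption: edits that were never reverted are treated as acceptable.
    return False
```

Setting `trust_rollback=True` corresponds to wikis whose policy limits rollback to clearly bad edits; leaving it off trades training volume for cleaner labels.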
A "Train as spam" checkbox is a good idea, though. I was thinking of a "report spam" button next to the "edit" button in the top-right corner of a section.
However, that only tells you that "somewhere in the page there is spam", not what the spam is (the last revision? an edit from 2 months ago?), nor does it encourage fixing it.
I was thinking of creating a job queue for big websites like Wikipedia: each edit would go into a queue to be processed offline, and would later be rolled back to the original content if it triggers the alarm.
I'm not a big fan of this. You will have edit conflicts to handle, and it looks messy to have reverts made by an extension. I recommend that you work on the Bayesian detection of spam, and leave the potential refactoring to make it work through the job queue for later.
I think I could look in the archives of deleted pages from the WM-ES wiki for spam data for you.
Hey Quim,
Thanks for such a detailed response. Sorry for being inactive for these few days, I was undergoing some coursework evaluations.
On Tue, Apr 9, 2013 at 9:50 PM, Quim Gil qgil@wikimedia.org wrote:
Hi Anubhav,
On 04/07/2013 06:05 PM, anubhav agarwal wrote:
Hi,
I am Anubhav Agarwal, a B.Tech 4th-year student at IIT Roorkee. I wish to apply for GSoC 2013, and I am considering a Bayesian spam filter as my project. I have drafted the idea on my talk page (http://www.mediawiki.org/wiki/User:Anubhav_iitr).
I have done a first reality check with Chris Steipp, who oversees the area of security and also spam prevention. Your idea is interesting and it seems to be feasible. This is a very good first step!
It would require adding a hook to MediaWiki core, but this could be a small, acceptable change. The rest could be developed as an extension of the ConfirmEdit extension.
It might have a performance penalty in a site like English Wikipedia with plenty of concurrent edits, but for starters it could be potentially useful to the 99% of MediaWiki instances that have a significantly smaller number of daily edits and especially a very small number of editors and tools able / happy to deal with spam.
I was thinking of creating a job queue for big websites like Wikipedia: each edit would go into a queue to be processed offline, and would later be rolled back to the original content if it triggers the alarm.
As a next step, please
- Create a subpage for your proposal e.g. http://www.mediawiki.org/wiki/User:Anubhav_iitr/Bayesan_spam_filter
- File an enhancement request at https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions under "Extensions requests" explaining your proposal and linking to the related wiki page.
- Reply to this thread sharing the link to the bug report so anybody interested can watch it.
Here is the link for the bug, as you said: https://bugzilla.wikimedia.org/show_bug.cgi?id=47207
I request you to go through this and give your suggestions on it.
Yes, but you will get more feedback if you are diligent in answering the feedback you receive:
http://www.mediawiki.org/wiki/User_talk:Anubhav_iitr :)
-- Quim Gil Technical Contributor Coordinator @ Wikimedia Foundation http://www.mediawiki.org/wiki/User:Qgil
On 04/14/2013 06:34 AM, anubhav agarwal wrote:
Hey Quim,
Thanks for such a detailed response. Sorry for being inactive for these few days, I was undergoing some coursework evaluations.
I hope they went well. First things first!
You have some homework to do here as well. It is time to start drafting your application, open a related feature request in Bugzilla and find a mentor. See
https://www.mediawiki.org/wiki/Mentorship_programs/Application_template
Hey Quim,
I have drafted my proposal on my user page: https://www.mediawiki.org/wiki/User:Anubhav_iitr. I have already opened a bug in Bugzilla for the extension request. Here is the link: https://bugzilla.wikimedia.org/show_bug.cgi?id=47207.
I will be glad to have your feedback. Can you suggest whom I should ask to mentor me?
On 04/23/2013 05:42 AM, anubhav agarwal wrote:
Hey Quim,
I have drafted my proposal on my user page: https://www.mediawiki.org/wiki/User:Anubhav_iitr. I have already opened a bug in Bugzilla for the extension request. Here is the link: https://bugzilla.wikimedia.org/show_bug.cgi?id=47207.
I will be glad to have your feedback. Can you suggest whom I should ask to mentor me?
Chris is willing to co-mentor, but not alone. I asked another potential co-mentor but we are still waiting for his answer. Anybody interested? MediaWiki extension development skills required.
In any case, please apply to GSoC formally. You don't need to have the mentors assigned to do this and you can keep improving your proposal until the deadline.
wikitech-l@lists.wikimedia.org