Hi Guys,
I was trying to fix this bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=1542. I am a newbie to MediaWiki and it's the first bug I'm trying to solve, so I don't know much. I want to know about the spam block list: how it works, how it triggers the action, and its logging mechanism. It would be great if someone could help me fix this bug.
Cheers, Anubhav
Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee
Hey,
I don't know much about that, or how much you know, but at the very least I can tell you that the bug is in Extension:SpamBlacklist, which can be found at http://www.mediawiki.org/wiki/Extension:SpamBlacklist. From what I can see in the code, it seems to just use various hooks in MediaWiki to stop editing, e-mailing, etc. if the request matches its parsed blacklist.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On Mon, Feb 25, 2013 at 2:17 PM, anubhav agarwal anubhav914@gmail.com wrote:
[snip]
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
That's an ambitious first bug, Anubhav!
Since this is an extension, it plugs into MediaWiki core using hooks. At defined points, the core code runs all of the functions registered for a particular hook, so that extensions can interact with the logic. In this case, SpamBlacklist has registered SpamBlacklistHooks::filterMerged to run whenever an editor attempts to save a page, or SpamBlacklistHooks::filterAPIEditBeforeSave if the edit came in through the API. So that is where you will want to log.
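To make the hook idea concrete, here is a rough Python model of it. This is only a sketch of the pattern, not MediaWiki's PHP code: the names `register`, `run_hooks`, `filter_merged` and `save_page` are made up for illustration, and the rule that a handler vetoes the save by returning False is an assumption of this toy model.

```python
from collections import defaultdict

# Toy registry: hook name -> list of callbacks (a model of the idea, not MediaWiki code).
hooks = defaultdict(list)


def register(hook_name, callback):
    """Attach a callback to a named hook."""
    hooks[hook_name].append(callback)


def run_hooks(hook_name, *args):
    """Call every callback registered for the hook; stop if one returns False."""
    for callback in hooks[hook_name]:
        if callback(*args) is False:
            return False
    return True


log = []  # stand-in for whatever log the fix ends up writing to


def filter_merged(title, text):
    # Hypothetical stand-in for SpamBlacklistHooks::filterMerged.
    if "spam.example" in text:
        log.append((title, "blocked"))  # this is the place where a hit can be logged
        return False  # veto the save
    return True


register("EditFilterMerged", filter_merged)


def save_page(title, text):
    # "Core" asks the registered extensions before saving.
    return run_hooks("EditFilterMerged", title, text)
```

Calling `save_page("Sandbox", "hello")` goes through, while a text containing `spam.example` is vetoed and leaves one entry in `log`.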
Although MediaWiki has a logging feature, it sounds like you may want to add your own logging table (like the AbuseFilter extension does). If you do that, make sure that you're only storing data that you really need and that is OK under our privacy policy (so no IP addresses!).
Feel free to add me as a reviewer when you submit your code to Gerrit.
Chris
On Mon, Feb 25, 2013 at 11:21 AM, Tyler Romeo tylerromeo@gmail.com wrote:
[snip]
Hey Chris,
I was exploring the SpamBlacklist extension. I have some doubts; I hope you can clear them up.
Is there any place I can get documentation for the class SpamBlacklist in the file SpamBlacklist_body.php?
In the function filter, what do the following variables represent?
$title, $text, $section, $editpage, $out
I have understood the following things from the code; please correct me if I am wrong. It extracts the edited text and parses it to find the links. It then replaces the links which match the whitelist regex, and then checks if there are some links that match the blacklist regex. If the check is greater, it returns the content matched. It already writes to the debug log if it finds a match.
I guess the bug aims at creating an SQL table. I was thinking of logging the following fields: Title, Text, User, URLs, IP. I don't understand why you ruled out IP addresses.
On Tue, Feb 26, 2013 at 1:25 AM, Chris Steipp csteipp@wikimedia.org wrote:
[snip]
On 07/03/13 21:03, anubhav agarwal wrote:
> Hey Chris
> I was exploring the SpamBlacklist extension. I have some doubts; I hope you can clear them up.
> Is there any place I can get documentation for the class SpamBlacklist in the file SpamBlacklist_body.php?
> In the function filter, what do the following variables represent?
> $title
Title object (includes/Title.php). This is the page the user tried to save.
> $text
The text being saved in the page/section.
> $section
Name of the section, or ''.
> $editpage
EditPage object if EditFilterMerged was called, null otherwise.
> $out
A ParserOutput object (actually, this variable name was a bad choice; it looks like an OutputPage). See includes/parser/ParserOutput.php.
> I have understood the following things from the code; please correct me if I am wrong. It extracts the edited text and parses it to find the links.
Actually, it uses the fact that the parser will already have processed the links, so in most cases it just obtains that information.
> It then replaces the links which match the whitelist regex,
This doesn't make sense as you explain it. It builds a list of links and replaces whitelisted ones with '', i.e. it removes whitelisted links from the list.
> and then checks if there are some links that match the blacklist regex.
Yes.
> If the check is greater, it returns the content matched.
Right, $check will be non-0 if the links matched the blacklist.
> It already writes to the debug log if it finds a match.
Yes, but that is a private log. Bug 1542 talks about making that accessible in the wiki.
> I guess the bug aims at creating an SQL table. I was thinking of logging the following fields: Title, Text, User, URLs, IP. I don't understand why you ruled out IP addresses.
Because we don't like to publish the IPs *in the wiki*.
I think the approach should be to log matches using the AbuseFilter extension if that one is loaded. I concur that it seems too hard to begin with.
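The filtering steps discussed above (take the list of links, drop the whitelisted ones, test the rest against the blacklist) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the extension's PHP code: the function name `find_blacklisted`, the example URLs and the patterns are all hypothetical, and the list of links is assumed to be available already (the extension gets it from the parser output).

```python
import re


def find_blacklisted(links, whitelist_patterns, blacklist_patterns):
    """Return the links that match the blacklist after whitelisted ones are dropped."""
    whitelist = [re.compile(p, re.IGNORECASE) for p in whitelist_patterns]
    blacklist = [re.compile(p, re.IGNORECASE) for p in blacklist_patterns]
    # Step 1: remove whitelisted links from the list.
    remaining = [link for link in links if not any(w.search(link) for w in whitelist)]
    # Step 2: keep whatever still matches a blacklist pattern.
    return [link for link in remaining if any(b.search(link) for b in blacklist)]


# Example input: the links are assumed to have been collected already.
links = [
    "http://good.example.org/page",
    "http://spam.example.com/pills",
    "http://docs.spam.example.com/allowed",
]
matches = find_blacklisted(
    links,
    whitelist_patterns=[r"docs\.spam\.example\.com"],
    blacklist_patterns=[r"spam\.example\.com"],
)
```

A non-empty `matches` plays the role of the non-0 $check: the edit would be rejected, and the matched URLs are what a public log entry could record.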
On Thu, Mar 7, 2013 at 1:34 PM, Platonides Platonides@gmail.com wrote:
> On 07/03/13 21:03, anubhav agarwal wrote:
>> Is there any place I can get documentation for the class SpamBlacklist in the file SpamBlacklist_body.php?
There really isn't any documentation besides the code, but there are a couple more things you should look at. Notice that in SpamBlacklist.php there is the line "$wgHooks['EditFilterMerged'][] = 'SpamBlacklistHooks::filterMerged';", which is how SpamBlacklist registers itself with MediaWiki core to filter edits. When MediaWiki core runs the EditFilterMerged hooks (which it does in includes/EditPage.php, line 1287), all of the functions that extensions have registered for that hook are run with the passed-in arguments, so SpamBlacklistHooks::filterMerged is run. SpamBlacklistHooks::filterMerged then just sets up and calls SpamBlacklist::filter. So that is where you can start tracing what is actually in the variables, in case Platonides' summary wasn't enough.
>> It already writes to the debug log if it finds a match.
> Yes, but that is a private log. Bug 1542 talks about making that accessible in the wiki.
Yep. For example, see
* https://en.wikipedia.org/wiki/Special:Log
* https://en.wikipedia.org/wiki/Special:AbuseLog
>> I guess the bug aims at creating an SQL table. I was thinking of logging the following fields: Title, Text, User, URLs, IP. I don't understand why you ruled out IP addresses.
> Because we don't like to publish the IPs *in the wiki*.
The WMF privacy policy also discourages us from keeping IP addresses longer than 90 days, so if you do keep IPs, you need a way to hide or purge them. And if they allow someone to see which IP address a particular username was using, then only users with checkuser permissions are allowed to see that. So it would be easier for you not to include it, but if it's desired, you'll just have to build those protections too.
> I think the approach should be to log matches using the AbuseFilter extension if that one is loaded.
The AbuseFilter log format has a lot of data in it that is specific to AbuseFilter, and it is used to re-test abuse filters, so adding these hits to that log might cause some issues. I think either the general log or a separate, new log table would be best. Just for some numbers: in the first 7 days of this month, we've had an average of 27,000 hits each day. So if this goes into an existing log, it's going to generate a significant amount of data.
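As a rough illustration of the "separate, new log table" option and of the data volume mentioned above, here is a small Python sketch using an in-memory SQLite database. Everything in it is an assumption made for the example: the table name `spam_hit_log`, its columns, and the sample row are hypothetical, and a real implementation would be a MediaWiki schema change in PHP/SQL. The table deliberately has no IP column, so no purge or checkuser restriction is needed.

```python
import sqlite3

HITS_PER_DAY = 27_000  # average quoted above for the first 7 days of the month


def create_log(conn):
    # Hypothetical table: no IP column, so there is nothing to purge or restrict.
    conn.execute(
        "CREATE TABLE spam_hit_log ("
        " ts TEXT NOT NULL,"
        " user_name TEXT NOT NULL,"
        " page_title TEXT NOT NULL,"
        " url TEXT NOT NULL)"
    )


def log_hit(conn, ts, user_name, page_title, url):
    """Record one blacklist hit."""
    conn.execute(
        "INSERT INTO spam_hit_log (ts, user_name, page_title, url) VALUES (?, ?, ?, ?)",
        (ts, user_name, page_title, url),
    )


def projected_rows(days, hits_per_day=HITS_PER_DAY):
    """Rows the table would hold after `days` at a constant hit rate."""
    return days * hits_per_day


conn = sqlite3.connect(":memory:")
create_log(conn)
# Made-up sample row, for illustration only.
log_hit(conn, "2013-03-08T05:23:00Z", "ExampleUser", "Sandbox", "http://spam.example.com/")
row_count = conn.execute("SELECT COUNT(*) FROM spam_hit_log").fetchone()[0]
```

At a constant 27,000 hits per day, `projected_rows(90)` is 2,430,000 and `projected_rows(365)` is 9,855,000, which gives a feel for how quickly such a table grows.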
Hey Guys,
Thanks for explaining it to me. Can I have your IRC handles? I think I still have many doubts.
Is there a simpler bug related to extensions, so I can get an idea of how they work?
On Fri, Mar 8, 2013 at 5:23 AM, Chris Steipp csteipp@wikimedia.org wrote:
[snip]
csteipp. Feel free to ping me whenever.
On Mar 8, 2013 6:23 AM, "anubhav agarwal" anubhav914@gmail.com wrote:
[snip]
Hey Chris,
I added a debug line at the start of the filter function of the SpamBlacklist class:
wfDebugLog('SpamBlacklist', "testing");
but I wasn't able to see any such thing in the debug log in the debug toolbar.
Then I added the following line:
wfErrorLog("Error here", './error.log');
error.log is a file in the SpamBlacklist folder. I changed its permissions to 777, but still there was no output. Can you guide me on how I can debug the values of the parameters in the filter function?
Hi!
There is a proposed change to another extension (aiming to include logs for actions which trigger the CAPTCHA, aka bug 41522), which, although not finished, may be of some help: https://gerrit.wikimedia.org/r/#/c/40553/
Helder