Are you good in swearing? WE NEED YOU
Huggle 3 comes with vandalism prediction: it precaches diffs, including their contents, even before they are enqueued. Each edit has a so-called "score", a numerical value; the higher it is, the more likely the edit is vandalism.
If you want to help us improve this feature, we need a "score words" list defined for every wiki where Huggle is going to be used, for example the English Wikipedia.
Each list has the following syntax:
(see https://en.wikipedia.org/w/index.php?title=Wikipedia:Huggle/Config&diff=...)
score-words(score): list of words separated by commas; the list can contain newlines, but the commas must be present
Example:
score-words(200): these, are, some, words, which, presence, of, increases, the, score, each, word, by, 200,
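To make the syntax above concrete, here is a minimal Python sketch of how such a list could be read and applied. It is not Huggle's actual parser (Huggle itself is not written in Python). The names `parse_score_words` and `score_text` are hypothetical, and the sketch assumes that the text contains only score-words rules, that matching is case-insensitive, and that every occurrence of a listed word adds its score once.

```python
import re

# Matches the header of a rule such as "score-words(200):".
RULE_HEADER = re.compile(r"score-words\((\d+)\):")


def parse_score_words(config_text):
    """Return a dict mapping each listed word to its score (hypothetical helper)."""
    scores = {}
    headers = list(RULE_HEADER.finditer(config_text))
    for i, header in enumerate(headers):
        # Assumption: a word list runs until the next rule header or the end of the text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(config_text)
        body = config_text[header.end():end]
        for word in body.split(","):
            word = word.strip().lower()
            if word:  # skip the empty item left by a trailing comma
                scores[word] = int(header.group(1))
    return scores


def score_text(text, scores):
    """Sum the score of every listed word that occurs in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(scores.get(token, 0) for token in tokens)
```

Splitting on commas and ignoring empty items is what lets a list span several lines and end with a trailing comma, as in the example above.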
So, if you know English better than I do, which you likely do, go ahead and improve the configuration file there. No worries: Huggle's config parser is very tolerant of syntax errors.
If you have any other suggestions for improving Huggle's prediction, go ahead and tell us!
Perhaps we could use some Math here? Can we grab a list of the last, say, 100,000 edits reverted for vandalism, look at the diff, and compute a frequency score based on that? --scott
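A rough sketch of what such a frequency score could look like, in Python. This is only an illustration of the idea, not an existing tool. The function `word_scores`, its parameters and the scaling factor are hypothetical, and comparing against a second sample of edits that were kept is an added assumption: counting reverted edits alone would mostly rank common words such as "the" highest.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_scores(reverted_texts, kept_texts, min_count=2, scale=100):
    """Score words by how much more frequent they are in reverted edits."""
    reverted = Counter(w for t in reverted_texts for w in tokenize(t))
    kept = Counter(w for t in kept_texts for w in tokenize(t))
    reverted_total = sum(reverted.values())
    kept_total = sum(kept.values())
    vocabulary = len(set(reverted) | set(kept))
    scores = {}
    for word, count in reverted.items():
        if count < min_count:
            continue  # too rare to say anything reliable
        # Add-one smoothing keeps the ratio finite for words never seen in kept edits.
        p_reverted = (count + 1) / (reverted_total + vocabulary)
        p_kept = (kept[word] + 1) / (kept_total + vocabulary)
        score = round(scale * math.log(p_reverted / p_kept))
        if score > 0:
            scores[word] = score
    return scores
```

The resulting dictionary could then be grouped by score and written out as score-words lines.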
On Thu, Sep 19, 2013 at 7:19 AM, C. Scott Ananian cananian@wikimedia.org wrote:
Perhaps we could use some Math here? Can we grab a list of the last, say, 100,000 edits reverted for vandalism, look at the diff, and compute a frequency score based on that? --scott
This is pretty much what my GSoC student implemented in the Bayesian filter extension. If that gets some use, then those lists could easily be fed back.
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, Sep 19, 2013 at 11:19 AM, C. Scott Ananian cananian@wikimedia.org wrote:
Perhaps we could use some Math here? Can we grab a list of the last, say, 100,000 edits reverted for vandalism, look at the diff, and compute a frequency score based on that? --scott
I did something like that in JavaScript: https://github.com/he7d3r/mw-gadget-WordFrequencyOnRevertedEdits/
Helder
On 19/09/13 11:35, Petr Bena wrote: <snip>
The good thing about reinventing the wheel is that you can reuse existing material :-]
ClueBot NG has such a list: http://review.cluebot.cluenet.org and it is quite an active one: http://en.wikipedia.org/wiki/Special:Contributions/ClueBot_NG
It uses a variety of algorithms to determine the score of an edit: http://en.wikipedia.org/wiki/User:ClueBot_NG#Vandalism_Detection_Algorithm
Maybe get in touch with them and reuse their engine?
-- Antoine "hashar" Musso
Hi, cool, I was actually expecting someone to come up with suggestions like this. I didn't know about that, and now I do. In fact, closer cooperation with ClueBot is on the to-do list :-) Any good algorithm for scoring vandalism is appreciated. This might even be the first thing we should create hooks for, so that people can implement their own algorithms as either C++ or Python plugins that compute the score however they like... (unfortunately, I haven't managed to get the Python engine working for the Windows build yet)
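As an illustration of the plugin hook idea raised in the reply to Antoine's message, here is a purely hypothetical Python sketch. Huggle's real plugin interface is not described in this thread, so `register_scorer`, `total_score` and the example scorer `shouting` are invented names, and summing the contributions of all plugins is an assumption.

```python
from typing import Callable, Dict

# A scorer receives the text added by an edit and returns a score contribution.
Scorer = Callable[[str], int]

# Hypothetical registry of scoring plugins, keyed by plugin name.
_scorers: Dict[str, Scorer] = {}


def register_scorer(name: str, scorer: Scorer) -> None:
    """Register a scoring function under a name (hypothetical hook)."""
    _scorers[name] = scorer


def total_score(added_text: str) -> int:
    """Assumption: the final score is the sum of all plugin contributions."""
    return sum(scorer(added_text) for scorer in _scorers.values())


def shouting(added_text: str) -> int:
    """Example plugin: add 100 when more than 80% of the letters are uppercase."""
    letters = [c for c in added_text if c.isalpha()]
    if len(letters) < 10:
        return 0  # too short to judge
    upper = sum(1 for c in letters if c.isupper())
    return 100 if upper / len(letters) > 0.8 else 0


register_scorer("shouting", shouting)
```

With a registry like this, a word-list scorer and a statistical scorer could live side by side without either knowing about the other.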
On Thu, Sep 19, 2013 at 2:35 AM, Petr Bena benapetr@gmail.com wrote:
Are you good in swearing? WE NEED YOU
I know 7 words you can add ;-)
[[w:Seven dirty words]]
-Chad
On 19/09/13 10:35, Petr Bena wrote:
Are you good in swearing? WE NEED YOU
[[en:User:/DeltaQuad/UAA/Blacklist]] contains a fairly comprehensive overview of English-language profanity and general trash-talk formatted as regexps, mixed in with other non-sweary blocking patterns that are specific to that blacklist's needs.
Neil
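A blacklist written as regexps, like the one Neil mentions, could be applied to an edit roughly as follows. This is a minimal Python sketch under stated assumptions: the two patterns and their scores are made-up placeholders rather than entries from that blacklist, `regex_score` is a hypothetical name, and each pattern is assumed to contribute its score at most once per edit.

```python
import re

# Placeholder patterns, each paired with the score it adds when it matches.
PATTERNS = [
    # A word with stretched letters, e.g. "stuuupid".
    (re.compile(r"\bst+u+pi+d+\b", re.IGNORECASE), 200),
    # Any character repeated six or more times in a row, e.g. "aaaaaaaa".
    (re.compile(r"(.)\1{5,}"), 100),
]


def regex_score(text):
    """Add the score of every pattern that matches somewhere in the text."""
    return sum(score for pattern, score in PATTERNS if pattern.search(text))
```

Compared with a plain word list, regexps catch deliberate misspellings and repetition at the cost of being harder to review.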
About swear words in English: sorry, I can't help, but I'm very good at Persian :D We have an abuse filter for Persian swear words, which is hidden from the public: https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%98%D9%87:%D9%BE%D8%A7%D9%84%D8...
And it works pretty well, so if you need to i18n Huggle, this page will be a good help.
Best