On Jan 9, 2008 12:49 PM, Noah Salzman <nds@salzman.net> wrote:
> This area is ripe for exploration. Has anyone looked into "Summer of Code" type projects for this sort of thing? The signatures for the great majority of vandalism are not difficult to understand.
But difficult to obtain without flooding. As the developer of two vandal-fighting tools (one still unreleased), I can tell you that the hardest part of building such a tool is not the AI but keeping its network usage efficient. You can't download five diffs for every edit you see on browne, especially not in a tool meant to be used by many users at once; the www servers would probably choke. (I know there is quite a caching server farm, but to my knowledge diff pages are not cached much, and I don't think anything is cached for logged-in users.)
Then there's the fact that diffs aren't even available in an easily parsable format: we have to download a page full of HTML and rip it apart. Show me a developer who *wants* to code to that spec.
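To make the pain concrete, here is a rough illustration of the screen-scraping this forces on tool authors: pulling added and removed text out of a MediaWiki diff page. The CSS class names (diff-addedline, diff-deletedline) match what MediaWiki emits, but nothing guarantees they stay stable, which is exactly the problem.

```python
from html.parser import HTMLParser

class DiffScraper(HTMLParser):
    """Rip the added/removed text out of a MediaWiki HTML diff table."""

    def __init__(self):
        super().__init__()
        self._current = None   # 'added', 'removed', or None
        self.added = []        # text from inserted lines
        self.removed = []      # text from deleted lines

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cls = dict(attrs).get("class", "")
            if "diff-addedline" in cls:
                self._current = "added"
            elif "diff-deletedline" in cls:
                self._current = "removed"

    def handle_endtag(self, tag):
        if tag == "td":
            self._current = None

    def handle_data(self, data):
        if self._current == "added":
            self.added.append(data)
        elif self._current == "removed":
            self.removed.append(data)

# A trimmed-down fragment in the shape of a real diff page:
sample = (
    '<tr><td class="diff-deletedline"><div>The old text.</div></td>'
    '<td class="diff-addedline"><div>BUY CHEAP PILLS</div></td></tr>'
)
scraper = DiffScraper()
scraper.feed(sample)
print(scraper.added)    # text scraped from the added side
```

Dozens of lines of stateful parsing just to get at two strings of text; and the whole thing silently breaks the day the skin markup changes.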
What we need is a MediaWiki query API for obtaining the unformatted diff of a revision, with the ability to batch several requests into one. Even then we'd be talking about quite a bit of traffic (especially if the system runs on many users' machines), but far less of it, and in a format much better suited to analysis.
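A sketch of what such a batched diff query might look like on the client side. The parameter names here (prop=diffs, revids) are hypothetical; this is the API being wished for, not one that exists today.

```python
from urllib.parse import urlencode

def diff_query_url(revids, endpoint="https://en.wikipedia.org/w/api.php"):
    """Build one request that asks for several unformatted diffs at once."""
    params = {
        "action": "query",
        "prop": "diffs",                              # hypothetical module
        "revids": "|".join(str(r) for r in revids),   # batched, pipe-separated
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)

url = diff_query_url([186921004, 186921388, 186921519])
print(url)
```

One HTTP round trip for N diffs, and a JSON body instead of a rendered page: that alone would cut the load on the www servers by a large factor.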
Really, once we have some easy and efficient way to get diffs, it's just a matter of forking SpamAssassin and writing some quality rules. :)
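The SpamAssassin idea applied to diffs boils down to something like this toy: each rule is a regex with a score, and an edit whose total crosses a threshold gets flagged. The rules, scores, and threshold here are invented purely for illustration.

```python
import re

# (pattern, score, rule name) -- made-up rules for illustration only
RULES = [
    (re.compile(r"\b[A-Z]{6,}\b"), 1.5, "SHOUTING"),        # long all-caps runs
    (re.compile(r"(.)\1{9,}"), 2.0, "CHAR_FLOOD"),          # e.g. "aaaaaaaaaa"
    (re.compile(r"https?://", re.I), 0.5, "EXTERNAL_LINK"), # links in added text
]
THRESHOLD = 2.5

def score_added_text(text):
    """Return (total score, names of rules that fired) for a diff's added text."""
    total, hits = 0.0, []
    for pattern, score, name in RULES:
        if pattern.search(text):
            total += score
            hits.append(name)
    return total, hits

total, hits = score_added_text("OMGOMG look!!! aaaaaaaaaaaa http://spam.example")
print(total >= THRESHOLD, hits)
```

The appeal of the SpamAssassin model is that the rules live outside the code: non-programmers can tune scores and add patterns as vandals adapt, exactly as mail admins do with spam rules.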