On Fri, 24 May 2013, Daniel Friesen wrote:
.. The proper way to deal with this spam is not by IP but by content. We need some people who are knowledgeable about matching spam by training programs with spam and non-spam. ..
Well, Daniel, I have some ideas about how to realize the automatic analysis of the content of articles and to qualify some of them as spam. They are based on the TORI axioms, and I am not sure whether this is the right place to describe them. I would rather try to do it myself, but I have no experience programming in PHP or writing robots. (My best achievements: I have written a few PHP scripts, and I once killed a hundred users through MySQL with a single command; I am not sure an intelligent robot should use such a brutal method.) In order to participate in the project, I need some help from the professionals. Namely, I need somebody to post a detailed tutorial describing the basic "plug-in" and "plug-out", with very simple examples:

1. Code that opens the wiki, downloads the list of new pages, and saves the list as a text file in the working directory.
2. Code that opens a specific page for editing and saves its source in the working directory.
3. Code that opens a specific page for editing and replaces its content with a prepared source file from the working directory.
4. Code that opens the corresponding discussion page for editing and adds a warning there.
5. Code that blocks a specific user.
6. Code that removes a specific page.
7. Code that collects all the complaints about its activity and transfers them to the human administrator.
8. Code that performs a Google search and saves the results as a text file.

The spammers already have these examples; it would be good to supply the same tools to the colleagues who handle wikis. The samples mentioned above should be short, preferably one line each. They should be optimized not for the best performance, but for the easiest understanding by a human; in particular, no loops or complicated logical expressions should be involved. The rest I plan to write in C++, which seems to be faster than PHP and (more importantly) with which I am more familiar. The goal is a robot-admin, a robot-editor, that would be indistinguishable from an intelligent professional human and that follows an explicitly formulated, transparent editorial policy. If it succeeds, you will be able to rewrite it from C++ into PHP and optimize it for MediaWiki.
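For orientation, requests 1 to 6 might look roughly like the sketch below, written against the MediaWiki action API (api.php). Everything concrete in it is a placeholder or an assumption: the wiki URL https://example.org/w/api.php, the page titles, the file names and the cookie path are invented, the robot account is assumed to be already logged in with the rights to edit, block and delete, and the token query assumes a reasonably recent MediaWiki. It is a sketch to be checked against the API documentation, not a finished tool.

<?php
// Sketch of a maintenance robot talking to the MediaWiki action API.
// All URLs, titles and file names are hypothetical placeholders.

$api = 'https://example.org/w/api.php';   // assumed location of api.php
$dir = __DIR__;                           // working directory for saved files

// 1. Download the list of newly created pages and save it as a text file.
$new = json_decode(file_get_contents(
    $api . '?action=query&list=recentchanges&rctype=new&rclimit=50&format=json'
), true);
$titles = array();
foreach ($new['query']['recentchanges'] as $rc) {
    $titles[] = $rc['title'];
}
file_put_contents("$dir/new_pages.txt", implode("\n", $titles));

// 2. Fetch the wikitext source of one specific page and save it.
$page = 'Some article';                   // hypothetical title
$resp = json_decode(file_get_contents(
    $api . '?action=query&prop=revisions&rvprop=content&format=json&titles='
         . urlencode($page)
), true);
$rev = reset($resp['query']['pages']);
file_put_contents("$dir/page_source.txt", $rev['revisions'][0]['*']);

// 3.-6. Editing, warning, blocking and deleting all go through POST requests
// (action=edit, action=block, action=delete). They need a logged-in session
// (cookies) and a CSRF token; the login step itself is omitted here.
function mw_post($api, array $params) {
    $ch = curl_init($api);
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($params),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => '/tmp/bot_cookies',  // keeps the session
        CURLOPT_COOKIEFILE     => '/tmp/bot_cookies',
    ));
    $out = curl_exec($ch);
    curl_close($ch);
    return json_decode($out, true);
}

// CSRF token (action=query&meta=tokens on recent MediaWiki versions).
$tok  = mw_post($api, array('action' => 'query', 'meta' => 'tokens', 'format' => 'json'));
$csrf = $tok['query']['tokens']['csrftoken'];

// 3. Replace the content of the page with a prepared source file.
mw_post($api, array(
    'action'  => 'edit', 'title' => $page, 'token' => $csrf, 'format' => 'json',
    'text'    => file_get_contents("$dir/replacement.txt"),
    'summary' => 'robot: replacing suspected spam',
));

// 4. Append a warning to the corresponding discussion page.
mw_post($api, array(
    'action'     => 'edit', 'title' => "Talk:$page", 'token' => $csrf, 'format' => 'json',
    'appendtext' => "\n== Robot warning ==\nThis page was flagged as possible spam. ~~~~",
    'summary'    => 'robot: warning',
));

// 5. Block a specific user (needs the block right).
mw_post($api, array(
    'action' => 'block', 'user' => 'SpamAccount', 'reason' => 'spam',
    'token'  => $csrf, 'format' => 'json',
));

// 6. Delete a specific page (needs the delete right).
mw_post($api, array(
    'action' => 'delete', 'title' => $page, 'reason' => 'spam',
    'token'  => $csrf, 'format' => 'json',
));
?>

Item 7 would amount to fetching the robot's own talk page with the same prop=revisions query, and item 8 depends on whatever external search API is available, so both are left out of the sketch.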
Also, it would be good to arrange an option so that a new page, by default, opens with certain content from a sample page (for example, http://mizugadro.mydns.jp/o/index.php/SamplePage ) that helps the human to provide the necessary elements of a good article: preamble, introduction, definition(s), description of the new concept(s), support of the suggested concept, criticism of the suggested concept, ways of refuting the suggested concept, humor about the concept, conclusion, references, keywords, categories. Then any article that lacks the elements above should be qualified as spam and treated accordingly.
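For this last suggestion, MediaWiki already provides most of the mechanism: when the edit form of a page that does not yet exist is opened with a preload parameter, the text of the named page is copied into the edit box. A hypothetical link of this kind (the wiki URL is a placeholder) would be

https://example.org/w/index.php?title=New_article&action=edit&preload=SamplePage

and the InputBox extension can generate such links from a "create an article" box, so that every new article starts from the sample skeleton; whether the skeleton is then actually filled in would still have to be checked by the robot.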