Here is my story: I host about 10 websites with MediaWiki. I noticed an anonymous user from France spamming Current Events, with all the spam crap hidden in a very small div. My reaction was to delete everything and install a spam blocker (no more anonymous editing, the extension, etc.). All of my wikis were spammed, and after doing some research on the net myself, I found a lot of other beginners' wikis spammed too.
My Google ranking dropped from 4 to 0 within one week, even though I had removed all the crap.
Now I have realized that Google also looks at all the old versions and deleted pages!
So here is the question: how can I safely remove all the spam from the database without leaving any link to an old version? Can I just delete the blob in MySQL?
Thanks for helping.
Unfortunately all wikis are going to be abused this way; even when you react immediately, the ranking drops because of the old versions!
Thanks for any hint,
Andres
On 12/30/05, Andres Obrero andres@holzapfel.ch wrote:
So here is the question: how can I safely remove all the spam from the database without leaving any link to an old version? Can I just delete the blob in MySQL?
I don't know of a way to:
* bulk-remove edits by a particular user
* remove certain content
However, to block future spam like this, I recommend that you edit your LocalSettings.php and add either:
$wgSpamRegex = "/<div/";
or, if you need to use <div> tags, try:
$wgSpamRegex = "/overflow:\s*auto/";
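Put concretely, a minimal sketch of how that might look in LocalSettings.php (the patterns are just the two above; test against your own legitimate pages first, since "/<div/" also rejects good-faith edits that contain divs):

  # LocalSettings.php (excerpt) -- a sketch, not a drop-in config.
  # $wgSpamRegex is a regular expression matched against the text of
  # every saved edit; if it matches, the save is rejected.
  $wgSpamRegex = "/<div/";

  # ...or, if your users legitimately need <div> tags, use only the
  # narrower pattern that catches the hidden-overflow spam divs:
  # $wgSpamRegex = "/overflow:\s*auto/";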
I also recommend the SpamBlacklist extension (although I couldn't get it to work 100%, it still works fairly well for me) http://meta.wikimedia.org/wiki/SpamBlacklist_extension
Andres Obrero wrote:
Now I have realized that Google also looks at all the old versions and deleted pages!
No, it doesn't. To begin with, old versions are specifically marked so that spiders won't index them. Deleted pages aren't accessible to an outside spider at all.
If your robots.txt is not set up to keep robots from visiting those pages, you should set that up as well, though that is just to keep useless load off the server.
-- brion vibber (brion @ pobox.com)
On 12/30/05, Brion Vibber brion@pobox.com wrote: (snip)
To begin with, old versions are specifically marked so that spiders won't index them. Deleted pages aren't accessible to an outside spider at all.
If your robots.txt is not set up to keep robots from visiting those pages, you should set that up as well, though that is just to keep useless load off the server.
So does this mean that I /must/ tweak my own robots.txt to ensure that robots don't crawl history, or that I /don't need to/?
I had heard that there is a proper meta tag (or something similar) to tell spiders not to delve into old revisions. Where can I learn more about this?
I visited the meta page, but it doesn't go into detail: http://meta.wikimedia.org/wiki/Robots.txt
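For what it's worth, a minimal robots.txt along the lines that page suggests might look like the sketch below. It assumes the common setup where article URLs are rewritten to /wiki/Page_title while edit, history and diff links go through /w/index.php; the paths are examples and have to match your own installation:

  # robots.txt -- sketch only; adjust the paths to your installation.
  User-agent: *
  # Keep crawlers out of the script path, where action=history,
  # action=edit and old-revision URLs live:
  Disallow: /w/
  # Anything not disallowed (e.g. the /wiki/ article URLs) stays crawlable.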