Re: [WikiEN-l] Harassment sites

27 Oct 2007

Andrew Gray wrote:
...
  Mmm. We get quite a few of these (and if we're unlucky, it's squatted
  by a pornsite). Is there any practical way of spidering through our
  links to check for these?

Interesting question. I could think of two ways.

One would be to take a large sample of domain names, check them before 
and after expiration, and develop some sort of fingerprint for the 
squatters. E.g., IP hosting blocks, DNS servers, WHOIS records, page 
content, page links, or server info.

The other would be to crawl all our external links and check for 
significant changes in the pages after WHOIS changes (or perhaps major 
nameserver changes, if we can't find a source for bulk WHOIS queries). I 
think we could get at significance by using our article pages to 
recognize important words or word frequency patterns on the linked pages 
and noting significant deviations.
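
One way to get at "significant deviation" might be a plain word-count 
comparison between the article text and the linked page. This is only a 
sketch: cosine similarity is my choice of measure here, and the 0.1 
threshold is an arbitrary assumption that would need tuning on real data.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count runs of ASCII letters as words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def looks_off_topic(article_text, linked_text, threshold=0.1):
    """Flag a linked page whose vocabulary barely overlaps the article's."""
    sim = cosine_similarity(word_counts(article_text), word_counts(linked_text))
    return sim < threshold
```

In practice one would probably want to drop common stop words first, so 
that "the" and "of" don't make every pair of pages look similar.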

The lamer version would just be to make a list of links to domains that 
appear to have changed hands recently. That'd have a higher error rate, 
but would be pretty easy to build.
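
The easy version could be as small as the following sketch. It assumes 
we already have a mapping from hostname to the date its registration 
last changed (the `changed_on` dict is hypothetical; getting that data 
is the hard part), and the 90-day window is an arbitrary choice.

```python
from datetime import date, timedelta
from urllib.parse import urlsplit


def recently_changed_links(links, changed_on, today, window_days=90):
    """Return the links whose host changed hands within the last window_days.

    links      -- iterable of external link URLs
    changed_on -- dict mapping hostname -> date of last registration change
    today      -- the date to measure the window from
    """
    cutoff = today - timedelta(days=window_days)
    flagged = []
    for url in links:
        host = urlsplit(url).hostname  # None for malformed or relative URLs
        changed = changed_on.get(host)
        if changed is not None and changed >= cutoff:
            flagged.append(url)
    return flagged
```

This matches on the exact hostname only, so a link to a subdomain of a 
changed domain would be missed unless the mapping lists it.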

William

-- 
William Pietri <william(a)scissor.com>
http://en.wikipedia.org/wiki/User:William_Pietri
