On 17/09/13 13:59, Jon Robson wrote:
> I would suggest taking a look at the number of 404s caused by people trying to access pages without the wiki prefix.... This would be interesting data to go alongside this interesting proposal...
There are lots of different sorts of 404s, so it's necessary to do some filtering. For example:
* double-slashes, due to bug 52253
* sitemap.xml
* Apple touch icons
* bullet.gif in various directories
* vulnerability scanning, e.g. xmlrpc.php
* BlueCoat verify/notify, as described in http://www.webmasterworld.com/search_engine_spiders/3859463.htm
* Serial numbers like http://en.wikipedia.org/B008NAYASM
I filtered out everything with a dot or slash in the prospective article title, as well as the BlueCoat URLs and the UAs responsible for serial number URLs. To simplify analysis, I took log lines from the English Wikipedia only.
Most of the remaining log entries were search engine crawlers, so I took those out too.
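For anyone wanting to repeat this, the filtering steps above could be sketched roughly as follows. This is not the script that was actually run: the BlueCoat prefixes and the user-agent substrings are hypothetical placeholders, since the real patterns are not listed in this message, and the function name is made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical placeholders: the actual BlueCoat URL patterns and the
# excluded user agents (crawlers, serial-number requesters) are assumptions.
BLUECOAT_PREFIXES = ("verify-", "notify-")
EXCLUDED_UA_SUBSTRINGS = ("bot", "crawler", "spider")


def is_candidate(url, user_agent):
    """Return True if a 404 looks like an article title requested without /wiki/."""
    parts = urlsplit(url)
    # English Wikipedia only, to simplify analysis.
    if parts.hostname != "en.wikipedia.org":
        return False
    title = parts.path[1:]  # drop the leading slash
    # A dot or slash in the prospective title rules out sitemap.xml,
    # xmlrpc.php, double-slash URLs, icons in subdirectories, and so on.
    if not title or "." in title or "/" in title:
        return False
    if title.lower().startswith(BLUECOAT_PREFIXES):
        return False
    ua = user_agent.lower()
    return not any(s in ua for s in EXCLUDED_UA_SUBSTRINGS)
```

Serial-number URLs such as /B008NAYASM pass the title check, which is why they have to be excluded by user agent rather than by URL shape.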
The result was 149 log entries at a 1/1000 sample rate, for the week of September 8-14, implying a request rate of about 639,000 per month. This is about 0.006% of the English Wikipedia's page view rate.
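The extrapolation can be checked with a few lines of arithmetic. The 30-day month is an assumption on my part here; the function and parameter names are only for illustration.

```python
def monthly_requests(sampled_entries, sample_denominator, days_sampled,
                     days_per_month=30):
    """Scale a sampled log count up to an estimated monthly request count."""
    # Undo the 1/N sampling, then convert the sampled period to a month.
    total_in_period = sampled_entries * sample_denominator
    return total_in_period * days_per_month / days_sampled


# 149 entries at a 1/1000 sample rate over 7 days
estimate = monthly_requests(149, 1000, 7)
print(round(estimate))
```

This gives roughly 638,571 requests per month, which rounds to the 639,000 figure above.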
The 149 URLs are at http://paste.tstarling.com/p/uhtFqg.html
-- Tim Starling