Thanks, Tim, for running those numbers. They seem to suggest that the
URL structure works in most cases.
On Wed, Sep 18, 2013 at 12:07 AM, Tim Starling <tstarling(a)wikimedia.org> wrote:
On 17/09/13 13:59, Jon Robson wrote:
I would suggest taking a look at the number of 404s caused by people
trying to access pages without the wiki prefix.... This would be
interesting data to go alongside this interesting proposal...
There are lots of different sorts of 404s, so it's necessary to do
some filtering. For example:
* double-slashes, due to bug 52253
* sitemap.xml
* Apple touch icons
* bullet.gif in various directories
* vulnerability scanning, e.g. xmlrpc.php
* BlueCoat verify/notify, as described in
<http://www.webmasterworld.com/search_engine_spiders/3859463.htm>
* Serial numbers like http://en.wikipedia.org/B008NAYASM
I filtered out everything with a dot or slash in the prospective
article title, as well as the BlueCoat URLs and the UAs responsible
for serial number URLs. To simplify analysis, I took log lines from
the English Wikipedia only.
Most of the remaining log entries were search engine crawlers, so I
took those out too.
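For anyone who wants to try similar filtering on their own logs, here is
a rough Python sketch of the steps described above. It is not the script
that was actually run. The function names, the "verify-"/"notify-"
prefixes used to spot BlueCoat URLs, and the crawler substrings are all
assumptions made for illustration; the user agents behind the serial
number URLs are not named in the thread, so they are passed in as a
parameter.

```python
from urllib.parse import urlsplit

# Hypothetical markers: the thread does not say how BlueCoat URLs or
# crawlers were actually identified.
BLUECOAT_PREFIXES = ("verify-", "notify-")
CRAWLER_HINTS = ("bot", "crawler", "spider")


def prospective_title(url):
    """Return the URL path without its leading slash, or None if empty."""
    path = urlsplit(url).path
    if not path.startswith("/") or len(path) == 1:
        return None
    return path[1:]


def keep(url, user_agent, serial_number_uas=frozenset()):
    """Return True if a 404 log entry survives all of the filters."""
    title = prospective_title(url)
    if title is None:
        return False
    # A dot or slash in the title covers sitemap.xml, touch icons,
    # bullet.gif, xmlrpc.php and the double-slash URLs.
    if "." in title or "/" in title:
        return False
    # BlueCoat verify/notify requests (prefixes are assumed).
    if title.lower().startswith(BLUECOAT_PREFIXES):
        return False
    # User agents known to request serial-number URLs.
    if user_agent in serial_number_uas:
        return False
    # Search engine crawlers (crude substring match, assumed).
    ua = user_agent.lower()
    if any(hint in ua for hint in CRAWLER_HINTS):
        return False
    return True
```

The substring match on the user agent is deliberately crude; a real run
would probably want a curated crawler list.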
The result was 149 log entries at a 1/1000 sample rate, for the week
of September 8-14, implying a request rate of about 639,000 per month.
This is about 0.006% of the English Wikipedia's page view rate.
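The extrapolation above can be checked with a few lines of Python. This
is only a sketch of the arithmetic: the 30-day month is an assumption
(it is the value that reproduces the 639,000 figure), and the function
name is made up for illustration.

```python
def monthly_requests(sampled_entries, sample_denominator,
                     days_in_sample, days_per_month=30):
    """Scale a sampled log count up to an estimated monthly request count."""
    # Undo the 1-in-N sampling to get the real count for the sample period.
    actual_in_period = sampled_entries * sample_denominator
    # Convert to a daily rate, then to a month (30 days is an assumption).
    return actual_in_period / days_in_sample * days_per_month


estimate = monthly_requests(sampled_entries=149,
                            sample_denominator=1000,
                            days_in_sample=7)
print(round(estimate))
```

That comes out at roughly 638,571 requests per month, which rounds to
the 639,000 quoted above.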
The 149 URLs are at
http://paste.tstarling.com/p/uhtFqg.html
-- Tim Starling
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l