Eliminating 'unsuccessful search' and 'special' pages from the count,
then analysing 100,000 lines from the raw log with this filtering,
gives the following stats:
bin (secs)   total pages   cumulative %
 0              57360      83.443651%
 1               6929      93.523516%
 2               2028      96.473720%
 3               1034      97.977917%
 4                640      98.908948%
 5                314      99.365735%
 6                157      99.594129%
 7                 81      99.711962%
 8                 61      99.800701%
 9                 46      99.867619%
10                 18      99.893804%
11                 12      99.911261%
12                 16      99.934537%
13                 13      99.953448%
14                  6      99.962177%
15                  6      99.970905%
16                  6      99.979634%
17                  2      99.982543%
18                  0      99.982543%
19                  3      99.986907%
20                  2      99.989817%
summary 68741 hits in 41366.343 secs, avg = 0.601771039118
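For reference, the binning above can be reproduced with something like
the following sketch. The log format, field positions, and filter
patterns here are assumptions, not the real script -- each line is taken
to carry a timestamp, a service time in seconds, and a URL:

```python
import math
from collections import Counter

def bin_service_times(lines):
    """Bin page accesses by the integer part of their service time,
    skipping 'special' and unsuccessful-search pages."""
    bins = Counter()
    total_time = 0.0
    hits = 0
    for line in lines:
        # Assumed format: "<timestamp> <service_time_secs> <url>"
        timestamp, secs, url = line.split(None, 2)
        # Filter patterns are guesses at how the excluded pages appear:
        if "title=Special:" in url or "Unsuccessful" in url:
            continue
        secs = float(secs)
        bins[math.floor(secs)] += 1
        total_time += secs
        hits += 1
    # Cumulative percentage per bin, as in the table above
    running = 0
    rows = []
    for b in sorted(bins):
        running += bins[b]
        rows.append((b, bins[b], 100.0 * running / hits))
    return rows, hits, total_time

rows, hits, total_time = bin_service_times([
    "20020713011714 0.412 /wiki/Sport",
    "20020713011715 28.783 /wiki/Historical_anniversaries",
    "20020713011716 0.251 /w/wiki.phtml?title=Special:Orphans",
])
# The Special: hit is filtered; the two kept hits land in bins 0 and 28.
```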
Only 9 non-special pages took over 20 seconds; here they are:
20020713011714 28.783 /wiki/Historical_anniversaries
20020713012523 20.301 /wiki/Sport
20020713014205 23.161 /wiki/Federal_Standard_1037C
20020713014723 25.357 /w/wiki.phtml?title=Free_On-line_Dictionary_of_Computing/O_-_Q&redirect=no
20020713015936 21.513 /w/wiki.phtml?title=Wikipedia:Bug_reports&action=history
20020713022203 25.252 /wiki/Free_On-line_Dictionary_of_Computing/L_-_N
20020713025105 29.975 /w/wiki.phtml?title=Free_On-line_Dictionary_of_Computing/E_-_H&redirect=no
20020713033140 20.802 /wiki/Feature_requests
20020713043401 41.392 /w/wiki.phtml?title=Complete_list_of_encyclopedia_topics/R&diff=78830&oldid=71983
It's interesting to note that random spidering hits 'special' pages
about 30% of the time.
The page accesses above have been binned by the integer part of their
service time as recorded in the logs.
This is looking really good.
-------------------------------------------------
SUGGESTION #1:
Looking at the logs suggests that many of the worst results are
generated by the special-page options with large counts -- particularly
the versions with count=5000.
Here's my proposal: we should not list the options with count > 500 for
users *who are not logged in*.
So, at the bottom of the orphans page, a logged-in user would see
View (previous 50) (next 50) (20 | 50 | 100 | 250 | 500 | 1000 | 2500
| 5000).
and a casual browser (and any busy bots or spiders) would see
View (previous 50) (next 50) (20 | 50 | 100 | 250 | 500).
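A minimal sketch of that gating, in Python for brevity -- the real
change would of course live in the PHP special-page code, and the names
here are made up:

```python
PAGE_COUNTS = [20, 50, 100, 250, 500, 1000, 2500, 5000]
MAX_ANON_COUNT = 500  # proposed cap for users who are not logged in

def visible_counts(logged_in):
    """Return the count options to offer in the 'View (...)' footer."""
    if logged_in:
        return PAGE_COUNTS
    return [c for c in PAGE_COUNTS if c <= MAX_ANON_COUNT]
```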
Random selection from the first list will search on average
(50+50+20+50+100+250+500+1000+2500+5000) / 10 = 952 pages.
Random selection from the second list (7 options) will search on average
(50+50+20+50+100+250+500) / 7 = ~146 pages,
a reduction in load of more than a factor of six.
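Checking the arithmetic -- this assumes a spider picks uniformly among
the footer links, and note that the second list has only 7 options:

```python
# All ten links in the full footer, including (previous 50) and (next 50)
first_list = [50, 50, 20, 50, 100, 250, 500, 1000, 2500, 5000]
# The seven links left after dropping the count > 500 options
second_list = [50, 50, 20, 50, 100, 250, 500]

avg_first = sum(first_list) / len(first_list)     # 952.0 pages
avg_second = sum(second_list) / len(second_list)  # ~145.7 pages
reduction = avg_first / avg_second                # ~6.5x
```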
Removing these big outlier loads may well take some of the strain off
ordinary page loads that happen to occur at the same time.
------------------------------------------------
SUGGESTION #2:
The 'unsuccessful search' pages can be enormous: they accumulate all the
bad searches in a whole month. As Wikipedia has become more popular,
they have grown huge, and they now take a long time to load. We should
make these weekly or daily instead of monthly, and perhaps split up the
old ones using a script.
This will also have the effect of improving the 'most wanted' rating of
frequently missed searches, as currently only one instance a month counts.
Or perhaps they should be generated as a special page from the database?
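The splitting script could look something like this sketch -- the
one-entry-per-line format with a YYYYMMDDhhmmss timestamp is a guess
based on the log excerpts above, not the actual file layout:

```python
import datetime
from collections import defaultdict

def split_by_week(lines):
    """Group monthly bad-search entries into ISO-week buckets,
    so each bucket can be written out as its own smaller page."""
    buckets = defaultdict(list)
    for line in lines:
        # Assumed format: "<YYYYMMDDhhmmss> <search term>"
        stamp, term = line.split(None, 1)
        date = datetime.date(int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]))
        year, week, _ = date.isocalendar()
        buckets[(year, week)].append(term)
    return buckets
```

Generating them as a special page straight from the database would make
this moot, since the date range could then just be a query parameter.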
---------------------------------------------------
Neil