> This is fine, but you are trying a new car, and I want you to mount a
> speedometer so we can monitor the speed while driving, continuously
> keeping an eye on the problem instead of making believe we can fix it
> once and for all.
I agree that we need clearer data about performance, and that we need to
monitor that data over time. My main point was to say that
1) we've got a faster and more adaptable code base in the wings, and 2) this
will give us time to do performance monitoring right, as well as to think
clearly about which things to optimize. As I said before, I think everybody
is trying to solve this problem, and that the developers are clearly moving
in the right direction. I'm no performance expert, but as a jack of all
trades and master of none, I'd like to toss out a few thoughts.
I'm concerned that we don't gather that data in a way which itself creates
performance problems.
My experience is that it is a bad idea to have every function append
information to a log file. I've never seen anyone log on every function call
without a noticeable effect on performance, at least on well-designed ASP or
JSP sites that had already minimized disk I/O.
I assume the same would be true with PHP. However, having just read
Jimbo's e-mail, it could be that the I/O performance of the underlying
systems we were using -- mostly WinNT on slightly older machines -- was to
blame.
That said, I'm aware of three common ways to get around this performance
bottleneck.
Most commonly, people just turn up the detail on their web server logging
and parse those logs for whatever performance information they can get from
them. The problem with this is that it sometimes fails to provide the
fine-grained information we'd need to see exactly which pages are causing
the slowdown.
Second, I've seen a lot of people storing log information in a database.
This is probably what we should do if we need more precise information than
we can get from the server logs.
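To make that concrete, here is the sort of thing I have in mind -- just a
sketch, with a made-up 'perf_log' table and made-up columns, and assuming a
MySQL connection is already open -- that times the whole request and writes
one row per page view:

  <?php
  // Rough sketch only: write one timing row per request into a log table.
  // mysql_connect() and mysql_select_db() are assumed to have run already;
  // 'perf_log' and its columns are invented for this example.
  $start = microtime();

  // ... normal page generation goes here ...

  list($u1, $s1) = explode(" ", $start);
  list($u2, $s2) = explode(" ", microtime());
  $elapsed = ((float)$s2 + (float)$u2) - ((float)$s1 + (float)$u1);

  $url = mysql_escape_string($_SERVER['REQUEST_URI']);
  mysql_query("INSERT INTO perf_log (logtime, url, seconds) " .
              "VALUES (NOW(), '$url', " . (float)$elapsed . ")");
  ?>

One row per request is cheap, and the table can then be queried for exactly
the slow pages we care about.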
The third and most flexible solution is to set up a thread which
asynchronously processes a queue of log data and occasionally writes the
information to disk. This is clearly the most complex to implement, and I
wouldn't even consider it unless we decide we need real-time performance
monitoring, or unless there's something built into PHP to do this.
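As far as I know PHP doesn't give us a background thread, but the closest
cheap approximation I can think of (again just a sketch, with made-up
function names and a made-up log path) is to queue entries in memory and
write them out in a single append when the request finishes:

  <?php
  // Rough sketch only: collect log lines in memory, flush once per request.
  $log_queue = array();

  function queue_log($msg) {
      global $log_queue;
      $log_queue[] = date("YmdHis") . " " . $msg;
  }

  function flush_log() {
      global $log_queue;
      if (count($log_queue) == 0) return;
      $fp = fopen("/var/log/wiki-perf.log", "a");   // path is made up
      if ($fp) {
          fwrite($fp, implode("\n", $log_queue) . "\n");
          fclose($fp);
      }
  }

  // One disk write per request, after the page has been generated.
  register_shutdown_function("flush_log");
  ?>

That gets us one disk write per request instead of one per function call, and
the queue costs almost nothing when logging is switched off.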
Regardless of how we choose to get this data, we can and should still be
thinking about how to optimize those functions which the server spends a LOT
of time doing, or those things which take a LONG time, as Wikipedia usage
could very well continue to scale up dramatically, and we need to be ready
for this ahead of time. On the other hand, there is a cost to optimizations,
if only in the added complexity of the code. Complex code is more difficult
to update and maintain, and it is less likely to attract new developers over
time, so we need to make considered choices about which functions to 1) leave
as they are, 2) optimize, or 3) remove altogether.
> In the last week, my script made 410 attempts at 20 minute
> intervals to reach the page http://www.wikipedia.com/wiki/Chemistry
> Out of these, only 86% were served in less than 5 seconds.
> Five percent of the calls timed out (my limit is 60 seconds).
> Now, this is far better than the worst problems that Wikipedia
> saw in April or May, but it is still pretty miserable.
I believe the performance in the new code is much improved.
Even under all the load I can put on it with a T1 line, the beta software is
producing average load times of less than 1.5 seconds.
A number of people are now working on stress testing this software before it
is put into production. And I think there is a general commitment to
solving the performance problem, and I see lots of movement in the right
direction.
Eliminating 'unsuccessful search' and 'special' pages from the count, and
analysing 100,000 lines from the raw log with this filtering, gives the
following stats (page accesses are binned by the integer part of their
service time as recorded in the logs):
bin in seconds, total pages, cumulative percentage
0 57360 83.443651%
1 6929 93.523516%
2 2028 96.473720%
3 1034 97.977917%
4 640 98.908948%
5 314 99.365735%
6 157 99.594129%
7 81 99.711962%
8 61 99.800701%
9 46 99.867619%
10 18 99.893804%
11 12 99.911261%
12 16 99.934537%
13 13 99.953448%
14 6 99.962177%
15 6 99.970905%
16 6 99.979634%
17 2 99.982543%
18 0 99.982543%
19 3 99.986907%
20 2 99.989817%
summary 68741 hits in 41366.343 secs, avg = 0.601771039118
Only 9 non-special pages took over 20 seconds; here they are:
20020713011714 28.783 /wiki/Historical_anniversaries
20020713012523 20.301 /wiki/Sport
20020713014205 23.161 /wiki/Federal_Standard_1037C
20020713014723 25.357 /w/wiki.phtml?title=Free_On-line_Dictionary_of_Computing/O_-_Q&redirect=no
20020713015936 21.513 /w/wiki.phtml?title=Wikipedia:Bug_reports&action=history
20020713022203 25.252 /wiki/Free_On-line_Dictionary_of_Computing/L_-_N
20020713025105 29.975 /w/wiki.phtml?title=Free_On-line_Dictionary_of_Computing/E_-_H&redirect=no
20020713033140 20.802 /wiki/Feature_requests
20020713043401 41.392 /w/wiki.phtml?title=Complete_list_of_encyclopedia_topics/R&diff=78830&oldid=71983
It's interesting to note that random spidering hits 'special' pages
about 30% of the time.
This is looking really good.
-------------------------------------------------
SUGGESTION #1:
Looking at the logs suggests that many of the worst results are
generated on the special page options with large counts -- particularly
the versions with count=5000.
Here's my proposal: we should not list the options with count > 500 for
users *who are not logged in*.
So, at the bottom of the orphans page, a logged-in user would see
View (previous 50) (next 50) (20 | 50 | 100 | 250 | 500 | 1000 | 2500 | 5000).
and a casual browser (and any busy bots or spiders) would see
View (previous 50) (next 50) (20 | 50 | 100 | 250 | 500).
Random selection from the first list will search on average
(50+50+20+50+100+250+500+1000+2500+5000) / 10 = 952 pages.
Random selection from the second list will search on average
(50+50+20+50+100+250+500) / 7 = about 146 pages,
a reduction in load of more than a factor of six.
Removing these big outlier loads may well take some of the strain off
ordinary page loads that happen to occur at the same time.
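In code the change would be tiny. Something along these lines would do (a
sketch only; the $userIsLoggedIn flag is a placeholder for however the code
actually detects a login):

  <?php
  // Rough sketch: only logged-in users get the expensive count options.
  $userIsLoggedIn = false;   // placeholder for the real login check

  $limits = array(20, 50, 100, 250, 500, 1000, 2500, 5000);
  if (!$userIsLoggedIn) {
      // Casual browsers, bots and spiders never see count > 500.
      $limits = array(20, 50, 100, 250, 500);
  }

  // Build the "(20 | 50 | ...)" part of the link bar from whatever is left.
  print "View (previous 50) (next 50) (" . implode(" | ", $limits) . ")\n";
  ?>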
------------------------------------------------
SUGGESTION #2:
The 'Unsuccessful search' pages can be enormous, since they accumulate all
the bad searches from a whole month. As Wikipedia has become more popular,
these pages have grown huge, and they now take a long time to load. We should make
these weekly or daily instead of monthly, and perhaps split up the old
ones using a script.
This will also have the effect of improving the 'most wanted' rating of
frequently missed searches, as currently only one instance a month counts.
Or perhaps they should be generated as a special page from the database?
---------------------------------------------------
Neil
I used the nice hammerhead tool (http://hammerhead.sourceforge.net) to
stress test the beta.wikipedia.com server. The tool lets you simulate
several simultaneous users repeatedly requesting pages from (or
posting to) the server.
I have my users request RecentChanges (33% of the time) and issue
searches (66% of the time). Here's the average response time of the
server:
1 user: 2 sec
5 users: 4 sec
10 users: 8 sec
20 users: 11 sec
100 users: 42 sec
These are only the times it took the server to respond; the actual
total time to complete a request is not as useful in my case because
of my limited bandwidth.
Axel
The Apache server allows for customized server log messages
(http://httpd.apache.org/docs/logs.html#accesslog). I think we should
include the directive %T, which reports the time it took to serve a
request. That way, we could process the server logs to pinpoint the
precise conditions which cause requests to take a long time.
Adding the line
LogFormat "%h %l %u %t \"%r\" %>s %b %T" custom
to httpd.conf and changing
CustomLog /usr/local/apache/logs/access_log common
to
CustomLog /usr/local/apache/logs/access_log custom
should do the job.
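Once that's in place, a short script could bin requests by service time,
much like the table Neil posted. This is only a sketch, and it assumes %T
ends up as the last whitespace-separated field, as it does with the
LogFormat line above:

  <?php
  // Rough sketch: histogram of Apache requests by the %T (seconds) field.
  $bins = array();
  $total = 0;
  $fp = fopen("/usr/local/apache/logs/access_log", "r");
  while ($fp && !feof($fp)) {
      $line = trim(fgets($fp, 16384));
      if ($line == "") continue;
      $fields = preg_split('/\s+/', $line);
      $secs = (int) $fields[count($fields) - 1];   // %T is the last field
      $bins[$secs] = isset($bins[$secs]) ? $bins[$secs] + 1 : 1;
      $total++;
  }
  if ($fp) fclose($fp);

  if ($total > 0) {
      ksort($bins);
      $running = 0;
      foreach ($bins as $secs => $count) {
          $running += $count;
          printf("%d\t%d\t%f%%\n", $secs, $count, 100 * $running / $total);
      }
  }
  ?>

Run against the real log it would give us the bin / count / cumulative
percentage columns directly, without any changes to the wiki code itself.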
Axel
On Sat, 13 Jul 2002, Daniel Mayer wrote:
> Much is now being done to remedy performance problems -- so I do believe what
> you said is needlessly rude (even if there is a grain of truth to it). This
> is an issue that has crept up upon the developers as new features were added
> -- many of which were asked for by the users.
If performance issues creep up on you, then you have not designed your
system for performance measurement and monitoring. This means you are
clueless. It is like driving a car without a speedometer, and being
surprised when you are caught speeding.
In the last week, my script made 410 attempts at 20 minute intervals to
reach the page http://www.wikipedia.com/wiki/Chemistry
Out of these, only 86% were served in less than 5 seconds. Five percent
of the calls timed out (my limit is 60 seconds). Now, this is far better
than the worst problems that Wikipedia saw in April or May, but it is
still pretty miserable. The non-English Wikipedias feature very similar
numbers.
The Sevilla project (http://enciclopedia.us.es/) serves 96% of all my
attempts in under 2 seconds, and 99% in under five seconds. This should
probably be attributed to luck rather than skill, but it helps move people
from the Spanish Wikipedia over to the breakout project.
> Software development seems to often work a lot like article development --
That's OK, but just as the basic wiki software defines the concept of an
article (it can be written, reviewed, its history tracked, modified,
removed), the software should define a framework for new functionality so
that its impact on performance can be measured and it can be turned on or
off. Think modules.
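Concretely, I imagine something like the following. Every name here is
invented and it is only meant to show the shape of the idea: each feature
goes through one wrapper that can time it and switch it off.

  <?php
  // Rough sketch: per-feature switch plus timing, so impact can be measured.
  $enabled_features = array("search" => true, "recentchanges" => true);

  function run_feature($name, $func) {
      global $enabled_features;
      if (empty($enabled_features[$name])) {
          return "";                       // feature is switched off
      }
      list($u1, $s1) = explode(" ", microtime());
      $output = $func();                   // call the feature's entry point
      list($u2, $s2) = explode(" ", microtime());
      $elapsed = ((float)$s2 + (float)$u2) - ((float)$s1 + (float)$u1);
      error_log("feature $name took $elapsed seconds");
      return $output;
  }

  // e.g. print run_feature("search", "doSearchPage");
  // where doSearchPage is a made-up name for the search entry point.
  ?>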
--
Lars Aronsson (lars(a)aronsson.se)
tel +46-70-7891609
http://aronsson.se/  http://elektrosmog.nu/  http://susning.nu/
I thought I'd bring up this idea again since it might be easy to
implement with the new codebase.
If you put text such as [$\int_{x=0}^\infty x^2 dx$] in Wiki, upon
saving the article, TeX will be called and translate the formula into
an image, and store the image on the server and its name in a database
indexed with the formula text. When the Wiki page is presented, the
image is inlined (and an alt attribute containing the formula text
added). When the page is later edited and saved again, the system
first checks whether an up-to-date image of the formula already
exists; if not, TeX is called to regenerate it.
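The check-and-render step could be as simple as the sketch below. Everything
in it (the paths, the latex/dvips/convert chain, using the MD5 of the formula
as the file name instead of a database lookup) is an assumption about how it
might be done, not a description of existing code, and a real version would
also have to sanitize the formula text:

  <?php
  // Rough sketch: render a formula to PNG once, keyed by its MD5, and reuse it.
  function formula_image($formula) {
      $hash = md5($formula);
      $png  = "/var/www/math/$hash.png";   // image directory is made up

      if (!file_exists($png)) {
          $dir = "/tmp/tex_$hash";
          if (!is_dir($dir)) mkdir($dir, 0700);
          $fp = fopen("$dir/f.tex", "w");
          fwrite($fp, "\\documentclass{article}\\pagestyle{empty}\n" .
                      "\\begin{document}\$" . $formula . "\$\\end{document}\n");
          fclose($fp);
          // latex -> dvips -> convert; any equivalent image chain would do
          system("cd $dir && latex f.tex && dvips -E f.dvi -o f.ps && convert f.ps $png");
      }
      $alt = htmlspecialchars($formula);
      return "<img src=\"/math/$hash.png\" alt=\"$alt\">";
  }
  ?>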
This would make mathematicians, computer scientists, physicists and
chemists happy. TeX includes a package for typesetting chemical
structure formulas and another one for quite general labeled diagrams
and trees. There's also a TeX package for typesetting musical notes and
another one for chess positions.
The concept could be expanded to other programs which can produce
graphics on-the-fly based on a textual description. This includes
gnuplot (graphs of functions) and maybe packages such as GD,
imagemagick or even GIMP.
Axel
The alphabet-soup filling of the dummy articles has added lots of junk
to the search indices, which is good.
Try doing a search for the three-letter string 'vyf'. It brings up lots
of articles containing 'vyf' in various combinations of upper and
lower-case.
But only one of them has a keyword-in-context display, which seems strange.
Neil
Neil,
Maybe you could pepper your stress testing with some calls to the special
pages as well; I would imagine that searching and RecentChanges are the
most important ones, but the more the better.
Axel