On Thu, Jul 11, 2002 at 05:21:40PM +0100, Neil Harris wrote:
One interesting observation is that even when the English-language
Wikipedia is jammed up,
the international ones, which I believe run on the same server, are
often working OK. This suggests a software, not a hardware, problem.
I've seen both English-down-others-working, and all-down.
Database performance is likely to be constrained by
two things:
* locking
* disk I/O
Locking is a problem because it serializes accesses, reducing
opportunities for parallel processing, and creating bottlenecks on the
locked resources.
Lock contention can be reduced by:
* locking for as short a time as possible
* locking with the finest grain possible
* using a database which supports concurrent transactions with reduced
locking
What about switching to Postgres?
It is said to have better locking.
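To illustrate the two lock-discipline points above, here's a minimal Python sketch (the names are hypothetical, not Wikipedia's actual code) contrasting a lock held across a whole slow request with one held only around the shared-state update:

```python
import threading

counter_lock = threading.Lock()  # fine-grained: guards only this counter
page_views = {}

def record_view_coarse(title, render):
    # Bad: the lock is held across the whole (slow) render,
    # so every other request serializes behind it.
    with counter_lock:
        page_views[title] = page_views.get(title, 0) + 1
        return render(title)

def record_view_fine(title, render):
    # Better: the lock is held only for the brief shared-state
    # update; the slow render runs outside the critical section,
    # so renders can proceed in parallel.
    with counter_lock:
        page_views[title] = page_views.get(title, 0) + 1
    return render(title)
```

Both functions produce the same result; the difference is purely how long other threads are blocked.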
Disk I/O can be made faster by
* using disks which spin fast (reducing rotational latency)
* putting them in a big RAID with lots of spindles and a high-speed
attachment
* using an operating system which multi-threads I/O properly
Wikipedia script performance is unlikely to be the bottleneck. We now
have the opportunity to load the test system heavily and measure CPU
load, so we can estimate this factor accurately.
Even if it is far from consuming 100% CPU, if it's slow, it occupies
memory for a longer time. Or it may simply be using too much memory per
thread.
Something else could be:
* Memory hogging
This is a little-known nasty factor in server programming. Here, the
problem is worker threads being tied up by slow or malfunctioning
clients, such as those on modems, or with high packet loss, or both.
Say a worker thread consumes W Mbytes of store, and an access transfers
50k bytes (400 kbits) of data.
Then a really slow link at say 20 kbps will take 20 seconds to download
this page. In doing so, it locks W Mbytes in store for that entire time.
If we have X megabytes of store, and slow clients are the dominating
factor, then we can only accommodate X/W concurrent workers, serving
(1/20)*X/W pages per second.
For X = 256, W=2, that's 6.4 hits per second. Therefore, a server needs
to have lots of RAM to prevent slow clients from blocking it. Hmm...
increasing the OS socket buffer size to > 50k might be a win here.
Fortunately, the new server has lots of RAM.
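As a sanity check on the arithmetic above, here is the same capacity model as a few lines of Python, using the figures stated in the text (50 kbyte pages, 20 kbps modem links, W = 2 Mbytes per worker, X = 256 Mbytes of store):

```python
PAGE_KBYTES = 50   # size of one page download
LINK_KBPS = 20     # a really slow modem link
X_MBYTES = 256     # store available for worker threads
W_MBYTES = 2       # store tied up per worker thread

# 50 kbytes = 400 kbits; at 20 kbps that is 20 seconds per page.
seconds_per_page = PAGE_KBYTES * 8 / LINK_KBPS

# Only X/W workers fit in memory at once.
max_workers = X_MBYTES // W_MBYTES

# Each worker completes one page every seconds_per_page seconds.
pages_per_second = max_workers / seconds_per_page

print(seconds_per_page)   # 20.0
print(max_workers)        # 128
print(pages_per_second)   # 6.4
```

On this model, raising the kernel socket buffer above one full page (> 50 kbytes) would let a worker hand the whole response to the kernel and free its W Mbytes immediately, instead of babysitting the slow link for 20 seconds.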
2 megabytes of non-shared memory per thread?
That would be enormous.
What's the real value like?
Also, while the thread is tied up, it may be unnecessarily holding a
database connection. But that's not likely to be a major problem.
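One way to answer the "what's the real value like" question empirically: a sketch using Python's standard resource module to read the process's peak resident set size. Note the unit is an assumption to verify: on Linux ru_maxrss is in kilobytes, but on some BSDs it is in bytes.

```python
import resource

def peak_rss_kbytes():
    # Peak resident set size of this process so far.
    # On Linux, ru_maxrss is reported in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(peak_rss_kbytes())
```

Comparing this figure before and after serving a request (or across worker processes) would show how much non-shared memory each worker actually costs.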
* Swapping
Once you are doing VM swapping on a webserver or database, performance
plummets. Memory leaks somewhere could be bloating processes, causing
the server to swap.
Swapping isn't a problem, it's a symptom.
Heavy Apache or MySQL bloat is unlikely, and Wikipedia threads are too
short-lived to have a chance of bloating much.