We really, really need to move the database back to a machine with a decent hard drive. The wikis are very sluggish, and a fair chunk of it's from waiting on the database.
Ursula's sitting around with a 90% idle CPU, but everything's blocked on disk I/O to the point it's got a load average of about 16. At any given time from 8-20 processes are blocked and waiting. Operations that hit a lot of rows like history and watchlist are particularly badly hit since they don't play as well with caching.
If Geoffrin's not going to be up soon, and Pliny's still emitting spurious disk errors, what are our options out of the available machines?
-- brion vibber (brion @ pobox.com)
On Mon, Jan 12, 2004 at 11:11:28AM -0800, Brion Vibber wrote:
Ursula's sitting around with a 90% idle CPU, but everything's blocked on disk I/O to the point it's got a load average of about 16. At any given time from 8-20 processes are blocked and waiting. Operations that hit a lot of rows like history and watchlist are particularly badly hit since they don't play as well with caching.
I mentioned this before, but it may have gotten lost as the server fell over completely right afterwards...
Is DMA turned on and interrupt unmasking enabled on Ursula?
`/sbin/hdparm /dev/hda` (or whatever device it is) will show the current configuration.
`/sbin/hdparm -u1 -c3 -d1 /dev/hda` will turn on DMA, turn off interrupt masking, and enable 32-bit I/O support. Interrupt masking itself can have a huge impact on the amount of CPU time spent waiting for the disk, and it is almost always set conservatively (i.e. wrongly) on an untweaked Linux installation.
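For what it's worth, a quick before-and-after check would look something like the following (assuming the device really is /dev/hda; these settings don't survive a reboot, so they'd also need to go into a boot script):

    /sbin/hdparm /dev/hda              # show current settings
    /sbin/hdparm -tT /dev/hda          # baseline: cached reads and buffered disk reads
    /sbin/hdparm -u1 -c3 -d1 /dev/hda  # unmask IRQs, 32-bit I/O, DMA on
    /sbin/hdparm -tT /dev/hda          # re-measure and compare against the baseline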
On Jan 12, 2004, at 11:22, audin@okb-1.org wrote:
Is DMA turned on and interrupt unmasking enabled on Ursula?
DMA is on. Interrupt unmasking, no.
`/sbin/hdparm /dev/hda` (or whatever device it is) will show the current configuration.
# /sbin/hdparm /dev/hda
/dev/hda:
 multcount    = 16 (on)
 IO_support   =  0 (default 16-bit)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 7476/255/63, sectors = 120103200, start = 0
`/sbin/hdparm -u1 -c3 -d1 /dev/hda` will turn on DMA, turn off interrupt masking, and enable 32-bit I/O support. Interrupt masking itself can have a huge impact on the amount of CPU time spent waiting for the disk, and it is almost always set conservatively (i.e. wrongly) on an untweaked Linux installation.
Well, went ahead and turned on interrupt unmasking and 32-bit io. The numbers in vmstat don't look too much different yet, but we'll see.
-- brion vibber (brion @ pobox.com)
On Mon, 12 Jan 2004 11:39:59 -0800 Brion Vibber brion@pobox.com wrote:
Well, went ahead and turned on interrupt unmasking and 32-bit io. The numbers in vmstat don't look too much different yet, but we'll see.
-- brion vibber (brion @ pobox.com)
On my IBM 80 GB, 7200 rpm drive:
Before: IO_support = 0 (default 16-bit)
# hdparm -t /dev/hdd
/dev/hdd:
 Timing buffered disk reads:  82 MB in  3.05 seconds = 26.89 MB/sec
After: IO_support = 1 (32-bit)
# hdparm -t /dev/hdd
/dev/hdd:
 Timing buffered disk reads:  74 MB in  3.03 seconds = 24.42 MB/sec
No changes seen....
Shaihulud
On Mon, Jan 12, 2004 at 09:03:06PM +0100, Camille Constans wrote:
On my IBM 80 GB, 7200 rpm drive:
Before: IO_support = 0 (default 16-bit)
Timing buffered disk reads: 82 MB in 3.05 seconds = 26.89 MB/sec
After: IO_support = 1 (32-bit)
Timing buffered disk reads: 74 MB in 3.03 seconds = 24.42 MB/sec
No changes seen....
Yes, I/O mode is odd. Sometimes it makes a significant difference, other times not. I suspect the deal is that, at the hardware level, it shouldn't matter, since it is a 16 bit connection to the drive in either case. So the cases where it makes a difference are instances where the 16-bit mode of the controller is flawed, but the 32-bit mode isn't.
DMA and interrupt masking almost always have pretty big impacts, though.
At 11:11 12/01/2004 -0800, you wrote:
We really, really need to move the database back to a machine with a decent hard drive. The wikis are very sluggish, and a fair chunk of it's from waiting on the database.
Ursula's sitting around with a 90% idle CPU, but everything's blocked on disk I/O to the point it's got a load average of about 16. At any given time from 8-20 processes are blocked and waiting. Operations that hit a lot of rows like history and watchlist are particularly badly hit since they don't play as well with caching.
If Geoffrin's not going to be up soon, and Pliny's still emitting spurious disk errors, what are our options out of the available machines?
For a small amount of money, put a RAID array in a machine to speed up access time, and/or split the DB across a few boxes.
-- brion vibber (brion @ pobox.com)
Dave Caroline aka archivist
On Jan 12, 2004, at 1:11 PM, Brion Vibber wrote:
We really, really need to move the database back to a machine with a decent hard drive. The wikis are very sluggish, and a fair chunk of it's from waiting on the database.
Ursula's sitting around with a 90% idle CPU, but everything's blocked on disk I/O to the point it's got a load average of about 16. At any given time from 8-20 processes are blocked and waiting. Operations that hit a lot of rows like history and watchlist are particularly badly hit since they don't play as well with caching.
If Geoffrin's not going to be up soon, and Pliny's still emitting spurious disk errors, what are our options out of the available machines?
-- brion vibber (brion @ pobox.com)
It seems to me that one of the biggest problems we seem to have is reliable and speedy hard drives. Perhaps it might be wise to consider possibly purchasing an external disk subsystem? Something like the Apple Xserve RAID systems are speedy (lots of internal hardware RAID), reliable (easy to swap out disks), and expandable (up to 3.5TB, 1TB in the cheapest configuration). If the database server dies, it should be fairly easy to plug the external disk system into another machine.
http://www.apple.com/xserve/raid/
Might be something worth considering...
-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
On Mon, 12 Jan 2004, Nick Reinking wrote:
It seems to me that one of the biggest problems we seem to have is reliable and speedy hard drives. Perhaps it might be wise to consider possible purchasing an external disk subsystem? Something like the Apple Xserve RAID systems are speedy (lots of internal hardware RAID), reliable (easy to swap out disks), and expandable (up to 3.5TB, 1TB in the cheapest configuration). If the database server dies, it should be fairly easy to plug the external disk system into another machine.
"Like" is the key word there. 6k$ for a [censored] Apple logo is insane. I've bought similar hardware for less than 1/10th that price. If all you need is a drive shelf, start searching eBay. (I can recommendations if anyone cares.)
If you grab a fibre channel shelf (or more than one), I have plenty of drives for the cause (14x18G and 10x9.1G drives gathering dust.) I'd offer an entire Eurologic shelf, but you can find those local in CA easier and faster than shipping across the country. (6 of mine came from Canada :-))
--Ricky
On Jan 12, 2004, at 1:47 PM, Ricky Beam wrote:
On Mon, 12 Jan 2004, Nick Reinking wrote:
It seems to me that one of the biggest problems we seem to have is reliable and speedy hard drives. Perhaps it might be wise to consider possible purchasing an external disk subsystem? Something like the Apple Xserve RAID systems are speedy (lots of internal hardware RAID), reliable (easy to swap out disks), and expandable (up to 3.5TB, 1TB in the cheapest configuration). If the database server dies, it should be fairly easy to plug the external disk system into another machine.
"Like" is the key word there. 6k$ for a [censored] Apple logo is insane. I've bought similar hardware for less than 1/10th that price. If all you need is a drive shelf, start searching eBay. (I can recommendations if anyone cares.)
If you grab a fibre channel shelf (or more than one), I have plenty of drives for the cause (14x18G and 10x9.1G drives gathering dust.) I'd offer an entire Eurologic shelf, but you can find those local in CA easier and faster than shipping across the country. (6 of mine came from Canada :-))
--Ricky
Hey, I'm just saying that we could certainly use a reliable external disk subsystem. If you know of somebody who sells a reliable external disk subsystem for a better price, then that's great. But if you can build a rack-mounted disk system that supports fibre channel, redundant power supplies, disk hot-swapping, and a built-in RAID subsystem, with 1TB of disk space, for less than $600, then let me know. I'd like to buy a few for myself.
-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
This would be cured by moving the database to solid-state memory. Mechanical media have very long seek times, and when many seeks are required, performance becomes unreliable.
Either: run two MySQL instances, one starting after the finest-grained database files have been copied to a ramdisk, and replicate the database back to a file on the hard disk.
Or: install a solid-state IDE disk. For example, a 4GB Compact Flash card has an IDE interface built in as part of the specification. Access time is 0.1ms compared to a mechanical drive's 8.5ms: 85x faster. CF-to-IDE cables are trivial and available.
To put it another way, you would need 85 mechanical drives to provide the seek performance of a solid state equivalent.
Brion Vibber wrote:
We really, really need to move the database back to a machine with a decent hard drive. The wikis are very sluggish, and a fair chunk of it's from waiting on the database.
Ursula's sitting around with a 90% idle CPU, but everything's blocked on disk I/O to the point it's got a load average of about 16. At any given time from 8-20 processes are blocked and waiting. Operations that hit a lot of rows like history and watchlist are particularly badly hit since they don't play as well with caching.
On Jan 12, 2004, at 2:23 PM, Nick Hill wrote:
This would be cured by moving the database to solid state memory. Mechanical media has a very long seek time and when many seeks are required, become unreliable.
Either: Run two MySQL instances, one starting after the finest grained database files have been copied to ramdisk. Replicate database to a hard disk file.
I'm not sure how running two copies of MySQL on a machine is going to be faster than running one. Plus, I don't think MySQL can even work in this configuration.
Or: install a solid-state IDE disk. For example, a 4GB Compact Flash card has an IDE interface built in as part of the specification. Access time is 0.1ms compared to a mechanical drive's 8.5ms: 85x faster. CF-to-IDE cables are trivial and available.
To put it another way, you would need 85 mechanical drives to provide the seek performance of a solid state equivalent.
Hmm... 4GB Compact Flash cards would be:
Too small (4GB is not enough space; we would need a couple hundred cards)
Too expensive (4GB cards cost approximately $1100, or around $330000 for enough space)
Too slow (fast access, but only 5MB/sec read/write)
Limited read/write cycles (they would last about one week and then need to be tossed out)
So, no.
-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
On Mon, 12 Jan 2004, Nick Reinking wrote:
Hmm... 4GB Compact Flash cards would be:
Too small (4GB is not enough space, we would need a couple hundred cards)
Wait a sec, those two hundred cards total 800GB. Are you sure that Wikipedia needs that kind of storage? I thought that current articles are a few gigs, and history a few tens of gigs.
Too expensive (4GB cards cost approximately $1100, or around $330000 for enough space)
Too slow (fast access, but only 5MB/sec read/write)
Limited read/write cycles (they would last about one week and then need to be tossed out)
I agree, Flash cards are not OK for our needs. A server with lots of RAM is much better. Geoffrin would be really optimal, if it worked :(
Someone mentioned $6K for Apple RAID storage, but that price is not so high: here at work we have some kind of RAID NAS (half a terabyte, I think) that went for about $4K. So yeah, NAS can be expensive, but a simple internal SCSI RAID like our current setup is enough, and with regular backups it can be transferred to another machine in case of need (heck, that's what Brion did multiple times in the last weeks if I am not mistaken, and that's enough proof that it works).
Alfio
Alfio Puglisi wrote:
On Mon, 12 Jan 2004, Nick Reinking wrote:
Hmm... 4GB Compact Flash cards would be:
Too small (4GB is not enough space, we would need a couple hundred cards)
Wait a sec, those two hundred cards total 800GB. Are you sure that Wikipedia needs that kind of storage? I thought that current articles are a few gigs, and history a few tens of gigs.
Too expensive (4GB cards cost approximately $1100, or around $330000 for enough space)
Too slow (fast access, but only 5MB/sec read/write)
Limited read/write cycles (they would last about one week and then need to be tossed out)
I agree, Flash cards are not OK for our needs. A server with lots of RAM is much better. Geoffrin would be really optimal, if it worked :(
Here's a crazy thought: Is there a way to make a server identify itself as a *harddrive* on the SCSI bus? If so, we could take a machine with lots of RAM which does nothing else than holding most of the DB in cache (cheaper and faster than Compact Flash:-) and occasionally write stuff to its hard drives. Probably won't need much of a CPU, and maybe not even SCSI drives, just cheap IDE ones, as all it does is caching.
Then, we could plug that thing into the "real" DB server, which just sees a really fast HD.
Well, one can dream...
Magnus
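As an aside, the closest off-the-shelf approximation of this idea works over the network rather than the SCSI bus: the Linux network block device (nbd). A minimal sketch, assuming nbd-server and nbd-client are installed; the host name, port, paths, and sizes here are made up for illustration, and anything on the tmpfs vanishes on reboot:

    # On the cache box: export a file that lives on a RAM-backed tmpfs
    mount -t tmpfs -o size=3g tmpfs /export
    dd if=/dev/zero of=/export/cache.img bs=1M count=3000
    nbd-server 2000 /export/cache.img        # old-style syntax: port, then file

    # On the "real" DB server: attach it and treat it like a very fast disk
    modprobe nbd
    nbd-client cachebox 2000 /dev/nbd0       # "cachebox" is a made-up host name
    mkfs.ext3 /dev/nbd0
    mount /dev/nbd0 /mnt/fastdb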
Here's a crazy thought: Is there a way to make a server identify itself as a *harddrive* on the SCSI bus? If so, we could take a machine with lots of RAM which does nothing else than holding most of the DB in cache (cheaper and faster than Compact Flash:-) and occasionally write stuff to its hard drives. Probably won't need much of a CPU, and maybe not even SCSI drives, just cheap IDE ones, as all it does is caching.
Then, we could plug that thing into the "real" DB server, which just sees a really fast HD.
Well, one can dream...
I think I remember reading about projects fairly similar to this many years ago. However, I don't think they went anywhere. I imagine people realized that if it costs $10k to build a generic machine that can handle these disks, and a huge number of hours to configure it and get it running, then you're probably better off buying a more reliable $15k external disk system.
-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
On Mon, 12 Jan 2004, Magnus Manske wrote:
Here's a crazy thought: Is there a way to make a server identify itself as a *harddrive* on the SCSI bus? If so, we could take a machine with lots of RAM which does nothing else than holding most of the DB in cache (cheaper and faster than Compact Flash:-) and occasionally write stuff to its hard drives. Probably won't need much of a CPU, and maybe not even SCSI drives, just cheap IDE ones, as all it does is caching.
Then, we could plug that thing into the "real" DB server, which just sees a really fast HD.
Well, one can dream...
Wouldn't the "harddrive" server be equivalent to the local RAM cache, only accessed via the SCSI bus? it seems to me that it would work in the same fashion - retrieve articles from the DB, cache them in RAM, and send them over the SCSI bus - exactly what the actual server is doing. You're just adding the RAM caches together, one local, one over a SCSI bus. Double the RAM on the primary server, and you're better off :) (I'm assuming that OS and application RAM overhead are minimal, which is probably true on a 4GB+ machine).
Alfio
On Mon, 12 Jan 2004, Alfio Puglisi wrote:
Someone mentioned $6K for Apple raid storage, but that price is not so high: here at work we have some kind of raid NAS (half a terabyte I think) that went for about $4K. So yeah, NAS can be expensive, a simple internal
And it's certainly IDE based. You might have it connected to a SCSI card, but the drives are the same unreliable IDE drives everyone throws away every year. SCSI hardware RAID is, strictly speaking, unnecessary. The two Opterons in the box are far more powerful than any RAID card. Sure, hardware RAID is nice, but not really necessary with good drives from the get-go (i.e. they're all SCSI and we don't need some magic from a hardware RAID card to make IDE performance acceptable).
I currently have a $250 Eurologic fibre channel shelf (FC7 or FC9) with dual power supplies and seven 146GB Seagates ($300 each) heating my apartment :-) The shelf is easy to find. Wiki would want new drives vs. the OEM pulls I'm using, which doubles the cost of the drives (they aren't easy to find used anyway). That's just under 1TB. Each drive can stream ~60MB/s. Yes, EACH. (All the way to bus saturation: 2Gb/s.)
--Ricky
PS: I have 6 more of those shelves loaded with 18G drives... old NetApp filer disks.
On Jan 12, 2004, at 3:33 PM, Ricky Beam wrote:
On Mon, 12 Jan 2004, Alfio Puglisi wrote:
Someone mentioned $6K for Apple raid storage, but that price is not so high: here at work we have some kind of raid NAS (half a terabyte I think) that went for about $4K. So yeah, NAS can be expensive, a simple internal
And it's certainly IDE based. You might have it connected to a SCSI card but the drives are the same unreliable IDE drives everyone throws away every year. SCSI hardware RAID is, strictly speaking, unnecessary. The two opterons in the box are far more powerful than any RAID card. Sure, hardware is nice, but not really necessary with good drives from the get-go (i.e. they're all SCSI and we don't need some magic from a hardware RAID card to make IDE performance acceptable.)
I currently have a $250 eurologic fibre channel shelf (FC7 or FC9) with dual power supplies and 7 146G seagate (300$ each) heating my apartment :-) The shelf is easy to find. Wiki would want new drives vs. the OEM pulls I'm using which doubles the cost of the drives. (they aren't easy to find used anyway.) That's just under 1TB. Each drive can stream ~60MB/s. Yes, EACH. (all the way to bus saturation: 2Gb/s)
--Ricky
PS: I have 6 more of those shelves loaded with 18G drives... old NetApp filer disks.
I'm not saying you can't build a similar system for cheaper (possibly a couple thousand dollars cheaper). But, do we want to spend a lot of time building and debugging hardware? Do we want to be responsible for paying for any unforeseen hardware problems, or do we want to have some kind of support contract with someone? Imagine how much this Penguin Computing system would be costing us in time and money if we had to pay for all the replacement parts (especially when we are unsure where the problem lies.) Building your own hardware is great when you're desperate to save a few bucks, but if we've got the money, we might as well spend it on making things as reliable as possible. I think we would be doing a disservice to all those who made donations otherwise.
-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
On Mon, 12 Jan 2004, Nick Reinking wrote:
I'm not saying you can't build a similar system for cheaper (possibly a couple thousand dollars cheaper). But, do we want to spend a lot of time building and debugging hardware?
Uhhh, we're already doing that for a brand new server.
Do we want to be responsible for paying for any unforeseen hardware problems, or do we want to have some kind of support contract with someone? Imagine how much this Penguin Computing system would be costing us in time and money if we had to pay for all the replacement parts (especially when we are unsure where the problem lies.)
It'd cost exactly what it's currently costing. Everything in that server is new and thus supported both by the people who sold it to us and the people who made it. A pair of drives from Maxtor purchased 1.5 years ago were replaced within 48 hours from the time I requested the RMA. (It usually takes a week.)
Building your own hardware is great when you're desperate to save a few bucks, but if we've got the money, we might as well spend it on making things as reliable as possible. I think we would be doing a disservice to all those who made donations otherwise.
This isn't "penny pinching". The cost savings is enormous. And you're using hardware someone else has burned-in. Of course, you have to has some trust in the source of the gear.
If you want spankin' new, go price a shelf from IBM, DEC/Compaq, or Dell. A quick look at Dell shows a SCSI shelf with 14 36G 10k drives, dual 600W power supplies, rails, cables, etc. for ~$6k (ETA 8 days).
--Ricky
If you want spankin' new, go price a shelf from IBM, DEC/Compaq, or Dell. A quick look at Dell shows a SCSI shelf with 14 36G 10k drives, dual 600W power supplies, rails, cables, etc. for ~6k. (ETA 8days.)
Listen, I don't care who the heck we go through - I'll leave that up to Jimbo. I'd just like a system with dual fibre channel (or SCSI, as this system is). Whatever gives us the greatest bang for the buck - I'm not trying to turn this into a partisan battle. 14x36GB is 504GB. If we think that would be fine for the reasonable future, then that's not a bad deal (at $6565 with the current 10% discount). (although later expansion would involve replacing, not adding, drives).
Dell PowerVault SCSI:
- 14x36GB = 504GB
- Dual SCSI U320 ports
- 3yr support contract
- ?? on cache

Total: $7294 ($6565 with the discount that expires Jan 14th)
Total: $8001 for 7x73GB ($7201 with the current discount); this gives future expansion space

Apple Xserve RAID:
- 7x250GB = 1750GB (7200 RPM ATA drives)
- 256MB cache (128MB for each RAID subsystem)
- Dual fibre channel ports
- 3yr AppleCare support contract

Total: $8997
Total: $7497 for 4x250GB = 1000GB
In any case, nobody has even decided whether this is a good or bad idea. Is it wise to spend a lot of money on a fast disk system, or should we be buying bigger, beefier servers instead? As near as I can tell, the prices are pretty equal across the two boxes - the Dell would probably be a bit faster (lower latency), the Apple would have quite a bit more space. Penguin Computing also has storage gear, although they don't allow you to price it out on their website.
-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
On Mon, 12 Jan 2004, Ricky Beam wrote:
I currently have a $250 eurologic fibre channel shelf (FC7 or FC9) with dual power supplies and 7 146G seagate (300$ each) heating my apartment :-) The shelf is easy to find. Wiki would want new drives vs. the OEM pulls I'm using which doubles the cost of the drives. (they aren't easy to find used anyway.) That's just under 1TB. Each drive can stream ~60MB/s. Yes, EACH. (all the way to bus saturation: 2Gb/s)
Yes, if one needs throughput, IDE or SCSI will hardly matter, and the former will be cheaper. But it seems that our disk subsystem is hampered by latency - lots of small reads looking for DB indexes, articles, things that got pushed out from the cache, all of them from multiple concurrent threads. Short of a pure ramdisk, SCSI RAID is still the leader in this department :)
Alfio
Alfio Puglisi wrote:
former will be cheaper. But it seems that our disk subsystem is hampered by latency - lots of small reads looking for DB indexes, articles, things that got pushed out from the cache, all of them from multiple concurrent threads.
If things get pushed out of the cache, you need RAM, not SCSI disks. And you don't need 40 GB of RAM; you only need enough that things don't get pushed out of the cache so often. Today's ATA disks are far better than any SCSI disk of a few years ago. And only a few years further back, people used to say that SCSI was only a toy and you really needed 8, 12 or 14 inch Eagle disks, and stuff like that. A recommended read is the first chapter of "The Innovator's Dilemma".
This design-by-committee approach to hardware problems is so boring.
Nick Reinking wrote:
I'm not sure how running two copies of MySQL on a machine is going to be faster than running one. Plus, I don't think MySQL can even work in this configuration.
I am not saying running two copies per se will be faster. Just that running a copy from a database file in ramdisk will be much faster. One instance of Mysql whose data file is on the mechanical disk, one instance being a replication slave whose data is on the ram disk. Data served from the ram disk, not the hard disk.
I am almost certain multiple instances of MySQL will co-exist on the same machine. If not directly, then through a chrooted environment. Sockets can span chrooted environments using 'mount --bind'.
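A minimal sketch of that two-instance layout, with all paths, ports, and credentials invented for illustration. It also assumes the disk instance already has binary logging and a replication user set up, and everything under the tmpfs vanishes on reboot or power loss:

    # RAM-backed datadir for the read-serving slave instance
    mount -t tmpfs -o size=3g tmpfs /var/lib/mysql-ram
    cp -a /var/lib/mysql/. /var/lib/mysql-ram/      # seed from a consistent snapshot

    # Second mysqld on its own port and socket (the disk instance keeps the defaults)
    mysqld_safe --datadir=/var/lib/mysql-ram --socket=/tmp/mysql-ram.sock \
                --port=3307 --server-id=2 &

    # Make the RAM instance replicate from the disk instance and serve the reads
    mysql -S /tmp/mysql-ram.sock -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1',
        MASTER_PORT=3306, MASTER_USER='repl', MASTER_PASSWORD='secret'; START SLAVE;"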
Nick Hill wrote:
I'm not sure how running two copies of MySQL on a machine is going to be faster than running one. Plus, I don't think MySQL can even work in this configuration.
I am not saying running two copies per se will be faster. Just that running a copy from a database file in ramdisk will be much faster. One instance of Mysql whose data file is on the mechanical disk, one instance being a replication slave whose data is on the ram disk. Data served from the ram disk, not the hard disk.
I think others have said this, but I think you're behind the times when you talk about a "ramdisk". We're not running DOS here. Linux will already cache as much as it can in memory, and in a sensible and automatic way.
But realistically speaking, this is not a bottleneck worth trying to solve. When it's working, the db server is very fast, and already holds everything that it needs in memory, as far as I know.
--Jimbo
Jimmy Wales wrote:
I think others have said this, but I think you're behind the times when you talk about a "ramdisk". We're not running DOS here. Linux will already cache as much as it can in memory, and in a sensible and automatic way.
But realistically speaking, this is not a bottleneck worth trying to solve. When it's working, the db server is very fast, and already holds everything that it needs in memory, as far as I know.
I understood the server was I/O bound from a recent posting. Server maxed out with low CPU utilisation. The information you and I have differs. You are in a position to have better information on this than I. I certainly don't want to waste time solving a non-existent problem.
Do we have a bottleneck? If so, where is it and how do we know?
Nick Hill wrote:
I understood the server was I/O bound from a recent posting. Server maxed out with low CPU utilisation. The information you and I have differs. You are in a position to have better information on this than I. I certainly don't want to waste time solving a non-existent problem.
Yes, well, you weren't completely misinformed, but just missed one crucial detail. :-) We are currently I/O bound on the db server because the one that we're using sucks. It's what we are limping by on until the new hardware is installed, or until I manage to find something for Brion to put into service.
Do we have a bottleneck? If so, where is it and how do we know?
Bigger picture, I think that the configuration that we've discussed is the right way to go, and no exotic hardware configurations are needed to serve the website quickly, effectively, and reliably.
squids->webservers->db servers
with auxiliary servers for other jobs.
I'm putting the finishing touches on an order at Silicon Mechanics right now, and I'll report back to everyone once I have all the details and a firm delivery date promise.
--Jimbo
On Tue, 13 Jan 2004 05:46:15 -0800, Jimmy Wales wrote:
I'm putting the finishing touches on an order at Silicon Mechanics right now, and I'll report back to everyone once I have all the details and a firm delivery date promise.
Jimbo-
how much RAM are you going to order for the Squids? I've read that it's possible to address more than 4 gigs through highmem on x86, but only 3 gigs per process. Squid is a single-threaded asynchronous app, so this is a limit. Is there a major price penalty for a single-processor 64-bit CPU? Having enough RAM for all content on the Squids is a good thing for both speed and reliability - not an immediate problem for now, but it might become one as Wikipedia grows. On the other hand, x86 Squids could be converted to Apaches later if they are the same kind of machine.
The RAM issue probably also applies to the DB server, but I guess you'll order 64-bit anyway for that.
On Tue, Jan 13, 2004 at 03:06:29PM +0100, Gabriel Wicke wrote:
On Tue, 13 Jan 2004 05:46:15 -0800, Jimmy Wales wrote:
I'm putting the finishing touches on an order at Silicon Mechanics right now, and I'll report back to everyone once I have all the details and a firm delivery date promise.
Jimbo-
how much RAM are you going to order for the Squids? I've read that it's possible to address more than 4 gigs through highmem on x86, but only 3 gigs per process. Squid is a single-threaded asynchronous app, so this is a limit. Is there a major price penalty for a single-processor 64-bit CPU? Having enough RAM for all content on the Squids is a good thing for both speed and reliability - not an immediate problem for now, but it might become one as Wikipedia grows. On the other hand, x86 Squids could be converted to Apaches later if they are the same kind of machine.
The RAM issue probably also applies to the DB server, but I guess you'll order 64-bit anyway for that.
I would hazard a guess that 1GB, maybe 2GB is all that should be needed. That should provide plenty of RAM for disk caching. Linux and Squid are pretty good about that. No need to spend a lot of money on 64-bit machines for the Squid servers.
== Access patterns ==
I did some stats ages ago, showing an approximately Zipfian distribution for page accesses. A bit of calculation shows that for small numbers of articles, this means a small amount of cache will give a large performance boost. However, as the number of articles increases, the vast majority of seldom-accessed articles will start to dominate the behavior of the system for article fetches. Thus, RAM caching will decrease in usefulness over time as the project progresses, unless the RAM cache is close to the size of the entire working set.
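To put rough numbers on that (the article count and the pure Zipf exponent of 1 are assumptions for illustration, not measurements):

    awk 'BEGIN {
      M = 200000;                                   # assumed total article count
      HM = 0; for (i = 1; i <= M; i++) HM += 1/i;   # harmonic number H(M)
      for (n = 1000; n <= 100000; n *= 10) {
        Hn = 0; for (i = 1; i <= n; i++) Hn += 1/i;
        printf "cache the hottest %6d articles -> ~%.0f%% of requests served from cache\n",
               n, 100 * Hn / HM;
      }
    }'

Under these assumptions the hottest 1,000 articles already cover roughly 60% of requests, but each further jump in coverage costs far more cache, which is the long-tail effect described above.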
Nick's suggestion of tuning the filing system page size to the article size is a good idea; it will tend to make the RAM cache which is currently available more effective. I'm rather dubious about some of his other suggestions.
Where RAM caching is really important is in the "hot" data such as article timestamps and link tables. These have already been partially addressed by the use of memcached, I believe. These commonly accessed pieces of data should be small enough to keep in RAM all the time, giving a large speedup to the system.
== Seek bound performance ==
Since disk I/O requests are effectively random, the load will be dominated by seek and rotational latency. It will cost very nearly the same to pick 64kbytes off the disk for an article as to get 4 bytes for a timestamp.
Using high-performance disks and spreading the database across many RAID spindles should greatly increase performance.
I agree with the posters who are arguing for software RAID: it has higher performance than hardware RAID in many cases, and again, we can fine-tune stripe sizes etc. to our application. (Big stripe sizes are a bad idea for random-seek loads, but give better performance for streaming loads). We should also consider kernel 2.6: there are major gains in disk I/O performance in this kernel, and most of the teething troubles are not related to server issues.
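A minimal software-RAID sketch along those lines; the device names, RAID level, and 64KB chunk size are placeholders, the chunk size being exactly the parameter suggested above for tuning:

    # Four-disk RAID-10 with an explicit chunk (stripe unit) size, then move the
    # database directory onto it.  Device names are illustrative only.
    mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=64 \
          /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
    mkfs.ext3 /dev/md0
    mount /dev/md0 /var/lib/mysql
    cat /proc/mdstat                 # watch the initial resync and array health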
== Not all disks are equal ==
Consider buying the disks specifically by access time statistics. In particular, high-performance SCSI disks should greatly out-perform IDE for random seek access patterns, even though their performance may be roughly the same for data streaming. SCSI command tagging will further increase performance, where there is concurrency on a single spindle.
See http://www.storagereview.com/php/benchmark/bench_sort.php for some interesting stats:
* a Fujitsu MAS3735 has an average read access time of 5.6ms, for a price of $700 for 73 GB.
* a Hitachi Deskstar 7K250 has an average read access time of 12.1ms, for a price of $250 for 250 GB.
* a Seagate U6 has an average read access time of 20.0ms, for a price of ??? for 80 GB.
According to this, if performance is dominated by read access time, the most expensive drive should have almost four times the random-read performance of the cheapest, all else being equal.
Using price and performance figures such as those above, we should be able to calculate the best price/performance/storage compromise for this application.
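For instance, a crude reads-per-second-per-dollar comparison of the two drives with known prices (this ignores caching, queueing, and transfer time, so it is only indicative):

    awk 'BEGIN {
      # fields: name, average access time in ms, price in dollars
      n = split("Fujitsu_MAS3735:5.6:700 Hitachi_7K250:12.1:250", d, " ");
      printf "%-18s %12s %12s\n", "drive", "reads/sec", "reads/sec/$";
      for (i = 1; i <= n; i++) {
        split(d[i], f, ":");
        iops = 1000 / f[2];          # one random read per average access time
        printf "%-18s %12.1f %12.3f\n", f[1], iops, iops / f[3];
      }
    }'

By this crude measure the fast SCSI drive wins per spindle while the cheap drive wins per dollar, which is why the number of spindles matters as much as the choice of drive.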
== The Google strategy for article caching ==
Google seem to use a large number of RAM-based cache servers, based on the observation that network access latency on a small network is tiny, but disk latency is large. This does not make any sense for us now: we don't have the resources, unless Google open-source their Google filesystem.
For future expansion, it might be cheaper to buy 10 4Gbyte RAM commodity machines than one 40 Gbyte enterprise-class machine, and spread the load across them. Although this would still be costly, the performance of serving data directly from RAM would be very high.
-- Neil
On Tue, Jan 13, 2004 at 03:31:29PM +0000, Neil Harris wrote:
== Not all disks are equal ==
Consider buying the disks specifically by access time statistics. In particular, high-performance SCSI disks should greatly out-perform IDE for random seek access patterns, even though their performance may be roughly the same for data streaming. SCSI command tagging will further increase performance, where there is concurrency on a single spindle.
See http://www.storagereview.com/php/benchmark/bench_sort.php for some interesting stats:
- a Fujitsu MAS3735 has an average read access time of 5.6ms, for a price of $700 for 73 GB.
- a Hitachi Deskstar 7K250 has an average read access time of 12.1ms, for a price of $250 for 250 GB.
- a Seagate U6 has an average read access time of 20.0ms, for a price of ??? for 80 GB.
Also, it should be noted that if we want to extend the usefulness of current machines, there are some 10K Western Digital IDE drives (basically SCSI drives with an IDE interface). They have an average read access time of 8.3ms (the WD740GD does, at least), and it is $260 from Newegg for the same size (74GB). It also has a built-in PATA-to-SATA bridge, so it supports hot-swapping and command queueing.
On Tue, 13 Jan 2004, Nick Reinking wrote:
Also, it should be noted that if we want to extend the usefulness of current machines, there are some 10K Western Digital IDE drives (basically SCSI drives with an IDE interface). They have an average read access time of 8.3ms (the WD740GD does, at least), and is $260 from Newegg for the same size (74GB). It also has a built-in PATA-to-SATA bridge, so it supports hot-swapping and command queueing.
Indeed. Those drives are "bastard stepchildren"... it's a SCSI servo with IDE controller electronics tied to a SATA bus via a built-in bridge. It's a f'ing IDE drive. Stay away from it. If you want SATA (read: something cheaper than SCSI), buy real, native SATA drives which only Seagate are currently making. (which exact models, I don't know.)
Haven't we already had the "cheap is not the goal" mini-flame? If you want fast and reliable, it's gonna take some green. If you want cheap, it'll be less reliable and slow. I would personally recommend a FC array deployed under LVM2 with XFS riding on top. The array is physically expandable to 127 drives. And the drives are really fast, esp. when you have 100 of them in parallel :-)
--Ricky
On Mon, Jan 12, 2004 at 08:23:33PM +0000, Nick Hill wrote:
This would be cured by moving the database to solid state memory. Mechanical media has a very long seek time and when many seeks are required, become unreliable.
Either: Run two MySQL instances, one starting after the finest grained database files have been copied to ramdisk. Replicate database to a hard disk file.
Or: install a solid-state IDE disk. For example, a 4GB Compact Flash card has an IDE interface built in as part of the specification. Access time is 0.1ms compared to a mechanical drive's 8.5ms: 85x faster. CF-to-IDE cables are trivial and available.
To put it another way, you would need 85 mechanical drives to provide the seek performance of a solid state equivalent.
Uh, except for that wondrous innovation known as caching...
Ramdisks have no place beyond MS/PC-DOS, Knoppix, and OS install floppies. Why dedicate a gob of memory to a ram disk when you can achieve similar performance with the OS's caching facility while also allowing that memory to be conscripted for other uses should the need arise?
Flash devices have rather serious write-cycle limitations. As mysql does not do anything to avoid hot spots, such a setup would likely not last long.
As to the Apple RAID boxes, my investigations a while back indicated that they are not really redundant. Each box is two separate RAID devices, so you have to do software RAID between them if you want to avoid a single point of failure. This may or may not be a problem, but it is not explicitly mentioned in the Apple literature. I'm not sure this is a deal breaker, but it is certainly something to think about.
This is my understanding of how it works:
There are two RAID devices in each box. Each device supports up to 7 drives, and can use RAID 0, 1, 3, 5, 0+1, 10, 30, and 50 with these seven disks. Now, if you want to use the entire 14 in one gigantic RAID setup, you can do that with software RAID (you can do RAID 10, 30, or 50 across the two RAID systems).
This is not actually terribly optimal. It is best to use the two RAID systems separately. If you want a gigantic file system, you can use LVM and pull both RAID systems in (probably running RAID 5). However, the way I really think it would be most useful would be if we had two database servers: one would run en, and one would run the international wikis. Since the RAID system has two fibre channel ports, you can give each system its own external RAID hardware subsystem in the future (so both could have up to 1.75TB of storage at their disposal).
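A sketch of that LVM route; the two device names stand in for however the box's two RAID controllers would show up on the fibre channel bus, and the sizes are arbitrary:

    # Join the two 7-disk halves (one LUN each) into one volume group and
    # stripe the logical volume across both controllers.
    pvcreate /dev/sdb /dev/sdc
    vgcreate xraid /dev/sdb /dev/sdc
    lvcreate -i 2 -I 64 -L 400G -n db xraid   # -i 2: stripe over both PVs, 64KB stripes
    mkfs.ext3 /dev/xraid/db
    mount /dev/xraid/db /var/lib/mysql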
In the near term, just using half the machine (7 disks, just one RAID controller) would still make a huge improvement, I think. It's not the cheapest solution, but it is a lot more reliable than hoping the disks in the servers don't croak, and it gives us the ability to move the database around fairly easily if a server should die (just plug it into a different machine).
-- Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
On Jan 12, 2004, at 2:35 PM, audin@okb-1.org wrote:
As to the Apple RAID boxes, my investigations a while back indicated that they are not really redundant. Each box is two separate RAID devices, so you have to do software RAID between them if you want to avoid a single point of failure. This may or may not be a problem, but it is not explicitly mentioned in the Apple literature. I'm not sure this is a deal breaker, but it is certainly something to think about.
audin@okb-1.org wrote:
Ramdisks have no place beyond MS/PC-DOS, Knoppix, and OS install floppies. Why dedicate a gob of memory to a ram disk when you can achieve similar performance with the OS's caching facility while also allowing that memory to be conscripted for other uses should the need arise?
Disk caching tends to use a least-recently-used (LRU) algorithm. When larger blocks of data are read from storage, the least recently read fine-grained data is flushed.
The cost of reading fine grained data is very high. This cost can be avoided by keeping fine grained data in ramdisk, unless some prioritising can be set up to avoid the fine grained data being flushed for much longer than the coarse grained data.
A 1MB piece of data may take seek + transfer = I/O cost: 8.5ms + 20ms = 28.5ms
A 10KB piece of data may take 8.5ms + 0.2ms = 8.7ms
We can calculate this in terms of I/O blocking cost per given data size:
10KB chunks (fine grained, e.g. article database) cost 870ms/MB
1MB chunks (coarse grained, e.g. program code, media files) cost 28.5ms/MB
Fine grained data is far more I/O costly than coarse grained data.
Is there another way to prioritise the fine grained data other than ramdisk? If so, this may be the key to a far more efficient system.
Flash devices have rather serious write-cycle limitations. As mysql does not do anything to avoid hot spots, such a setup would likely not last long.
We would need to know how hot the hotspots tend to be before implementing such a solution. Typical write endurance on modern flash tends to be 1M-10M cycles.
how about the cost?
--- Nick Hill nick@nickhill.co.uk wrote:
This would be cured by moving the database to solid state memory.
On Mon, 12 Jan 2004 11:11:28 -0800, Brion Vibber wrote:
If Geoffrin's not going to be up soon, and Pliny's still emitting spurious disk errors, what are our options out of the available machines?
How about installing Squid on one of the machines? That would take a fair amount of load away. Is there a machine with some free RAM available? Even installing Squid on larousse would do, I guess. I've glanced over the PHP code - there are mainly two header lines we would need to change to activate this - we could start off with a 30-minute timeout for anonymous users. Purging should be ready soon as well.
We can configure Squid depending on the available Ram- especially maximum object size in Ram and replacement policy. This should catch most small (=compressed html) files.
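The relevant squid.conf knobs would be something along these lines; the sizes are guesses to be tuned to the box's RAM, and the heap policies need a Squid built with --enable-removal-policies=heap,lru:

    # Illustrative squid.conf fragment for a memory-heavy cache box
    cat >> /etc/squid/squid.conf <<'EOF'
    cache_mem 512 MB                        # RAM devoted to hot objects
    maximum_object_size_in_memory 64 KB     # keep small (compressed html) pages in RAM
    memory_replacement_policy heap GDSF     # favour many small popular objects in RAM
    cache_replacement_policy heap LFUDA     # on-disk policy for larger, older objects
    cache_dir ufs /var/spool/squid 4096 16 256
    EOF
    squid -k reconfigure                    # tell the running Squid to reload its config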
Let me know if you need help; I could help with that tomorrow.
On Mon, 12 Jan 2004, Gabriel Wicke wrote:
On Mon, 12 Jan 2004 11:11:28 -0800, Brion Vibber wrote:
If Geoffrin's not going to be up soon, and Pliny's still emitting spurious disk errors, what are our options out of the available machines?
How about installing Squid on one of the machines? That would take a fair amount of load away. Is there a machine with some free Ram available?
It seems a good idea. It would eliminate the DB timestamp lookup for each page request. But the purging code must be implemented, and that would take time.
Alfio
On Mon, 12 Jan 2004 23:26:02 +0100, Alfio Puglisi wrote:
On Mon, 12 Jan 2004, Gabriel Wicke wrote:
On Mon, 12 Jan 2004 11:11:28 -0800, Brion Vibber wrote:
If Geoffrin's not going to be up soon, and Pliny's still emitting spurious disk errors, what are our options out of the available machines?
How about installing Squid on one of the machines? That would take a fair amount of load away. Is there a machine with some free Ram available?
It seems a good idea. It would eliminate the DB timestamp lookup for each page request. But the purging code must be implemented, and that would take time.
Should take an hour maybe- had a look at it today. The Squid could even get installed before this, we would just need to set a relatively short timeout for anonymous users. That's five lines of php to change.
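For reference, the purge itself is just an HTTP PURGE request per affected URL; something like the following, where the Squid host name is a placeholder and Squid needs an ACL permitting PURGE from the web servers:

    # squid.conf must allow the method first, e.g.:
    #   acl purge method PURGE
    #   acl wikiservers src 10.0.0.0/24
    #   http_access allow purge wikiservers
    #   http_access deny purge
    # Then the on-save hook fires one request per cached URL:
    squidclient -h squid.example.org -m PURGE 'http://en.wikipedia.org/wiki/Article_title'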
Gabriel Wicke wrote:
How about installing Squid on one of the machines? That would take a fair amount of load away. Is there a machine with some free Ram available? Even installing Squid on larousse would do i guess. I've glanced over the php code- there are mainly two header lines we would need to change to activate this- we could start off with a 30 minute timeout for anonymous users. Purging should get ready soon as well.
Perhaps I will be burned at the stake as a heretic for this, but I am not convinced squid proxies are the answer.
The delays in the wikiserver system are caused by waiting for I/O- the time taken for mechanical devices to seek a particular block of data. If the data is being served from a squid cache rather than from a cache on the wiki server, how will this reduce the overall I/O blocking problem? The busiest page data won't substantially add to I/O blocking on the wiki server as it will likely be in memory all the time. The squid proxy is ideal to solve the problem of network load from commonly accessed pages or pages which demand a lot of CPU power to generate but this is not a problem on wikipedia. If Squid proxies are being implemented to increase performance, then they are the right solution to the wrong problem. If they are to increase reliability by adding redundancy - multiple data sources-, they do this to a degree but are far from ideal.
The most commonly used pages are going to be in the memory of the database server so these are not costly to serve. The costly pages to serve are those which need disk seeks to serve. The more I/O seek operations a page requires, the more costly it is to serve.
The proxy server will need to make a database lookup (for the URL) and, unless the page is in memory rather than on-disk storage, use I/O to reach the fine grained data. The data for each unique URL will be bigger than that held in cache on the database server as it will contain html formatting and other page data. The likelihood of the data being in the memory of a proxy server is lower than the data being in memory of a similarly equipped database server as the data size of the final HTML page will be ~7.5k bigger than that of the database data.
If performance is the criterion, I suggest a proxy isn't a good idea. Instead, the memory otherwise used in a proxy would be better utilised caching database data directly, either as a ramdisk or perhaps as network-attached database storage with plenty of solid-state memory.
From what I have gathered, the cost (limiting factor to performance) is that of delays seeking fine grained data. Either this seek load will need to be spread across many mechanical devices such that the work is not unduly duplicated, or store the fine grained data in solid state storage so that it can be seeked quickly.
On Jan 12, 2004, at 16:07, Nick Hill wrote:
From what I have gathered, the cost (limiting factor to performance) is that of delays seeking fine grained data. Either this seek load will need to be spread across many mechanical devices such that the work is not unduly duplicated, or store the fine grained data in solid state storage so that it can be seeked quickly.
As a reminder, Geoffrin (the opteron box) is *perfectly fine* at this and handles the database load admirably. It's just out of service and replaced by a box (Ursula) with a hideously slow drive at the moment because that's what was available to get back online with.
The medium-term plan is simply to get Geoffrin back online, and to get _some_ machine with decently fast drives to serve as a replicated hot backup.
The ideas about squid caches etc are not about lightening the load on the database server (which it wouldn't really do except insofar as it may cache more stuff than our present on-web-server caching), but about lightening and spreading out the load on the web servers. Squid caches will *not* help the immediate question here to a significant degree; anyway no more than making slight alterations to the present caching code to avoid checking timestamps in some cases would.
Our present alternatives for database duty are:
* Pliny, which has done it in the past. Exhibiting intermittent errors on primary drive, and crashed a couple times when it ran the database again in late December, which is why we took it back off to Ursula.
* Carol (currently idle) with a SCSI drive that's too small for the whole database.
* Susan (currently idle) with another IDE drive that's likely not the fastest.
Unless somebody's got a clearer suggestion, I'll be following JeLuF's advice and in a few hours moving some of the more heavily trafficked European languages to one of these spare boxes to attempt to split the load on Ursula's poor drive.
-- brion vibber (brion @ pobox.com)
On Mon, 12 Jan 2004 16:23:30 -0800, Brion Vibber wrote:
The ideas about squid caches etc are not about lightening the load on the database server
Ahem.
(which it wouldn't really do except insofar as it may cache more stuff than our present on-web-server caching), but about lightening and spreading out the load on the web servers. Squid caches will *not* help the immediate question here to a significant degree; anyway no more than making slight alterations to the present caching code to avoid checking timestamps in some cases would.
Our present alternatives for database duty are:
- Pliny, which has done it in the past. Exhibiting intermittent errors
on primary drive, and crashed a couple times when it ran the database again in late December, which is why we took it back off to Ursula.
- Carol (currently idle) with a SCSI drive that's too small for the
whole database.
Doesn't matter, that's what a replacement algo is for. There are much more sophisticated ones than LRU to pick from- both for ram and for disk.
- Susan (currently idle) with another IDE drive that's likely not the
fastest.
If these two servers have some RAM they'll be fine. We could even disallow disk caching altogether, only using their RAM.
On Jan 12, 2004, at 16:23, Brion Vibber wrote:
Our present alternatives for database duty are:
As a reminder, we're not talking here about future machines to get, but about what's in the racks and online *right now*.
- Carol (currently idle) with a SCSI drive that's too small for the
whole database.
Ran memtester and managed to kill the poor dear. Carol initially had an overheating problem and crashed soon after installation. Jason fiddled with the case and placement to improve airflow, but unfortunately this may not be enough. :(
Scratch Carol off the list for now. Carol's also got 4GB of RAM which would be _very_ nice to have up and running, assuming it all works!
- Susan (currently idle) with another IDE drive that's likely not the
fastest.
Susan hasn't died during memory testing yet (yay). However it's a smaller machine with only 512MB of memory as well as the ide disk. Not ideal for database, though this might be sufficient to offload some work.
I won't touch the databases just yet...
-- brion vibber (brion @ pobox.com)
On Mon, 12 Jan 2004 20:16:27 -0800, Brion Vibber wrote:
Scratch Carol off the list for now. Carol's also got 4GB of RAM which would be _very_ nice to have up and running, assuming it all works!
Are there unused RAM slots on Ursula? If we could use some of this RAM we should be set, I guess. Geoffrin also has 4 gigs and didn't seem to swap badly. Might be incompatible RAM though, don't know.
- Susan (currently idle) with another IDE drive that's likely not the
fastest.
Susan hasn't died during memory testing yet (yay). However it's a smaller machine with only 512MB of memory as well as the ide disk. Not ideal for database, though this might be sufficient to offload some work.
Hm- could we stick some of Carol's ram into this one?
On Tue, 13 Jan 2004 00:07:50 +0000, Nick Hill wrote:
Gabriel Wicke wrote:
The delays in the wikiserver system are caused by waiting for I/O- the time taken for mechanical devices to seek a particular block of data. If the data is being served from a squid cache rather than from a cache on the wiki server, how will this reduce the overall I/O blocking problem?
If we have an old machine, Squid will serve anything cached straight from memory (small objects) or disk (images) without ever contacting the database. That's a speedup of at least 50x over the current disk cache with DB lookup etc. The bigger the RAM for the Squid the better of course, but 500MB will already hold a lot of compressed html.
The most commonly used pages are going to be in the memory of the database server so these are not costly to serve. The costly pages to serve are those which need disk seeks to serve. The more I/O seek operations a page requires, the more costly it is to serve.
Yup. So lets avoid them.
The proxy server will need to make a database lookup (for the URL)
Nope. Only if a page is *not* in the cache or marked as not cacheable.
If performance is the criteria, I suggest a proxy isn't a good idea.
Well- please read up some docs. Or benchmark http://www.aulinx.de/ - commodity server (Celeron 2Ghz) running Squid.
On Jan 12, 2004, at 17:01, Gabriel Wicke wrote:
If we have an old machine Squid will serve anything cached straight from memory (small objects) or disk (images) without ever contacting the database. That's a speedup of at least 50x over the current disk cache with DB lookup etc.
...which would only work as such if we make changes to the wiki which are roughly identical to the changes we'd have to make to the present caching system to avoid hitting the database to confirm timestamps. (ie, explicit purging of cached pages that are no longer valid)
Squid may be a great caching system, but for purposes of this discussion it's not really different from what we've got.
-- brion vibber (brion @ pobox.com)
On Mon, 12 Jan 2004 17:15:38 -0800, Brion Vibber wrote:
Squid may be a great caching system, but for purposes of this discussion it's not really different from what we've got.
Believe me that Squid is *way* faster than any PHP caching system will ever be - with modification date checking or not. Especially at serving whatever fits into RAM. Please benchmark http://www.aulinx.de/. I did from localhost - Squid is serving more than 700 requests per second at 55% CPU, with 46% taken up by siege. en2.wikipedia did about 8 requests per second when I checked. Details at http://www.aulinx.de/oss/code/wikipedia/, about 1/3 down.
If you don't want to know or check- then fine. I'm just offering help.
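For anyone who wants to reproduce that, a typical siege run looks something like this; the concurrency and duration are arbitrary:

    # 25 concurrent simulated users hammering the front page for 30 seconds,
    # in benchmark mode (no think-time between requests)
    siege -b -c 25 -t 30S http://www.aulinx.de/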
On Jan 12, 2004, at 17:48, Gabriel Wicke wrote:
On Mon, 12 Jan 2004 17:15:38 -0800, Brion Vibber wrote:
Squid may be a great caching system, but for purposes of this discussion it's not really different from what we've got.
Believe me that Squid is *way* faster than any php caching system will ever be- with modification date checking or not.
This thread is about disk access on the temporary database server.
No matter how insanely fast the caching is, it doesn't happen on the database server. Superduperfast squid caching that doesn't touch the database server and kinda speedy php caching that doesn't touch the database server are exactly equal in the eyes of the database server.
Lemme put it this way: you're talking about tuning up the engine on a car, when we're trying to discuss how the tires all are flat and the parking brake is stuck on. The car's not going to go very fast in this condition no matter how much the engine is improved.
-- brion vibber (brion @ pobox.com)
On Mon, 12 Jan 2004 18:06:43 -0800, Brion Vibber wrote:
This thread is about disk access on the temporary database server.
No matter how insanely fast the caching is, it doesn't happen on the database server. Superduperfast squid caching that doesn't touch the database server and kinda speedy php caching that doesn't touch the database server are exactly equal in the eyes of the database server.
Sorry - I misunderstood you then (I admit I have only skimmed the thread). But still, the latter doesn't provide a timeout and is slower overall. Switching off time checking is really quick to do of course, so why not.
I'll try to do the purge script tomorrow, then we could use the idle machines as static caches without timeout (and switch the modification time check on again).
Gabriel Wicke wrote:
On Tue, 13 Jan 2004 00:07:50 +0000, Nick Hill wrote:
The most commonly used pages are going to be in the memory of the database server so these are not costly to serve. The costly pages to serve are those which need disk seeks to serve. The more I/O seek operations a page requires, the more costly it is to serve.
Yup. So lets avoid them.
Given that popular articles will be in the database memory cache, requests for popular articles should not lead to database HDD seeking.
I would expect a Squid proxy to be best at serving popular pages and poor at serving less popular pages, so I can't imagine how Squid is very helpful at saving HDD seeks.
The proxy server will need to make a database lookup (for the URL)
Nope. Only if a page is *not* in the cache or marked as not cacheable.
I meant the Squid server will need to look up its own database (in whatever form that may be - filesystem or indexed DBMS) to check whether it has a copy of the required data, using the URL as a key. If it has a copy of the data, it will need to pull it either out of memory or from the disk. If not, it will forward the request and then add the page to its own database.
If the squid server needs to pull an article page off the disk, then disk I/O will be required in the same way as it would be required if an uncached piece of data is read from the database by the web server.
As I/O is the bottleneck, the squid server is likely to suffer the same problems as the underlying database server. These problems are likely to be worse, since the chunks of fine-grained data handled by Squid (compressed, fully formed HTML pages) are larger than the fine-grained chunks (article text) handled by the database server.
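(One way to check how often Squid actually touches its disk store rather than RAM would be its cache manager stats, roughly as below; the mgr: page names are as I remember them, and localhost access to the cache manager has to be allowed in squid.conf.)

squidclient mgr:info | grep -i 'hit'     # request and memory hit ratios
squidclient mgr:storedir                 # on-disk store usage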
If performance is the criterion, I suggest a proxy isn't a good idea.
Well, please read up on the docs, or benchmark http://www.aulinx.de/: a commodity server (Celeron 2 GHz) running Squid.
I am not contending that Squid is not a very high-performance server. I believe it is, and that it can substantially reduce the bandwidth ISPs need to serve web surfers.
The issue for wikipedia is how many disk accesses, in total, are needed for each article hit.
Wikipedia has millions of discrete pieces of data, most of them referenced individually by a unique URL. Squid will not be able to hold a substantial proportion of these in memory. Squid can hold fewer of these chunks (articles / HTML pages) in a given amount of memory than a database server could, since the article text stored in the database is smaller than the article rendered in HTML. For the larger articles, the compressed HTML page will be smaller than the stored text, but for most articles the compressed HTML page will be bigger. (The relative weight of the page HTML is much greater for short articles than for long ones; compression reduces page size by about half.)
I assume viewing an article history page requires several pieces of information, leading to multiple seeks per request. If Squid were able to serve article histories, then a single I/O on the Squid box could save several seeks on the database server, a substantial economy. However, requests for any individual page history are fairly rare, and forcing a Squid cache reload on an article history page whenever the article is updated may be a poor use of resources.
I suggest four avenues for investigation:

1) Store articles in the MySQL table in compressed (gzip) format. This will reduce the size of the articles, making them fit more easily into the available cache memory and increasing the chances of a cache hit almost by a factor of two. Perhaps this can be made as a patch to MySQL. (A rough way to measure the likely gain is sketched after this list.)

2) Investigate ways of prioritising data cached in memory such that smaller chunks have a higher value than larger ones, so that smaller chunks are not flushed purely by the basic least-recently-used algorithm. This would reflect the relative cost of reading a small chunk of data from the HDD.

3) If the SQL code underlying Wikipedia relies on temporary tables as part of its queries, investigate whether the I/O of writing temporary tables tends to flush data from the disk cache. If so, write temporary tables to a ramdisk or other storage which does not cause flushing. More recent versions of MySQL support sub-queries, which may obviate the need for temporary tables.

4) Judicious use of solid state storage. This could dramatically reduce seek times and the I/O bottleneck. There are issues to resolve regarding flash memory durability, possible MySQL hotspots, and the cost of mass solid state storage, but it might be worthwhile for some wiki data.
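(A crude way to estimate the gain from 1): dump a sample of revision text and compare raw vs gzipped size. The table, column and database names below assume the current cur/old schema and a db called wikidb; add -u/-p as needed.)

# small, non-random sample; good enough for a ballpark ratio
mysql -N -e 'SELECT old_text FROM old LIMIT 200;' wikidb > /tmp/sample.txt
wc -c /tmp/sample.txt                 # raw size
gzip -9 -c /tmp/sample.txt | wc -c    # compressed size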
Another important point which occurs to me;
Article size on Wikipedia varies dramatically but averages about 1400 bytes. The size of entries in the old table (the updates) is likely to average even smaller.
By default, ext2/ext3 filesystem blocks tend to be 4k in size. I haven't yet investigated how big cached disk pages are in the Linux kernel: whether they mirror the size of filesystem blocks, are a multiple of it, or are fixed in number and vary in size with available memory, etc.
If the cache page size can be tuned to the data atom size (which I feel it can, even if this needs a header value change plus a recompile), that is, tuned to the median data size of articles, the number of popular articles held in a given amount of RAM could possibly be increased several fold. This will have a cost, perhaps not very big, in scheduler workload and, if the pages are very small, in the memory the kernel needs to administer a massive array of cache pages.
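(If it helps the investigation: as far as I know the Linux page cache works in PAGE_SIZE units, 4k on i386, independent of the filesystem block size; the filesystem block size itself is fixed at mkfs time. A quick check, with the device name only as an example:)

# show the block size the db partition was created with
tune2fs -l /dev/hda3 | grep -i 'block size'
# smaller blocks can only be chosen when (re)making the filesystem, e.g.
#   mke2fs -j -b 1024 /dev/hda3     (destroys existing data!)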
This solution would, nevertheless, be less optimal than a suitable solid state storage solution for the busiest wikis.
At 00:07 13/01/2004 +0000, you wrote:
Gabriel Wicke wrote:
How about installing Squid on one of the machines? That would take a fair amount of load away. Is there a machine with some free RAM available? Even installing Squid on larousse would do, I guess. I've glanced over the PHP code; there are mainly two header lines we would need to change to activate this. We could start off with a 30 minute timeout for anonymous users. Purging support should be ready soon as well.
Perhaps I will be burned at the stake as a heretic for this, but I am not convinced squid proxies are the answer.
You should not be burnt
The delays in the wikiserver system are caused by waiting for I/O- the time taken for mechanical devices to seek a particular block of data. If the data is being served from a squid cache rather than from a cache on the wiki server, how will this reduce the overall I/O blocking problem?
Agreed
The busiest page data won't substantially add to I/O blocking on the wiki server, as it will likely be in memory all the time. A Squid proxy is ideal for solving the problem of network load from commonly accessed pages, or of pages which demand a lot of CPU power to generate, but neither is a problem on Wikipedia. If Squid proxies are being implemented to increase performance, then they are the right solution to the wrong problem. If they are there to increase reliability by adding redundancy (multiple data sources), they do this to a degree but are far from ideal.
The most commonly used pages are going to be in the memory of the database server so these are not costly to serve. The costly pages to serve are those which need disk seeks to serve. The more I/O seek operations a page requires, the more costly it is to serve.
The proxy server will need to make a database lookup (for the URL) and, unless the page is in memory rather than on-disk storage, use I/O to reach the fine-grained data. The data for each unique URL will be bigger than that held in cache on the database server, as it will contain HTML formatting and other page data. The likelihood of the data being in the memory of a proxy server is lower than of it being in the memory of a similarly equipped database server, as the final HTML page will be ~7.5k bigger than the database data.
One solution is to lower the options available to the user to make the pages more static (unpopular)
If performance is the criterion, I suggest a proxy isn't a good idea. Instead, the memory otherwise used in a proxy would be better utilised caching database data directly, either as a ramdisk or perhaps as network-attached database storage with plenty of solid-state memory.
Probably the only solution apart from splitting the various wikis to more servers
From what I have gathered, the cost (the limiting factor on performance) is the delay in seeking fine-grained data. Either this seek load will need to be spread across many mechanical devices, such that the work is not unduly duplicated, or the fine-grained data should be stored in solid state storage so that it can be reached quickly.
Or data replication over more/many db servers.
We really do need to spread the load (high-power processors are less important than high-speed disk).
As things are at the moment, to cache the whole db for full performance we would need a server with 50 gig of RAM (and we cannot even get 4 gig working yet).
So as the next best thing, I believe the suggested hot-swap disk array (two of them) is needed for both performance and reliability, to make two fast db servers.
plus Squids, Apaches, DNS round robin, etc.
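(As a quick sanity check on how much of the db is actually being served from RAM today, something like the following would show it; the variable names assume MyISAM tables and a default client login.)

# MyISAM index cache size and how often reads miss it
mysql -e "SHOW VARIABLES LIKE 'key_buffer_size';"
mysql -e "SHOW STATUS LIKE 'Key_read%';"   # compare Key_reads to Key_read_requests
# OS-level view: the 'cached' column is what the kernel holds of the data files
free -m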
Dave Caroline aka archivist
Dave Caroline wrote:
The delays in the wikiserver system are caused by waiting for I/O- the time taken for mechanical devices to seek a particular block of data. If the data is being served from a squid cache rather than from a cache on the wiki server, how will this reduce the overall I/O blocking problem?
Agreed
I think you're both wrong about this. This is not where the bottleneck is, when Geoffrin is running. It *is* where the bottleneck is right now, because we're limping along on broken hardware.
--Jimbo