Gabriel Wicke wrote:
I agree with all you're saying and like the thought of having a global cluster with arbitration, but I have some doubts:
- What's the minimum hardware capable of running the databases, the web server, the cache, etc.? Is all this possible on a cheap unit while still being fast? I would expect a RAM requirement of at least 4 GB, but I might be wrong. This would certainly increase once more languages start to grow, so it might be necessary to have separate machines for separate languages.
This depends on the size of the data sets. The busiest fine-grained data can be held in memory, e.g. on a ramdisk. A machine with dual 64-bit Opterons, 8 GB of RAM and an Eric remote administration card weighs in at around US$4500, and can be upgraded to 16 GB. A 64-bit CPU is necessary for machines with more than 4 GB of RAM, since 4 GB (2^32 bytes) is the addressing limit of a 32-bit processor.
I posted a possible hardware config to wikitech-l on 01/01/04 14:16.
- With the number of nodes increasing, replication traffic might be fairly high (imagine a mass undo of somebody's changes being replicated to ten machines).
- Encryption of replication traffic will drain the CPU; even a simple scp does this. Imagine the same for ten streams.
A compressed scp connection using the Blowfish cipher, transferring English text between two AMD Athlon XP 2200+ CPUs, gives a throughput of 3.3 MB/s. The majority of the load is on the sending machine. A dual Opteron 240 with a 64-bit-optimised cipher implementation could probably transfer 12 MB/s or more.
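For anyone who wants to repeat that measurement, here is a rough Perl timing sketch. The dump file name and destination host are placeholders, and -C / -c blowfish are just the standard OpenSSH compression and cipher switches:

#!/usr/bin/perl
# Rough throughput test for a compressed, Blowfish-encrypted copy.
# The dump file name and destination host below are placeholders.
use strict;
use warnings;
use Time::HiRes qw(time);

my $file = shift || 'cur_table_dump.sql';      # sample text dump to push
my $size = -s $file or die "cannot stat $file\n";

my $t0 = time;
system('scp', '-C', '-c', 'blowfish', $file, 'replica.example.org:/tmp/') == 0
    or die "scp failed\n";
my $elapsed = time - $t0;

printf "%.2f MB in %.1f s = %.2f MB/s\n",
    $size / 1e6, $elapsed, $size / $elapsed / 1e6;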
If no single machine is critical and the machines are widely separated, we would not even need to worry about whether they are equipped with UPSes or redundant power supplies.
If the switchover is quick, this would be perfect: no need for separate backups and so on.
To get an idea of the hardware requirements, it would be nice if somebody could install all of Wikipedia on a cheap box and do some load testing on it (if possible with replication).
I agree.
Running the fine-grained database (the article database, not media or graphics) from a ramdisk could eliminate I/O blocking and wear on the hard drives, possibly increasing performance massively. We should really experiment with this.
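As a very crude starting point, something like the following compares sequential read speed from the on-disk table files against a copy held on a ramdisk (/dev/shm or a mounted tmpfs). Both paths are placeholders, and the OS page cache will flatter the disk figure on repeat runs, so a proper test should use the real query mix rather than raw reads:

#!/usr/bin/perl
# Crude comparison of sequential read throughput: disk copy vs. ramdisk copy.
# Both paths are placeholders for wherever the table files actually live.
use strict;
use warnings;
use Time::HiRes qw(time);

sub read_speed {
    my ($path) = @_;
    my $size = -s $path or die "cannot stat $path\n";
    open my $fh, '<', $path or die "open $path: $!\n";
    my $buf;
    my $t0 = time;
    1 while read($fh, $buf, 1 << 20);      # read in 1 MB chunks
    close $fh;
    return $size / (time - $t0) / 1e6;     # MB/s
}

printf "disk:    %.1f MB/s\n", read_speed('/var/lib/mysql/wikipedia/cur.MYD');
printf "ramdisk: %.1f MB/s\n", read_speed('/dev/shm/cur.MYD');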
Do we have a means to replay typical Wikipedia activity from a log file? I am thinking of a Perl script which reads a real Wikipedia common log file and replays the same load pattern at definable speeds.
If someone has a pre-configured server image and some real log files, I can do this.
I have already written a simple Perl script which can be modified to replay server load in real time from log files.
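For anyone who wants to build on that idea, a minimal sketch of such a replayer is below. It assumes Apache common log format on standard input and a made-up test server URL (testbox.example.org), and it only replays GET requests; it is an illustration, not the existing script:

#!/usr/bin/perl
# Minimal log-replay sketch: reads a common-format access log on stdin and
# re-issues the GET requests at the original pace, scaled by a speed factor.
# The test server URL is a placeholder.
use strict;
use warnings;
use Time::Local qw(timegm);
use Time::HiRes qw(sleep);
use LWP::UserAgent;

my $speed = shift || 1;                        # 2 = replay twice as fast
my $base  = 'http://testbox.example.org';      # hypothetical test machine
my $ua    = LWP::UserAgent->new(timeout => 30);
my %mon;
@mon{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

my $prev;
while (my $line = <>) {
    # e.g. 1.2.3.4 - - [01/Jan/2004:14:16:05 +0000] "GET /wiki/Main_Page HTTP/1.0" 200 ...
    next unless $line =~
        m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) [^\]]*\] "GET (\S+)};
    my ($d, $m, $y, $H, $M, $S, $path) = ($1, $2, $3, $4, $5, $6, $7);
    my $t = timegm($S, $M, $H, $d, $mon{$m}, $y - 1900);
    sleep(($t - $prev) / $speed) if defined $prev && $t > $prev;
    $prev = $t;
    $ua->get($base . $path);                   # fire the same request at the test box
}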