I thought I would throw out some ideas for configurations for new Wikipedia systems:
Assuming we have a fast and reliable infrastructure for Wikipedia to operate on, I would hope and expect many more people to benefit from the great project. The new hardware configuration needs redundancy and reliability built in.
The DNS system is, by nature, distributed and designed to have many systems adding redundancy to the resolution service.
A single IP address will always be a single point of failure, as the routers leading up to the physical IP destination will take some time to propagate a new physical destination for the IP address.
It therefore makes sense to have redundancy switching at the DNS level, with systems located at different network locations offering equivalent service.
This brings issues of updates and authentication into question. The slave machines cannot authenticate or take database updates. A mechanism is needed to automatically nominate a machine as master or slave; the DNS system and each machine would respond to this.
Each wiki server, when booted, will, by default, be a slave. Each wiki server will periodically request master status from the arbitration server. The arbitration server could use an algorithm like this:
a) Each wiki server requests master status every 5 minutes.
b) The arbitration/master DNS server checks whether the wiki server making the request was the last server to have the request granted. If it was, the request is granted again.
c) If the server making the request is not the wiki server that last had the request granted, then if the last grant was >10 minutes ago, the request is granted. Otherwise, the request is denied.
d) If the master wiki server makes a request and does not receive a grant (including if it receives no answer at all), it demotes itself to slave.
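As a rough illustration, the grant logic on the arbitration server could be a short shell script along these lines. This is an untested sketch: the 5- and 10-minute figures come from the algorithm above, but the paths and the state-file layout are invented.

    #!/bin/sh
    # Sketch of the grant algorithm. Called with the requesting wiki
    # server's IP address in $1; exits 0 to grant master status, 1 to deny.
    STATE=/var/lib/arbiter/last-grant    # holds "<ip> <unix time>" of the last grant
    REQ_IP=$1
    NOW=`date +%s`

    read LAST_IP LAST_TIME < $STATE 2>/dev/null

    # b) same server as last time: renew the grant
    if [ "$REQ_IP" = "$LAST_IP" ]; then
        echo "$REQ_IP $NOW" > $STATE
        exit 0
    fi

    # c) different server: grant only if the last grant is >10 minutes old
    if [ `expr $NOW - ${LAST_TIME:-0}` -gt 600 ]; then
        echo "$REQ_IP $NOW" > $STATE
        exit 0
    fi
    exit 1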
A specific host name is used for all updates: the 'master hostname'. After each grant, the IP address of the master hostname is compared with the IP address of the machine which just received a grant. If it differs, the IP address of the master hostname is changed. This change is pushed to the slave DNS servers using the ordinary BIND8 protocol. The DNS TTL for the master wiki server's domain name would be set to 300 seconds. If a master went down for any reason, everyone should see the alternative master within 20 minutes.
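The compare-and-update step on the arbitration/DNS server might then look something like this (sketch only: the zone file path and record layout are made up, and the zone serial bump is glossed over):

    # Run after each grant, with the newly granted master's IP in $1.
    MASTER_HOST=updates.wikipedia.org
    NEW_IP=$1
    CUR_IP=`host $MASTER_HOST | awk '{print $NF}'`   # what the name resolves to now

    if [ "$NEW_IP" != "$CUR_IP" ]; then
        # rewrite the 300-second A record; a real script would also bump
        # the zone serial so the secondaries pick up the change
        sed -i "s/^updates.*IN A.*/updates 300 IN A $NEW_IP/" /etc/bind/wikipedia.org.zone
        ndc reload                                   # BIND8's control program
    fi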
If the arbitration server failed, all wikis would remain available for read access.
Redundancy switching: exactly the same system can take care of system redundancy. The arbitration server takes each request for master status as an indication that the wiki server is ready to handle queries. To complement the master hostname there is a 'slave hostname'. The slave hostname resolves to the IP address of any machine which is ready to handle a query; the DNS server may have multiple IP address entries for it. The query load will be spread evenly, round-robin, between all machines whose IP addresses are registered against the slave hostname.
If a machine fails to request master status for 10 minutes, its IP address entry is de-registered from the slave hostname and all DNS servers are updated using the normal BIND protocol. If a machine goes off-line, then after 15 minutes no more requests should be sent to it until it is working again.
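Keeping the round-robin pool current could reuse the same machinery. A sketch, with invented paths (the live-server list would be refreshed by the grant script on every successful request):

    #!/bin/sh
    # Regenerate the slave hostname's A records from the list of live servers.
    LIVE=/var/lib/arbiter/live-servers   # one IP per line
    FRAG=/etc/bind/slave-pool.inc        # $INCLUDEd from the zone file

    {
        echo '; auto-generated, do not edit'
        while read ip; do
            echo "slave 300 IN A $ip"    # one A record per live machine
        done < $LIVE
    } > $FRAG

    ndc reload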
This system provides totally automatic redundancy switching, master arbitration and master/slave database selection.
-- Server requirements for fine-grained databases: The database containing the text of each article is fine-grained, each article's text being a few K in size. The cost of putting this in memory would be small; the benefits of avoiding hard-disk head movements, large.
Hard drive head movements are expensive, both in terms of I/O time (and therefore server performance) and mechanical wear and tear / reliability. As the number of articles increases, the number of seeks across a disk surface increases. By copying the fine-grained database to a ramdisk when the machine boots, and using the ramdisk image for all queries, much load is removed from the hard drive. System performance should improve (at least when first powered on) by an order of magnitude. The ramdisk database must be replicated to another database located on the physical hard disk: 1) so that when the machine boots, a database image is available to load into the ramdisk, and 2) to keep a more solid copy than the ephemeral ramdisk image.
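A minimal boot-time sketch of the ramdisk arrangement (the mount point, paths and 15-minute copy interval are assumptions, not a tested recipe):

    #!/bin/sh
    # Load the fine-grained database into a ramdisk at boot, and copy it
    # back to the physical disk periodically.
    mount -t tmpfs -o size=6g tmpfs /mnt/ramdb
    cp -a /var/lib/db-disk/articles/. /mnt/ramdb/     # on-disk image -> ramdisk

    # ...point the database server's data directory at /mnt/ramdb, then:
    while true; do
        rsync -a /mnt/ramdb/ /var/lib/db-disk/articles/   # keep the solid copy fresh
        sleep 900
    done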
Dual 64 bit Opteron mainboards are available with 8 DIMM sockets capable of taking 16Gb. eg: http://www.tyan.com/products/html/thunderk8w_spec.html
If the system will be using >4Gb of memory, moving to 64-bit is a good idea; 4Gb is the addressable limit for 32-bit. Mainboard, $432: http://shopper.cnet.com/Tyan_Thunder_K8W_S2885ANRF___mainboard___extended_AT...
Memory throughput is probably more important than CPU MHz. 2x AMD Opteron 240 @ $220 each = $440.
8x 1Gb Registered ECC, $300 each: http://www.crucial.com/store/listModule.asp?module=DDR+PC2700&cat=RAM&am...
Mainboard and CPU solution: dual 64-bit Opteron with HyperTransport and 8Gb of RAM: US$3272. Eric card, $715: http://us.daxten.com/overview.cfm?prodID=37
Add ancillaries (a good PSU, two HDDs, and a CD-ROM drive with Knoppix left in for remote system repair via the Eric card) and the total is around $4500. Two of these to provide redundancy comes to around $9000.
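Itemised, the figures above work out roughly as follows (the ancillaries line is an estimate to reconcile the two totals):

    Mainboard (Tyan Thunder K8W)       $432
    2x AMD Opteron 240                 $440
    8x 1Gb Registered ECC              $2400
                                      -----
    Board, CPUs and RAM                $3272
    Eric card                          $715
    PSU, 2x HDD, CD-ROM (est.)        ~$500
                                      -----
    Per machine                       ~$4500   (x2 for redundancy = ~$9000)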
Nick Hill wrote:
A specific host name is used for all updates.
For example, updates.wikipedia.org would receive form data from any update. Which physical server / IP address this goes to depends upon the result of the negotiation system I previously described.
The 'master hostname': after each grant, the IP address of the master hostname is compared with the IP address of the machine which just received a grant.
To put it more clearly: if a different machine has just been granted master status, the IP address which updates.wikipedia.org resolves to will change to the IP address of that machine. BIND reloads and the change is pushed to the secondary DNS servers.
If it differs, the IP address of the master hostname is changed. This change is pushed to the slave DNS servers using the ordinary BIND8 protocol. The DNS TTL for the master wiki server's domain name would be set to 300 seconds. If a master went down for any reason, everyone should see the alternative master within 20 minutes.
Each wiki server has a cron job which periodically runs a shell script. This shell script first checks that the web server and database are running as expected by performing an HTTP GET and comparing the result with the expected result. If they match, the database and web server are considered working. If this fails, the script exits. If the script keeps exiting, the arbitration server will never receive notification that the machine is working, and will remove the IP address of the failing wiki server from the DNS record.
If the HTTP GET is successful, the shell script goes on to invoke SSH to connect to the arbitration server and run a script there. The return status of the script on the arbitration server is communicated as the return value of the ssh command to the script running as a cron job on the wiki server. This return value tells the wiki server whether it has been granted master status.
The script running on the remote machine updates the hosts file for the DNS server according to the algorithms previously described.
All this can be achieved using cron, shell scripting, perhaps a little Perl for editing the hosts file on the arbitration server (although this can be done in shell too), and SSH with public/private keys for communication between the machines.
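For instance, the cron-driven script on each wiki server might amount to something like this. The hostnames, URLs and file names are placeholders, not a working implementation:

    #!/bin/sh
    # Run from cron every 5 minutes on each wiki server.
    EXPECTED=/etc/wiki/healthcheck.expected

    # 1) verify web server and database with an HTTP GET
    GOT=`wget -q -O - http://localhost/wiki/healthcheck.php` || exit 1
    [ "$GOT" = "`cat $EXPECTED`" ] || exit 1   # broken: stay silent, get de-registered

    # 2) ask the arbitration server for master status; the remote script's
    #    exit status comes back as ssh's return value
    if ssh arbiter.wikipedia.org /usr/local/bin/request-master `hostname -i`; then
        touch /var/run/wiki-is-master          # granted: act as master
    else
        rm -f /var/run/wiki-is-master          # denied or no answer: demote to slave
    fi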
If the arbitration server failed, all wikis would remain available for read access only.
The arbitration server is the same machine as the master DNS server. The secondary DNS servers should be running a recent version of BIND, configured to accept push updates from the master.
Whenever a machine tries to obtain master status and fails, the IP address of the current master server is compared to the IP address of the current master MySQL server. If it differs, the config file is automatically updated and MySQL reloaded.
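That check might be as simple as the following sketch (the master-host option is as in MySQL 3.x/4.x config files; paths and hostnames are invented):

    # On a failed master request, repoint replication if the master
    # hostname now resolves to a different machine.
    NEW=`host updates.wikipedia.org | awk '{print $NF}'`
    CUR=`grep '^master-host' /etc/my.cnf | cut -d= -f2 | tr -d ' '`

    if [ "$NEW" != "$CUR" ]; then
        sed -i "s/^master-host.*/master-host = $NEW/" /etc/my.cnf
        /etc/init.d/mysql restart              # pick up the new master
    fi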
I have uploaded a diagram in PDF and OpenOffice Draw formats illustrating the redundant system idea.
http://www.nickhill.co.uk/wikipedia-redundant-system.pdf http://www.nickhill.co.uk/wikipedia-redundant-system.sxd
Nick Hill wrote:
I have uploaded a diagram in PDF and OpenOffice Draw formats illustrating the redundant system idea.
Hm. You said you were trying to eliminate single points of failure, but in this system it seems that what you call the "Arbitration Script" would be such a single point of failure?
I'm probably missing something. Sorry if I am.
Timwi
Timwi wrote:
Nick Hill wrote:
I have uploaded a diagram in PDF and OpenOffice Draw formats illustrating the redundant system idea.
Hm. You said you were trying to eliminate single points of failure, but in this system it seems that what you call the "Arbitration Script" would be such a single point of failure?
I'm probably missing something. Sorry if I am.
If the arbitration script failed, the master would cease to accept updates, but all the wikis would still be available for read-only access.
On Thu, Jan 01, 2004 at 02:16:21PM +0000, Nick Hill wrote:
I thought I would throw out some ideas for configurations for new Wikipedia systems:
Assuming we have a fast and reliable infrastructure for Wikipedia to operate on, I would hope and expect many more people to benefit from the great project. The new hardware configuration needs redundancy and reliability built in.
The DNS system is, by nature, distributed and designed to have many systems adding redundancy to the resolution service.
[....]
The idea of having a dedicated http server for updates relies on DNS entries with a very short time to live (TTL). Otherwise, DNS servers all over the world would cache the DNS entry for updates.wikipedia.org for a long time. Most DNS servers handle entries with a short TTL correctly, but several browsers don't: they cache IP addresses but don't honor the TTL. Mozilla is one of those. My home box has a dyndns.org hostname, and Mozilla caches the IP for several days despite the TTL being 5 minutes or so.
I think it's better to have the web servers load balanced by a tool like http://www.linuxvirtualserver.org/ and have MySQL database replication between two database servers.
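For the web tier that could be as simple as a round-robin LVS setup with ipvsadm (IPs hypothetical; just a sketch of the idea):

    # Virtual HTTP service on the balancer, round-robin over two real
    # servers using direct routing.
    ipvsadm -A -t 10.0.0.1:80 -s rr
    ipvsadm -a -t 10.0.0.1:80 -r 10.0.0.11 -g
    ipvsadm -a -t 10.0.0.1:80 -r 10.0.0.12 -g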
Regards,
JeLuF
Jens Frank wrote:
The idea of having a dedicated http server for updates relies on DNS entries with a very short time to live (TTL). Otherwise, DNS servers all over the world would cache the DNS entry for updates.wikipedia.org for a long time. Most DNS servers handle entries with a short TTL correctly, but several browsers don't: they cache IP addresses but don't honor the TTL. Mozilla is one of those. My home box has a dyndns.org hostname, and Mozilla caches the IP for several days despite the TTL being 5 minutes or so.
A way to overcome this problem is to add a random fourth-level domain name to the URLs pointing to the update server. This way, each time the update server is referenced, the IP address is refreshed.
e.g., where <anything> is replaced by a random string:
http://<anything>.update.wikipedia.org/ resolves via a wildcard DNS entry to the current IP address of update.wikipedia.org. This way, whenever an update link is pressed, the most up-to-date IP address is fetched from the DNS.
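Sketched out (one wildcard zone entry covers every random label; how the random string is generated doesn't matter much):

    # In the zone file:
    #   *.update 300 IN A <current master IP>
    #
    # Server-side, each rendered page would link to a fresh random
    # hostname, e.g.:
    RAND=`date +%s%N | md5sum | cut -c1-8`
    echo "http://$RAND.update.wikipedia.org/"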
What about sticking the entire cluster on private IPs and having a load-balancing firewall appliance handle traffic flow which would randomly hit the update box?
On Fri, Jan 02, 2004 at 12:59:14PM -0800, Tim Thorpe wrote:
What about sticking the entire cluster on private IPs and having a load-balancing firewall appliance handle traffic flow which would randomly hit the update box?
I think Nick is trying to have a global cluster, and you cannot achieve this with a simple load balancer. If one data center goes down, the load balancer would go down, too.
DNS has ways to handle the outage of a server, and Nick is trying to use them.
JeLuF
Jens Frank wrote:
What about sticking the entire cluster on private IPs and having a load-balancing firewall appliance handle traffic flow which would randomly hit the update box?
I think Nick is trying to have a global cluster, and you cannot achieve this with a simple load balancer. If one data center goes down, the load balancer would go down, too.
That's right. Nick's proposal, though, solves a problem that we don't actually have. It is of course possible for an entire data center to go down, but it's extremely rare and should be close to the bottom of our list of problems to solve.
The much more likely scenario -- one that we live with every day -- is that a single webserver will fall over. After we get that problem solved -- basically with a load-balancing firewall appliance, the linuxvirtualserver.org stuff -- then we can also consider such things as global clustering and locating servers all over the planet or whatever.
--Jimbo
Tim Thorpe wrote:
What about sticking the entire cluster on private IPs and having a load-balancing firewall appliance handle traffic flow which would randomly hit the update box?
There would be several points of failure with such a system.
To make a system really reliable, all single points of failure need to be removed.
E.g. building-specific hazards: power loss (UPSes sometimes fail), network cables cut, fire, burglary, landlord repossession, hosting company bankruptcy, malicious attack, human error, plane crash, etc.
Machine-specific hazards: any single machine in the system failing and bringing everything down, through either hardware failure or malicious attack.
Network hazards: any single segment of the network failing and bringing the system down, through hardware failure, human error or malicious attack.
I believe a design philosophy where the system is immune from any single element failing is both the most cost-effective and the most reliable. Rather than invest heavily for reliability in mission-critical systems, make no system mission critical. No system then needs to have mission-critical investment. The overall system will then be cheaper and more reliable.
To put it another way: all systems will fail. The probability of a single costly, reliable unit failing is still fairly high. The probability of many fairly reliable cheap units with no common point of failure breaking down simultaneously is much lower than the probability of a costly reliable unit failing.
If no single machine is critical and machines are widely separated, we would not even need to worry whether the machines are equipped with UPS or redundant supplies.
On Fri, 02 Jan 2004 23:34:14 +0000, Nick Hill wrote:
The probability of many fairly reliable cheap units with no common point of failure breaking down simultaneously is much lower than the probability of a costly reliable unit failing.
I agree with all you're saying and like the thought of having a global cluster with arbitration, but I have some doubts:
* What's the minimum hardware capable of running the databases, the webserver, the cache etc.? Is all this possible on a cheap unit while still being fast? I would expect a RAM requirement of at least 4Gb, but I might be wrong. This would certainly increase once more languages start to grow, so it might be necessary to have separate machines for separate languages.
* With the number of nodes increasing, replication traffic might be fairly high (imagine mass undoing somebody's changes replicated to ten machines)
* Encryption of replication traffic will drain the CPU; even a simple scp does this. Imagine the same for ten streams.
If no single machine is critical and machines are widely separated, we would not even need to worry whether the machines are equipped with UPS or redundant supplies.
If the switchover is quick, this would be perfect- no need for separate backups and so on.
To get an idea of the hardware requirements, it would be nice if somebody could install all of Wikipedia on a cheap box and do some load testing on it (if possible with replication).
Gabriel Wicke
Gabriel Wicke wrote:
I agree with all you're saying and like the thought of having a global cluster with arbitration, but I have some doubts:
- What's the minimum hardware capable of running the databases, the webserver, the cache etc.? Is all this possible on a cheap unit while still being fast? I would expect a RAM requirement of at least 4Gb, but I might be wrong. This would certainly increase once more languages start to grow, so it might be necessary to have separate machines for separate languages.
This depends on the size of the data sets. The busiest fine-grained data can be held in memory, e.g. on a ramdisk. A machine with dual 64-bit Opterons, 8Gb of RAM and an Eric remote administration card weighs in at around US$4500, and can be upgraded to 16Gb. 64-bit is necessary for machines with over 4Gb; 4Gb is the addressable limit for 32-bit.
I posted a possible hardware config to wikitech-l on 01/01/04 14:16.
- With the number of nodes increasing, replication traffic might be fairly high (imagine mass undoing somebody's changes replicated to ten machines).
- Encryption of replication traffic will drain the CPU; even a simple scp does this. Imagine the same for ten streams.
A compressed SCP connection using the blowfish cipher for English text between two AMD Athlon XP2200+ CPUs gives a throughput of 3.3 megabytes/sec. The majority of the load is on the sending machine. A dual Opteron 240 with a 64-bit-optimised cipher implementation can probably transfer 12 megabytes/sec or more.
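For anyone who wants to reproduce the measurement, roughly (standard OpenSSH flags; better to use a real article dump than zeros, which compress unrealistically well):

    dd if=/dev/zero of=/tmp/testfile bs=1M count=100   # stand-in test file
    time scp -C -c blowfish /tmp/testfile otherhost:/tmp/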
If no single machine is critical and machines are widely separated, we would not even need to worry whether the machines are equipped with UPS or redundant supplies.
If the switchover is quick, this would be perfect- no need for separate backups and so on.
To get an idea of the hardware requirements, it would be nice if somebody could install all of Wikipedia on a cheap box and do some load testing on it (if possible with replication).
I agree.
Running the fine-grained database (the article database, not media or graphics) from a ramdisk could eliminate I/O blocking and wear on hard drives, possibly increasing performance by a massive degree. We should really experiment with this.
Do we have a means to replay typical Wikipedia activity from a log file? I am thinking of a Perl script which reads a real Wikipedia common log file, replaying the same load pattern at definable speeds.
If someone has a pre-configured server image and some real log files, I can do this.
I have already written a simple Perl script which can be modified to replay server load in real time from log files.
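Not the Perl script itself, but the idea fits in a few lines of shell (target host in $1; the pacing here is crude, a real replayer would honour the log timestamps):

    #!/bin/sh
    # Replay request paths from a common-format log against a test server.
    TARGET=$1
    awk '{print $7}' access.log | while read path; do
        wget -q -O /dev/null "http://$TARGET$path" &
        sleep 0.1
    done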
On Sat, 03 Jan 2004 11:41:59 +0000, Nick Hill wrote:
Do we have a means to replay typical Wikipedia activity from a log file? I am thinking of a Perl script which reads a real Wikipedia common log file, replaying the same load pattern at definable speeds.
If someone has a pre-configured server image and some real log files, I can do this.
I have already written a simple Perl script which can be modified to replay server load in real time from log files.
http://www.cs.virginia.edu/~rz5b/software/logreplayer-manual.htm could be useful, or http://opensta.org/.
Even more useful would be a full incoming traffic dump and a full disk image of the start condition (it's mainly the editing that is interesting). tcpdump and tcpreplay might be good candidates for this.
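Roughly (interface names are placeholders; older tcpreplay versions take -i, newer ones use --intf1):

    tcpdump -i eth0 -s 0 -w edits.pcap 'tcp port 80'   # full packets on the live box
    tcpreplay -i eth1 edits.pcap                       # replay against the test box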
If it turns out that the minimum server for the current database would indeed be a 64-bit, 8Gb 'cheap' machine, then it might get pretty hard when the database grows (especially with multiple languages etc.). Just an import of http://www.meyers-konversationslexikon.de/ into the German wikimedia (not to think of its nice illustrations) would create problems, I guess.
A compressed SCP connection using the blowfish cipher for English text between two AMD Athlon XP2200+ CPUs gives a throughput of 3.3 megabytes/sec. The majority of the load is on the sending machine. A dual Opteron 240 with a 64-bit-optimised cipher implementation can probably transfer 12 megabytes/sec or more.
The sending machine would have to create compressed and encrypted streams for however many mirrors there are. It's not its main job, but it is something that takes CPU time away from the database work etc. How high was the CPU load of just the scp process on the Athlon?
Gabriel Wicke
The obstacle is the DB server; a 24/7 offsite dump of the DB server would effectively double the bandwidth used for each transaction. I work for an ISP; N+1 takes care of most DC-internal issues, and the DC being used is Verio, which is a VERY large hosting company -- I don't see them going down any time soon. Some advanced routers have a dial-up redundancy capability where a router can phone an off-site router on a separate land line to inform it that there is a network issue on one end and that all data needs to be re-routed to the secondary stack.
You can never eliminate all single points of failure, but I also see that some of us are losing our heads when it comes to solutions. Some suggested solutions would cost not in the tens of thousands but hundreds of thousands to implement.
Remember the golden rule of network engineering: KEEP IT SIMPLE, STUPID! ;).
On Sat, 03 Jan 2004 11:37:04 -0800, Tim Thorpe wrote:
The obstacle is the DB server; a 24/7 offsite dump of the DB server would effectively double the bandwidth used for each transaction. I work for an ISP; N+1 takes care of most DC-internal issues, and the DC being used is Verio, which is a VERY large hosting company -- I don't see them going down any time soon. Some advanced routers have a dial-up redundancy capability where a router can phone an off-site router on a separate land line to inform it that there is a network issue on one end and that all data needs to be re-routed to the secondary stack.
You can never eliminate all single points of failure, but I also see that some of us are losing our heads when it comes to solutions. Some suggested solutions would cost not in the tens of thousands but hundreds of thousands to implement.
Remember the golden rule of network engineering: KEEP IT SIMPLE, STUPID! ;).
I agree.
The idea of a WikiTella thing is fascinating but seems to be very hard to implement now.
Factors that would help WikiTella:
* hardware performance grows really fast
* bandwidth gets really cheap
If somebody managed to get a prototype of this working under load and with small bandwidth requirements I would be all for this solution, but I have some doubts that this will happen soon. A 'cheap computer' that would work in such a setup would most probably need to be quite a few times quicker than the current cheap ones, and Wikipedia's demand might grow quicker than cheap computers' performance. Mirrors could hardly be simple (old) boxes provided by a university or ISP; they would have to be brand-new machines bought by Wikipedia or sponsored. Or a similar setup to the current one, with multiple machines and the associated administration work (but this might buy more horsepower for the money).
Gabriel Wicke
Nick Hill wrote:
A single IP address will always be a single point of failure, as the routers leading up to the physical IP destination will take some time to propagate a new physical destination for the IP address.
It therefore makes sense to have redundancy switching at the DNS level, with systems located at different network locations offering equivalent service.
At some point, yes, I agree completely. But it's important that we focus our energy on solving problems that we actually have, rather than problems which are small and unlikely. We have no plans to host anywhere other than professional facilities with full redundancy of everything important, and so the chances of a failure of the type you are imagining impacting us are very small.
To improve reliability, it makes sense to focus our efforts on the sorts of things that are most likely to go wrong, first. And then, later, if we can afford it and if it makes economic sense, we can focus on the more esoteric risks.
Dual 64 bit Opteron mainboards are available with 8 DIMM sockets capable of taking 16Gb. eg: http://www.tyan.com/products/html/thunderk8w_spec.html
Right, we have one of those (though broken at the moment). Right now, our entire db should fit comfortably in 4 gig of RAM, which we have.
But I totally agree with you about the importance of avoiding the hard drive as much as possible.
--Jimbo
Dual 64 bit Opteron mainboards are available with 8 DIMM sockets capable of taking 16Gb. eg: http://www.tyan.com/products/html/thunderk8w_spec.html
Right, we have one of those (though broken at the moment). Right now, our entire db should fit comfortably in 4 gig of RAM, which we have.
--Jimbo
the broken mobo is a Tyan??
On Mon, 5 Jan 2004, Nikos-Optim wrote:
the broken mobo is a Tyan??
Where UPS is involved, nothing is safe. These things happen. (I'll assume someone checked Tyan's approved memory list to make sure it's not just something lame with the memory making it unusable on that MB.)
--Ricky
Ricky Beam wrote:
On Mon, 5 Jan 2004, Nikos-Optim wrote:
the broken mobo is a Tyan??
Where UPS is involved, nothing is safe. These things happen. (I'll assume someone checked Tyan's approved memory list to make sure it's not just something lame with the memory making it unusable on that MB.)
Actually, I'm not 100% sure what the mobo is, and Jason's not around right now for me to ask. My point was, we have a motherboard that can accept up to 16 gig of ram. We have 4 gig, and plenty of slots available for further expansion.
And yes, the RAM is perfectly fine, and Jason has tested it. What he discovered is that one slot on the mobo is bad. So Penguin is sending a new one for us.
--Jimbo
The MB is an Arima HDAMA: http://www.rioworks.com/Download/HDAMA.htm
Jason