It's quite evident that wikimedia's current network is a mess. We have three rather dumb (but cheap) netgear gigabit switches that offer some manageability features. However, no one seems to know what's on which port. The interfaces to manage them seem rather limited (I don't know - I don't have access to them), and the features they do have fall short of what's needed to build a large network. It's quite clear that nobody really likes these switches, and everyone would like to buy other, better ones now that we've run out of switch ports again.
As the projected server count for wikimedia in December 2005 is about 500 servers, it's time to start properly planning the design of the network. As the complexity of the network increases, remote manageability becomes more important. Most admin duties happen remotely, while only Jimbo has physical access to the actual hardware. This way, admins are restricted to telling him what to do, whenever he has time for it and is on location. This really delays and complicates things, so I think it would be good to make sure we build a network that makes REMOTE management as easy and flexible as possible, and keep required physical changes to a minimum.
This does require switches that are more expensive than the current ones, and it is rather hard to justify the cost for them. Technically, wikimedia projects CAN run on cheap, unmanaged switches, since they DO the most important part of the job: switching, at gigabit speeds. However, with every new server and extra switch, remote manageability is getting harder, and consequently, the network is becoming a mess. Admins don't know exactly what's on a port. It's barely documented, and they can't find out through the switch's management interfaces either. In case of network problems, there are hardly any graphs, logs or other sources of information to find out what's going on. The current setup is feasible when one has < 24 ports, but it gets really messy when the network grows...
I think we need at least layer 2 switches with basic manageability features. Basic as in what's basic in any medium to large company network these days. Some features we really could use are:
* serial ports, so we can manage them out of band through a console server even if the network is down
* SNMP, so we can properly graph statistics of the switch itself, and the individual ports. VERY helpful in case of problems... (see the sketch after this list)
* spanning tree (STP), especially helpful in large networks with remote management
* VLANs and 802.1Q support. Allows one set of switches to be used for multiple virtual LANs, and allows for more flexible and cost effective use of resources, remotely, WITHOUT changes to the physical network setup
* Diagnostic information from the switch's console - port descriptions, port statistics, port status, mac address information, vlan assignments, error rates, etc
* Syslog logging, so we notice what's going on
* centralized administration, so we don't have to manually copy everything to each and every switch
* upgradeable firmware with long term support
* Port trunking/aggregation, for high bandwidth or redundancy needs
* IGMP/multicast support, could be helpful on a large network too
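To give an idea of what the SNMP point buys us in practice, here is a minimal sketch of a poller that reads per-port traffic counters for graphing. It assumes the net-snmp command line tools and a read-only community string; the hostname, community and port numbers are made-up placeholders, not our actual setup.

#!/usr/bin/env python
# Minimal sketch: poll per-port traffic counters from a switch over SNMP,
# suitable for feeding into MRTG/RRDtool style graphs.
# Assumes the net-snmp "snmpget" tool is installed and the switch has a
# read-only community string -- the hostname, community and ifIndex values
# below are placeholders, not a description of the actual setup.
import subprocess
import time

SWITCH = "switch1.example.org"   # hypothetical switch hostname
COMMUNITY = "public"             # hypothetical read-only community
PORTS = [1, 2, 3]                # ifIndex values of the ports to graph

def if_in_octets(host, community, ifindex):
    """Return the ifInOctets counter for one port as an integer."""
    out = subprocess.check_output([
        "snmpget", "-v2c", "-c", community, "-Oqv", host,
        "IF-MIB::ifInOctets.%d" % ifindex,
    ])
    return int(out.strip())

if __name__ == "__main__":
    for port in PORTS:
        octets = if_in_octets(SWITCH, COMMUNITY, port)
        print("%s port %d: %d octets in (sampled %s)"
              % (SWITCH, port, octets, time.ctime()))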
While the current netgear switches do have a few of the features mentioned above, it's all too limited, too restricted, and too non standard to be useful in a large network.
Wikimedia also needs SOME layer 3 and layer 4 features, but these are less important, and generally MUCH more expensive, so I don't think we can really justify doing this using switch hardware:
Layer 3 routing. While we (intend to) have at least two different vlans/networks, an external and an internal one, some traffic needs to be routed/NATed between them. This does NOT involve actual wikimedia client traffic, but it does involve some traffic needed for management of the servers, like retrieving software updates, sending mail, etc. This won't be a lot of traffic, and we could do this using NAT on some server, or for example on an LVS load balancing box (more on this later).
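For what it's worth, the "NAT on some server" option needs nothing exotic. A rough sketch of what it amounts to on a Linux box, driven from Python for illustration; the interface name and internal subnet are assumptions, not our real addressing:

#!/usr/bin/env python
# Sketch of the "NAT on some server" option: masquerade outbound traffic
# from the internal vlan so those servers can fetch updates and send mail.
# The interface name and internal subnet are assumptions for illustration,
# not the real wikimedia addressing.
import subprocess

EXTERNAL_IF = "eth0"          # hypothetical interface on the external vlan
INTERNAL_NET = "10.0.0.0/24"  # hypothetical internal server network

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

if __name__ == "__main__":
    # Let the kernel forward packets between the two vlans.
    run(["sysctl", "-w", "net.ipv4.ip_forward=1"])
    # Rewrite the source address of internal traffic leaving via the
    # external interface (standard Linux masquerading).
    run(["iptables", "-t", "nat", "-A", "POSTROUTING",
         "-s", INTERNAL_NET, "-o", EXTERNAL_IF, "-j", "MASQUERADE"])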
Layer 4 load balancing. Currently, load balancing between the squid boxes happens through multiple DNS A records, and this clearly isn't optimal. A true load balancer would be a lot better. There are layer 4 switches that support load balancing to some extent, but these are generally VERY expensive, $10,000 and up. A cheaper, and probably more flexible, alternative is a setup using multiple redundant Linux LVS (Linux Virtual Server) boxes. Hashar and a friend of his who has experience using LVS for large clusters are preparing a presentation on this.
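To make the LVS option a bit more concrete, here is a minimal sketch of what configuring such a director could look like, using the standard ipvsadm tool with weighted round-robin and direct routing. The addresses and weights are invented placeholders; this is not the setup Hashar is preparing, just an illustration of the idea.

#!/usr/bin/env python
# Sketch of a Linux Virtual Server (LVS) director for the squid tier:
# one virtual service address, weighted round-robin across the real
# squid boxes. Addresses, weights and the direct-routing choice are
# placeholders, not the setup Hashar is preparing.
import subprocess

VIP = "198.51.100.10:80"   # hypothetical virtual (public) service address
SQUIDS = {                 # hypothetical real servers and their weights
    "10.0.0.11:80": 100,
    "10.0.0.12:80": 100,
    "10.0.0.13:80": 50,    # slower box gets a lower weight
}

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

if __name__ == "__main__":
    # Define the virtual service with weighted round-robin scheduling.
    run(["ipvsadm", "-A", "-t", VIP, "-s", "wrr"])
    # Add each squid as a real server, using direct routing (-g).
    for server, weight in sorted(SQUIDS.items()):
        run(["ipvsadm", "-a", "-t", VIP, "-r", server, "-g", "-w", str(weight)])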
Firewalling. I personally don't think we really need this, especially not once all fundamentally internal servers are on an internal vlan, but some think it could be useful to have a central firewall, and a layer 3/4 switch could do this.
Personally, I think it would be good to build the wikimedia network on proper layer 2 switches that can support switching in a large network with decent manageability. Layer 3 and up (routing/NAT, load balancing, firewalling if needed) we can do using a redundant LVS cluster, which Hashar is working on. This has the benefit that we don't have to spend excessive amounts of money on proprietary hardware, and we still get a very flexible and cost effective solution, with as much free software involved as reasonably possible.
Also important, I think, is that we choose a vendor that can offer us a full range of products, from low end to as high end as we will ever need. When you have a large network, it's not feasible to work with many different interfaces, command sets, features and terminology on each switch; you'd rather have them reasonably consistent across different switches and product ranges.
We could build the network out of a nice, decent core switch (possibly two for redundancy), and multiple, relatively cheap access switches to connect servers (for example, Cisco 2948G-GE-TX). Alternatively, we could build a large virtual switch by stacking multiple smaller ones (for example, Cisco 3750s), but we might run into a stacking limit there, and these switches are generally quite a bit more expensive than non-stackable ones. We could even build the entire network out of just one very big, and very expensive, modular switch with hundreds of ports, but that would be very hard to make redundant (although these switches are pretty redundant in themselves...), and it also involves a big initial investment.
Redundancy is something we need to think about. Of course we can buy one big and expensive switch, but what if it breaks? With multiple cheaper switches, it's more feasible to keep one or two as spares.
Some groups of servers have stronger low-latency and high-bandwidth dependencies between them than others. This obviously needs to be taken into account while designing the network.
This needs more discussion...
I mentioned Cisco examples here, but that's only because I personally have experience with them, they have a whole line of product ranges, and their prices are readily available. Of course, many other good switch vendors exist (Foundry, HP, Extreme, Nortel, etc...), and many could provide us with the equivalent products we need. We have to look at alternatives as well...
It would be especially helpful if someone could get one of the major network hardware vendors to donate network hardware to us, but I think that if that were to happen, it would have to be a long-term donor/partner relationship, and not just a single donation. We can't build a large and consistent network out of single, uncoordinated donations.
Comments, please!
(we could transfer this to a wiki if that's helpful...)
On Sun, 19 Dec 2004, Mark Bergsma wrote:
It's quite evident that wikimedia's current network is a mess.
It's always qualified as "a mess". Every network ends up looking like Wiki's at some point. No matter how well planned, documented, or managed, the wiring closet will eventually look like a hurricane of cobwebs.
... However, no one seems to know what's on which port.
$79 switch or $100,000 switch... if no one documents what's plugged into which port (and where all the cables go), you won't know what's plugged into which port. Comparing MAC addresses every time you need to know where something is attached is very time consuming and error prone. *Maintain* the documentation. That's pretty easy as there's only one monkey movin' cables.
... The interfaces to manage them seem rather limited (I don't know - I don't have access to them)
Having never used the interface (it's web based), you *really* have no room to talk about it. It works and allows people to do what they need to do. No, the switches are not "network managed" -- web interfaces alone don't count. However, they are cheap and get the job done.
... It's quite clear that nobody really likes these switches, and everyone would like to buy other, better ones now that we've run out of switch ports again.
Really? I've not seen anyone complaining about them. (I've not been sitting on IRC for awhile, tho') We're always going to run out of switch ports. *I* run out of ports in my own living room -- 'tho a $5,000 48 port 10/100/1000 managed switch would be nice, I'll stick with the $100 8 port ones from Linksys/Netgear/D-link.
... This way, admins are restricted to telling him what to do, whenever he has time for it and is on location. This really delays and complicates things, so I think it would be good to make sure we build a network that makes REMOTE management as easy and flexible as possible, and keep required physical changes to a minimum.
Remote management doesn't really help here. Once a machine has been installed (and thus cabled to the required networks), rarely does one need any cables moved. The ability to virtually redesign the network is "neat", but ultimately more costly than useful. And considering the largest switches wiki is likely to ever afford are 48 port models, there will always be an issue of machines connected to different switches. (trunking adds latency.)
This does require switches that are more expensive than the current ones, and it is rather hard to justify the cost for them.
And that's exactly why Wiki doesn't have a pair of Cisco 6513's. They are unnecessary.
However, with every new server and extra switch, remote manageability is getting harder, and consequently, the network is becoming a mess.
Remote manageability doesn't have anything to do with the mess. The mess is 100% human related -- near-constant, rapid, semi-haphazard planning. (Spiders tear down their web every day and build a new one. Networks are more like cities, where the new one is built into/over the old.)
... It's barely documented, and they can't find out through the switch's management interfaces either.
And whose fault is the documentation? That's right, us monkeys moving cables without updating documentation. (we're all guilty of not labeling and documenting what we did in the closet.) And as you've not used the management interface, you don't know if it can show the MACs known on each port. (Which comes back to people... do the admins know they can match up MAC addresses to tell what's on/behind each port?)
- serial ports, so we can manage them out of band through a console server even if the network is down
Ah yes... one more set of ports to run out of. (which, btw, we are.)
- SNMP, so we can properly graph statistics of the switch itself, and the individual ports. VERY helpful in case of problems...
SNMP... always a good thing, but it's not enough of a justification for the cost.
- spanning tree (STP), especially helpful in large networks with remote management
Do you even know what STP is and what it does? Yes, it's helpful in *physically* large networks where you may not know you've created a loop. If you cannot avoid creating loops in a network spanning two racks without spanning-tree, you need to put down the crimp tool, and go home; you're done.
You appear to be confusing "remote management" with "link redundancy." Spanning-tree allows one to have multiple physical paths between switches while maintaining a single logical path between any two points. In effect, STP breaks loops: A connected to B connected to C connected back to A. And it allows for parallel links: A connected to B via 2 or more ports. STP does this by *blocking* one or more ports.
STP does not speed up the network by load sharing on parallel links. (Cisco calls that Etherchannel.) STP does not reduce latency. It's not OSPF for layer 2. STP is engineered to prevent loops; it does not find the best path between any two points -- 'tho one can spend months tuning port costs to drive STP to prefer specific links (I wouldn't do that unless I was really bored and tired of playing video games :-))
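As a toy illustration of that blocking behaviour, the sketch below picks a loop-free subset of links from a small made-up topology and marks the rest as blocked. It is a plain breadth-first search from an arbitrary root, not the real STP algorithm (which elects a root bridge by bridge ID and exchanges BPDUs per port), but it shows the effect: redundant links exist physically, yet only one logical path is left forwarding.

# Toy illustration of the *effect* of spanning tree: given switches with
# redundant links, keep a loop-free subset forwarding and "block" the rest.
# This is a plain breadth-first search from an arbitrary root, not the real
# STP algorithm (which elects a root bridge and exchanges BPDUs per port).
from collections import deque

# Hypothetical topology: A-B-C-A form a loop, plus a parallel A-B link.
links = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "A")]

def spanning_tree(links, root="A"):
    forwarding, blocked = [], []
    reached = {root}
    queue = deque([root])
    remaining = list(links)
    while queue:
        node = queue.popleft()
        for link in list(remaining):
            if node not in link:
                continue
            other = link[1] if node == link[0] else link[0]
            if other in reached:
                blocked.append(link)      # would close a loop: block it
            else:
                forwarding.append(link)   # extends the tree: keep forwarding
                reached.add(other)
                queue.append(other)
            remaining.remove(link)
    return forwarding, blocked

if __name__ == "__main__":
    forwarding, blocked = spanning_tree(links)
    print("forwarding:", forwarding)   # two links: a loop-free tree over A, B, C
    print("blocked:   ", blocked)      # the parallel A-B link and one loop link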
- VLANs and 802.1Q support
The current switches are vlan capable.
- Diagnostic information from the switch's console - port descriptions, port statistics, port status, mac address information, vlan assignments, error rates, etc
All (or almost all) of it is available from the web management interface.
- Syslog logging, so we notice what's going on
Such as? Switches generate almost no syslog traffic -- even when the network is coming apart. Very nasty things have to be happening for the switch to begin complaining. (aside from link up/down messages which are of some, but limited, value.)
- centralized administration, so we don't have to manually copy everything to each and every switch
??? Each switch is independent. They get configured by someone typing in the config. Switches don't automatically clone themselves because they're next to another switch *grin* What you are suggesting is a management system *cough*Ciscoworks*cough* that generally costs as much as the switches it manages. As a network admin who's managed networks with dozens of cisco switches, I can tell you no one uses the management system to manage the switches; we use the cli -- I even have scripts to send the same commands to a list of switches (and routers, and netblazers, anything with a telnet interface.) (And yes, I have shell scripts for SNMP archival of the running configs.)
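For illustration, a minimal sketch of that kind of "same commands to a list of switches" script. The hostnames, password, prompt and commands are made up, and real gear needs per-platform prompt handling; it uses the standard-library telnetlib (which has since been removed in Python 3.13).

#!/usr/bin/env python
# Sketch of the "same commands to a list of switches" idea: push one set of
# CLI commands to every device over telnet and collect the output.
# Hostnames, the password and the prompt are made up, and real gear needs
# per-platform prompt handling. Uses the standard-library telnetlib, which
# has since been removed in Python 3.13.
import telnetlib

SWITCHES = ["sw1.example.org", "sw2.example.org"]   # hypothetical hosts
PASSWORD = "change-me"                              # hypothetical password
COMMANDS = ["show version", "show interfaces status"]

def run_commands(host, password, commands, timeout=10):
    tn = telnetlib.Telnet(host, 23, timeout)
    tn.read_until(b"Password: ", timeout)
    tn.write(password.encode("ascii") + b"\n")
    for command in commands:
        tn.write(command.encode("ascii") + b"\n")
    tn.write(b"exit\n")
    return tn.read_all().decode("ascii", "replace")

if __name__ == "__main__":
    for switch in SWITCHES:
        print("=== %s ===" % switch)
        print(run_commands(switch, PASSWORD, COMMANDS))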
- upgradeable firmware with long term support
Anything with a management interface is upgradable. And define "long term"? I see EOL/EOS notices from Cisco all the time. As I recall, both the 4000 and 6000 lines are EOL. The 5000/5500 line is nearing the end of hardware and software support -- having been EOL and EOS years ago.
- Port trunking/aggregation, for high bandwidth or redundancy needs
Note: this is almost *always* proprietary and restrictive. Yes, there are standards for this sort of thing, but cross-vendor trunking is still muddy water. And for some *cough*Cabletron*cough*, the parallel links must be nearly the same distance (within 1m as I recall) -- yep, it's that damned sensitive.
- IGMP/multicast support, could be helpful on a large network too
multicast is 99.999% useless to Wikimedia.
Wikimedia also needs SOME layer 3 and layer 4 features, but these are less important, and generally MUCH more expensive, so I don't think we can really justify doing this using switch hardware:
...
We could build the network out of a nice, decent core switch (possibly two for redundancy), and multiple, relatively cheap access switches to connect servers (for example, Cisco 2948G-GE-TX).
"Relatively cheap"? 100$ is a cheap switch. 400$ is a relatively cheap switch. 5000$ is not cheap - period. Yes, a 2948 is cheap_er_ than a fully loaded 6500, but it's still not cheap.
What are the functional differences between the 2948 and the existing netgear switches... 48 vs 24 ports. And that's about it. Both have remote management via web interfaces. The cisco also has a cli (telnet, ssh, and serial.) The cisco has SNMP manageability and syslog capability, which is nice, but not necessary and not worth the extra 4000$.
could build a large virtual switch by stacking multiple smaller ones
We're already "stacking" switches. That stacking is not what you think it is -- what it used to be: a backplane extension. Today's stack is merely a set of switches hung off another switch -- which is cascading. The 3000 series stack you point out is, in fact, a cisco proprietary firewire interlink of the switches. (thus locking in cisco hardware for ever more.)
Redundancy is something we need to think about. Of course we can buy one big and expensive switch, but what if it breaks? With multiple cheaper switches, it's more feasible to keep one or two as spares.
If you buy a "big ass switch", you buy a "big ass support contract" to go with it. It breaks; they fix it, *quickly*. (I've never seen an entire catalyst switch die. I've seen or heard of every component short of the backplane failing... and everything except the backplane is easily and quickly replaceable.)
This needs more discussion...
this gets discussed all the time...
Large vendors (Cisco, et al.) are much more likely to donate gear to tax-deductible charities. Wiki isn't one, yet. I have suggested talking to Cisco about getting some hardware donated -- cisco has a lot of reclaimed hardware (from trade-ups) and refurbished goodies. I don't think anyone would balk at paying for a support contract (2-3k$) for a donated 100k$ switch. It's a good tradeoff.
--Ricky
Ricky Beam wrote:
It's quite evident that wikimedia's current network is a mess.
It's always qualified as "a mess". Every network ends up looking like Wiki's at some point. No matter how well planned, documented, or managed, the wiring closet will eventually look like a hurricane of cobwebs.
Oh? My networks must be exceptions then.
... However, no one seems to know what's on which port.
$79 switch or $100,000 switch... if no one documents what's plugged into which port (and where all the cables go), you won't know what's plugged into which port. Comparing MAC addresses every time you need to know where something is attached is very time consuming and error prone. *Maintain* the documentation. That's pretty easy as there's only one monkey movin' cables.
Agreed. However, certain switch features make documenting these things a lot easier, and therefore increase the likelihood of the responsible persons maintaining it.
... The interfaces to manage them seem rather limited (I don't know - I don't have access to them)
Having never used the interface (it's web based), you *really* have no room to talk about it. It works and allows people to do what they need to do. No, the switches are not "network managed" -- web interfaces alone don't count. However, they are cheap and get the job done.
Isn't that exactly what I stated?
Really? I've not seen anyone complaining about them. (I've not been sitting on IRC for awhile, tho')
Perhaps you should... I have.
We're always going to run out of switch ports. *I* run out of ports in my own living room -- 'tho a $5,000 48 port 10/100/1000 managed switch would be nice, I'll stick with the $100 8 port ones from Linksys/Netgear/D-link.
Yes, but you are not wikimedia, and you don't have tens of servers in your bedroom that you then *have* to manage from your living room without being able to access them physically.
Remote management doesn't really help here. Once a machine has been installed (and thus cabled to the required networks), rarely does one need any cables moved. The ability to virtually redesign the network is "neat", but ultimately more costly than useful. And considering the largest switches wiki is likely to ever afford are 48 port models, there will always be an issue of machines connected to different switches. (trunking adds latency.)
Indeed. That's why it's part of the discussion.
And I certainly think vlans are a lot more useful than they are costly.
This does require switches that are more expensive than the current ones, and it is rather hard to justify the cost for them.
And that's exactly why Wiki doesn't have a pair of Cisco 6513's. They are unnecessary.
No one ever seriously talked about buying a 6513. We are talking about 2948s, which can do a lot more and are only a bit more expensive than the dumb netgears that only you seem to like.
However, with every new server and extra switch, remote manageability is getting harder, and consequently, the network is becoming a mess.
Remote manageability doesn't have anything to do with the mess. The mess is 100% human related -- near-constant, rapid, semi-haphazard planning. (Spiders tear down their web every day and build a new one. Networks are more like cities, where the new one is built into/over the old.)
Exactly, that's exactly how it is now. Why not try to improve it? It clearly isn't working...
And whose fault is the documentation? That's right, us monkeys moving cables without updating documentation. (we're all guilty of not labeling and documenting what we did in the closet.) And as you've not used the management interface, you don't know if it can show the MACs known on each port. (Which comes back to people... do the admins know they can match up MAC addresses to tell what's on/behind each port?)
I beg your pardon. I may not have access myself, but I have talked to people who do (jeronim, kate, etc) a lot.
- serial ports, so we can manage them out of band through a console server even if the network is down
Ah yes... one more set of ports to run out of. (which, btw, we are.)
Indeed. Why not change THAT instead?!
SNMP... always a good thing, but it's not enough of a justification for the cost.
Not by itself, no. Together with the rest, yes.
- spanning tree (STP), especially helpful in large networks with remote management
Do you even know what STP is and what it does? Yes, it's helpful in *physically* large networks where you may not know you've created a loop. If you cannot avoid creating loops in a network spanning two racks without spanning-tree, you need to put down the crimp tool, and go home; you're done.
[snip]
You seem to underestimate me by a fair bit.
Spanning-tree is, aside from redundancy, /also/ useful because you can change parameters on one uplink while working off the other. A serial console is better, of course...
- VLANs and 802.1Q support
The current switches are vlan capable.
I never said they weren't. But we can't transition to using vlans remotely in a safe way.
- Diagnostic information from the switch's console - port descriptions, port statistics, port status, mac address information, vlan assignments, error rates, etc
All (or almost all) of it is available from the web management interface.
Then I wonder why you didn't discover the faulty network setup with the trunked links some time ago. Surely it was all available from the web interface?
- Syslog logging, so we notice what's going on
Such as? Switches generate almost no syslog traffic -- even when the network is coming apart. Very nasty things have to be happening for the switch to begin complaining. (aside from link up/down messages which are of some, but limited, value.)
Well I have solved quite a few nasty network problems just by looking at syslog messages from switches, but what do I know...
- centralized administration, so we don't have to manually copy everything to each and every switch
??? Each switch is independent. They get configured by someone typing in the config. Switches don't automatically clone themselves because they're next to another switch *grin* What you are suggesting is a management system *cough*Ciscoworks*cough* that generally costs as much as the switches it manages. As a network admin who's managed networks with dozens of cisco switches, I can tell you no one uses the management system to manage the switches; we use the cli -- I even have scripts to send the same commands to a list of switches (and routers, and netblazers, anything with a telnet interface.) (And yes, I have shell scripts for SNMP archival of the running configs.)
So you don't replicate things like accounts over RADIUS, vlan configs over VTP, etc? Your loss... I wasn't talking about Ciscoworks at all.
But, as I understand it, you're going to write the scripts to do this over the web interfaces of the current switches? :) They don't have SNMP, and they don't have a CLI either.
- upgradeable firmware with long term support
Anything with a management interface is upgradable. And define "long term"? I see EOL/EOS notices from Cisco all the time. As I recall, both the 4000 and 6000 lines are EOL. The 5000/5500 line is nearing the end of hardware and software support -- having been EOL and EOS years ago.
I never said it'd have to be Cisco.
- Port trunking/aggregation, for high bandwidth or redundancy needs
Note: this is almost *always* proprietary and restrictive. Yes, there are standards for this sort of thing, but cross-vendor trunking is still muddy water. And for some *cough*Cabletron*cough*, the parallel links must be nearly the same distance (within 1m as I recall) -- yep, it's that damned sensitive.
Which is just one of the reasons why I think we should stick with one vendor. Yes, I too would prefer open standards and compatibility, but that's just not where we are today.
And wikipedia *did* use trunked ports.
- IGMP/multicast support, could be helpful on a large network too
multicast is 99.999% useless to Wikimedia.
Almost, yes. Ganglia works over multicast, IIRC. But I agree, it's almost useless, and therefore mentioned last.
"Relatively cheap"? 100$ is a cheap switch. 400$ is a relatively cheap switch. 5000$ is not cheap - period. Yes, a 2948 is cheap_er_ than a fully loaded 6500, but it's still not cheap.
A 2948G-GE-TX is approx. $3,500, while 2 netgears cost ~ $1,500.
What are the functional differences between the 2948 and the existing netgear switches... 48 vs 24 ports. And that's about it. Both have remote management via web interfaces. The cisco also has a cli (telnet, ssh, and serial.) The cisco has SNMP manageability and syslog capability, which is nice, but not necessary and not worth the extra 4000$.
Extra $2,000, which makes it about twice as expensive. And yes, I certainly think that's worth it.
We're already "stacking" switches. That stacking is not what you think it is -- what it used to be: a backplane extension. Today's stack is merely a set of switches hung off another switch -- which is cascading.
I know the difference between cascading and stacking, thank you...
The 3000 series stack you point out is, in fact, a cisco proprietary firewire interlink of the switches. (thus locking in cisco hardware for ever more.)
Of course it's proprietary, but why is that a problem? It allows us to grow the "switch" along with the growing need for switch ports. And we can still cascade other vendor's switches, just like we do now.
If you buy a "big ass switch", you buy a "big ass support contract" to go with it. It breaks; they fix it, *quickly*. (I've never seen an entire catalyst switch die. I've seen or heard of every component short of the backplane failing... and everything except the backplane is easily and quickly replaceable.)
Fine with me, and that's why it's a listed option. I don't think it'll be chosen, however...
This needs more discussion...
this gets discussed all the time...
Yes, but nothing ever seems to happen. And that's what I'm trying to change now, by request.
Ricky Beam wrote:
On Sun, 19 Dec 2004, Mark Bergsma wrote:
It's quite evident that wikimedia's current network is a mess.
It's always qualified as "a mess". Every network ends up looking like Wiki's at some point. No matter how well planned, documented, or managed, the wiring closet will eventually look like a hurricane of cobwebs.
<snip>
Hello Ricky,
I do not really agree on this point. The company I worked for had a team dedicated to cable planning and installation.
* Different colors used for cables (IIRC: green for switch uplinks, red for consoles, white for the internal network, yellow for outside).
* Cables attached 8 by 8 on the border of the racks.
* Every single cable had a unique number.
The room had something like 80 servers on one side, about 50 more on another side, plus about 12 TNTs.
The best thing was (is) that we had a photo of each rack (front and back).
cheers,
Ricky Beam wrote:
Comparing MAC addresses every time you need to know where something is attached is very time consuming and error prone. *Maintain* the documentation. That's pretty easy as there's only one monkey movin' cables.
Well, one problem we have is that the monkey (me) gets called out of the country with some increasing regularity. Another problem we have is that another monkey (Aaron, at the colo) is going to untangle the current mess of wires and neatly tie everything into the rack properly, but is waiting at the moment for us to decide about a switch solution.
I certainly agree that maintaining documentation is critical here, but at the same time I think it's pretty important that we are able to do this maintenance *remotely*. Certainly, I don't think anyone is advising that we "compare MAC addresses everytime we need to know" -- rather it's just that for debugging/troubleshooting/oddities, it will be great for us to have the *ability* to figure out which mac address is plugged in where.
Really? I've not seen anyone complaining about them. (I've not been sitting on IRC for awhile, tho') We're always going to run out of switch ports. *I* run out of ports in my own living room -- 'tho a $5,000 48 port 10/100/1000 managed switch would be nice, I'll stick with the $100 8 port ones from Linksys/Netgear/D-link.
We're looking at a cost differential for 48 ports of roughly $1,500 versus $3,500. Even if we grow to 10x our current needs (480 ports added, let's say), the total cost differential would be "only" $20,000.
That's a large expense, to be sure, but not in the context of what our overall costs will be in such an environment.
Remote manageability doesn't have anything to do with the mess. The mess is 100% human related -- near-constant, rapid, semi-haphazard planning. (Spiders tear down their web every day and build a new one. Networks are more like cities, where the new one is built into/over the old.)
Yes, I agree with this -- near-constant, rapid, semi-haphazard planning. But at this point, it isn't clear that we have a *lot* of choice about that.
Additionally, it occurs to me that this current discussion should be viewed - in part - as an attempt to avoid haphazardness. The easy way forward is the blind way forward: "Oh well, out of ports, buy another couple of cheap switches". What I'm hearing from people, though, is that buying more capable switches will make it easier to do things rationally going forward.
And as you've not used the management interface, you don't know if it can show the MACs known on each port. (Which comes back to people... do the admins know they can match up MAC addresses to tell what's on/behind each port?)
My understanding is that the admins know that they *can't* do this with the current switches. They could be mistaken. Can we confirm this?
It seems that there are a lot of "would be nice" features, but that the real "killer" here is the ability for the switch to tell us which mac address is plugged in where, so that we can reconfigure vlans. (And yes, I say this knowing full well that there's an alternative, which is for Monkey Jimbo to go over there and document everything and do all the rewiring himself and instruct the colo never to touch anything. But realistically having an option that doesn't involve me personally as a bottleneck is always a good idea...)
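For what it's worth, on a switch that exposes the standard BRIDGE-MIB over SNMP, "which MAC is behind which port" is one table walk. A hedged sketch follows; the hostname and community string are placeholders, and whether a particular switch actually implements this table varies by vendor and firmware.

#!/usr/bin/env python
# Sketch: ask a switch which MAC address was learned on which bridge port,
# via the standard BRIDGE-MIB forwarding table (dot1dTpFdbPort).
# The hostname and community string are placeholders, and whether a given
# switch actually exposes this table depends on the vendor and firmware.
import subprocess

SWITCH = "switch1.example.org"   # hypothetical switch hostname
COMMUNITY = "public"             # hypothetical read-only community
DOT1D_TP_FDB_PORT = "1.3.6.1.2.1.17.4.3.1.2"

def mac_to_port(host, community):
    """Return a dict mapping 'aa:bb:cc:dd:ee:ff' -> bridge port number."""
    out = subprocess.check_output([
        "snmpwalk", "-v2c", "-c", community, "-On", host, DOT1D_TP_FDB_PORT,
    ]).decode("ascii", "replace")
    table = {}
    for line in out.splitlines():
        # Lines look like: .1.3.6.1.2.1.17.4.3.1.2.0.13.10.1.2.3 = INTEGER: 12
        # where the last six OID components are the MAC address octets.
        oid, _, value = line.partition(" = INTEGER: ")
        if not value:
            continue
        octets = oid.strip().lstrip(".").split(".")[-6:]
        table[":".join("%02x" % int(o) for o in octets)] = int(value)
    return table

if __name__ == "__main__":
    for mac, port in sorted(mac_to_port(SWITCH, COMMUNITY).items()):
        print("%s is behind port %d" % (mac, port))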
Large vendors (Cisco, et al.) are much more likely to donate gear to tax-deductible charities. Wiki isn't one, yet. I have suggested talking to Cisco about getting some hardware donated -- cisco has a lot of reclaimed hardware (from trade-ups) and refurbished goodies. I don't think anyone would balk at paying for a support contract (2-3k$) for a donated 100k$ switch. It's a good tradeoff.
I have no objection to us trying to get gear donated to us. This will be easier after the 501(c)(3) is confirmed by the IRS. But even then, it will take time.
We need more ports *now*, so we should go with a sensible solution that has a reasonable forward path.
--Jimbo
On Tue, 21 Dec 2004, Jimmy (Jimbo) Wales wrote:
Ricky Beam wrote:
Comparing MAC addresses every time you need to know where something is attached is very time consuming and error prone. *Maintain* the documentation. That's pretty easy as there's only one monkey movin' cables.
Well, one problem we have is that the monkey (me) gets called out of the country with some increasing regularity. Another problem we have is that another monkey (Aaron, at the colo) is going to untangle the current mess of wires and neatly tie everything into the rack properly, but is waiting at the moment for us to decide about a switch solution.
I certainly agree that maintaining documentation is critical here, but at the same time I think it's pretty important that we are able to do this maintenance *remotely*. Certainly, I don't think anyone is advising that we "compare MAC addresses everytime we need to know" -- rather it's just that for debugging/troubleshooting/oddities, it will be great for us to have the *ability* to figure out which mac address is plugged in where.
It's not my intent to shame Jimbo (and any other wire pluggin' monkeys.) What we're dancing around here is called "Change Management" in the Enterprise. Having been away from the "Enterprise" for a little over a year now, I've blocked out the nightmare of procedures and paperwork, but they exist for a very real reason... at some point, all the cooks in the kitchen need to know where everything is. Wiki passed that point some time ago.
It's very simple and easy to record what's plugged in where at the time it's plugged in. After the fact, chasing cables is No Fun (tm). I've worked in the Telco/ISP world long enough to have a degree in Not Fun -- even in regulated spaces, the cabling standards are not always followed (the tracer line doesn't help if it isn't plugged in, or no one ran power to the DSX panel, or some [censored] monkey steals the fuse for that panel. Those two who've replied w.r.t. their "exception" networks don't have the breadth of experience that they may think they do.)
As for virtual network redesign... I would like to go on record saying that it's a Bad Idea (tm), as it's very easy to break the network in ways one cannot undo remotely. (see also: "haphazard network growth") It takes a great deal of care, patience, and planning to execute without incident. Sitting here in my apartment ("warehouse"), I'm several hundred miles from the colo (and not on the approved list.) If I were to break something, it could take hours to get someone there to fix it. (It'd take about 13hrs for me to drive to the colo :-)) And let's face it, remotely, there's never just *one* person doing something.
Granted, I've done this sort of thing. However, I've always done so with extensive planning and people on-site available to undo what I'm about to do -- i.e. power cycle the hardware. And yes, even with planning, I've broken the internet more than once -- 'tho never for very long, unlike some of my former coworkers who once left the ISP's network split in half (read: "fux0r3d up") for ~4hrs waiting for the techs to show up @ 7am and reset the router... they didn't call the on-call tech, they didn't dial into the POP, nor did they bother to login to our side of any of the uplinks (ssh access was not filtered for just such emergencies.)
We're looking at a cost differential for 48 ports of roughly $1,500 versus $3,500. Even if we grow to 10x our current needs (480 ports added, let's say), the total cost differential would be "only" $20,000.
Wiki will need real Enterprise hardware long before that. In my opinion, at around 100 ports, it's time to stop hanging things off 79$ switches. The more switches in the mess, the greater the odds one of them will fail. I've seen about one 3com 10/100 hub fail every 2 months in the Enterprise Desktop world -- 20-30 per closet across 9 floors of the office building. (And I've seen one Acton hub *physically* catch fire. One of the Charlotte switch techs drove it back to Raleigh for me. It sat like a trophy outside my desk for months. One of the PC Techs took it when I left.)
I don't think Wiki wants to look like the mom-n-pop ISP in Mooresville, NC, I was told about by a USR Sales Engineer many years ago... Imagine, if you will, a garage with over 100 USR Sportster desktop modems layered on 4x8 ft window screens stacked 7-8 layers high with a 4ft diameter, roof mount attic fan on top of them. I wish Jim had taken a picture of that rig :-) It beats Interpath's (ok, it was called "Global Radio" at the time) "Rabbit Cage" circa 1993 -- a sheet metal frame to hold the circuit boards from 8 Microcom ES desktop modems running to a Mac II(?). When I started there in late '95, Charley showed me the Rabbit Cage -- "where we started". That had me laughing all week. (Yeah, I showed USR the cage in late '97; he was equally amused. I think Charley still has that thing.)
Yes, I agree with this -- near-constant, rapid, semi-haphazard planning. But at this point, it isn't clear that we have a *lot* of choice about that.
Nope, that soup is already on the floor. And given the rate of growth, planning an expandable network to last more than 6 months will be difficult. And wiki will outgrow the current mess about twice before the planning would be done. I'm talking about several months of planning with several more of begging vendors for hardware. Wiki will be 501(c)3 by then.
Additionally, it occurs to me that this current discussion should be viewed - in part - as an attempt to avoid haphazardness. The easy way forward is the blind way forward: "Oh well, out of ports, buy another couple of cheap switches". What I'm hearing from people, though, is that buying more capable switches will make it easier to do things rationally going forward.
At this point, a few more cheap switches is the best-cost option. When one needs ports *now*, it's too late to invest time in proper planning. Remember the 6P thing... Prior Planning Prevents Piss Poor Performance. (I've not heard that since high-school :-))
The need/requirement for gigE ports to most/all machines complicates matters by greatly increasing costs and limiting the market. I've built medium sized (~100 port) networks for about the cost of one 2948G. (also built an ISP POP for under 100k$, got a chunk of marble for saving the company a bank worth of money.) But that's a 10/100 network with only 3 gigE ports.
(And yes, I say this knowing full well that there's an alternative, which is for Monkey Jimbo to go over there and document everything and do all the rewiring himself and instruct the colo never to touch anything. But realistically having an option that doesn't involve me personally as a bottleneck is always a good idea...)
Just document the wiring when the wire is installed. How many people sit on IRC while you're at the colo? It's not hard to tell "us" what you've plugged in where, from which one of "us" can update the docs (both those in wiki and on the machines.) Unfortunately, this is part of the soup that's on the floor. It wasn't recorded when it was plugged in, so now, someone will have to physically trace the cables. (Have I mentioned how much I hate doing that?) There's no escaping it, it would appear.
--Ricky
Ricky Beam wrote:
Just document the wiring when the wire is installed. How many people sit on IRC while you're at the colo? It's not hard to tell "us" what you've plugged in where, from which one of "us" can update the docs (both those in wiki and on the machines.) Unfortunately, this is part of the soup that's on the floor. It wasn't recorded when it was plugged in, so now, someone will have to physically trace the cables. (Have I mentioned how much I hate doing that?) There's no escaping it, it would appear.
Oh, it isn't as awful as all of that: most of it is done: http://wp.wikidev.net/Switches
This is approximately correct.
It's the "approximately" part that's the real bother, of course.
--Jimbo