On Sun, 19 Dec 2004, Mark Bergsma wrote:
It's quite evident that wikimedia's current network is a mess.
It's always qualified as "a mess". Every network ends up looking like Wiki's at some point. No matter how well planned, documented, or managed, the wiring closet will eventually look like a hurricane of cobwebs.
... However, noone seems to know what's on which ports.
$79 switch or $100,000 switch... if no one documents what's plugged into which port (and where all the cables go), nobody will ever know what's plugged into which port. Comparing MAC addresses every time you need to know where something is attached is very time consuming and error prone. *Maintain* the documentation. That's pretty easy as there's only one monkey movin' cables.
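A low-tech fix that costs nothing: one plain-text map per switch, updated as part of plugging the cable in. Something like this (all names invented for illustration) answers "what's on port 12" faster than any management interface:

```
# switch1 (Netgear 24-port, rack A) -- update on EVERY cable move
# port  device     iface  note
1       apache1    eth0
2       apache2    eth0
12      db1        eth1   backup vlan
24      uplink     -      to switch2 port 24
```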
... The interfaces to manage them seem rather limited (I don't know - I don't have access to them)
Having never used the interface (it's web based), you *really* have no room to talk about it. It works and allows people to do what they need to do. No, the switches are not "network managed" -- web interfaces alone don't count. However, they are cheap and get the job done.
... It's quite clear that nobody really likes these switches, and would like to buy other, better ones now we've run out of switch ports again.
Really? I've not seen anyone complaining about them. (I've not been sitting on IRC for a while, tho') We're always going to run out of switch ports. *I* run out of ports in my own living room -- 'tho a $5,000 48-port 10/100/1000 managed switch would be nice, I'll stick with the $100 8-port ones from Linksys/Netgear/D-Link.
... This way, admins are restricted to telling him what to do, whenever he has time for it and is on location. This really delays and complicates things, so I think it would be good to make sure we build a network makes REMOTE management as easy and flexible as possible, and keep required physical changes to a minimum.
Remote management doesn't really help here. Once a machine has been installed (and thus cabled to the required networks), rarely does one need any cables moved. The ability to virtually redesign the network is "neat", but ultimately more costly than useful. And considering the largest switches wiki is likely to ever afford are 48-port models, there will always be an issue of machines connected to different switches. (Trunking adds latency.)
This does require switches that are more expensive than the current ones, and it is rather hard to justify the cost for them.
And that's exactly why Wiki doesn't have a pair of Cisco 6513's. They are unnecessary.
However, with every new server and extra switch, remote manageability is getting harder, and consequently, the network is getting a mess.
Remote manageability doesn't have anything to do with the mess. The mess is 100% human related -- near-constant, rapid, semi-haphazard planning. (Spiders tear down their web every day and build a new one. Networks are more like cities, where the new one is built into/over the old.)
... It's barely documented, and they can't find out through the switch's management interfaces either.
And whose fault is the documentation? That's right, us monkeys moving cables without updating documentation. (We're all guilty of not labeling and documenting what we did in the closet.) And as you've not used the management interface, you don't know whether it can show the MACs known on each port. (Which comes back to people... do the admins know they can match up MAC addresses to tell what's on/behind each port?)
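For what it's worth, any SNMP-capable switch exposes exactly this via BRIDGE-MIB: walking dot1dTpFdbPort (.1.3.6.1.2.1.17.4.3.1.2) gives one entry per learned MAC, where the last six OID components encode the address and the value is the bridge port that saw it. A rough sketch of turning such a walk into a MAC-to-port map (the sample output line is made up, not from our switches):

```python
def parse_fdb_walk(lines):
    """Build a MAC -> bridge-port map from a dot1dTpFdbPort snmpwalk dump.

    Each OID ends in six decimal octets (the MAC address); the INTEGER
    value is the bridge port that last saw that address.
    """
    table = {}
    for line in lines:
        oid, sep, value = line.partition(" = INTEGER: ")
        if not sep:
            continue  # skip anything that isn't a forwarding-table entry
        octets = oid.rsplit(".", 6)[1:]        # last six OID components
        mac = ":".join("%02x" % int(o) for o in octets)
        table[mac] = int(value)
    return table

# hypothetical snmpwalk output line
dump = ["SNMPv2-SMI::mib-2.17.4.3.1.2.0.12.41.171.205.239 = INTEGER: 3"]
print(parse_fdb_walk(dump))   # {'00:0c:29:ab:cd:ef': 3}
```

Cross-reference that against the servers' own MAC addresses and the documentation problem mostly solves itself.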
- serial ports, so we can manage them out of band through a console
server even if the network is down
Ah yes... one more set of ports to run out of. (which, btw, we are.)
- SNMP, so we can properly graph statistics of the switch itself, and
the individual ports. VERY helpful in case of problems...
SNMP... always a good thing, but it's not enough of a justification for the cost.
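To be fair, graphing is the one thing SNMP buys you cheaply, and the arithmetic the MRTG-style tools do is trivial: sample ifInOctets twice and divide, remembering that the standard Counter32 counters wrap at 2^32. A back-of-envelope sketch (function name is mine, not from any tool):

```python
def counter_rate(prev, curr, interval, width=32):
    """Bits/sec from two ifInOctets samples taken `interval` seconds apart.

    SNMP Counter32 values wrap at 2**32, so a second sample that is
    *smaller* means the counter rolled over, not that traffic reversed.
    """
    delta = curr - prev
    if delta < 0:                 # counter wrapped between samples
        delta += 2 ** width
    return delta * 8 / interval   # octets -> bits, per second

print(counter_rate(1000, 301000, 300))       # 8000.0 bits/sec
print(counter_rate(2**32 - 100, 200, 300))   # wrap handled: 8.0
```

On a busy gigabit port a 32-bit counter can wrap in under a minute, which is why you poll often (or use 64-bit ifHCInOctets where the switch supports it).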
- spanning tree (STP), especially helpful in large networks with remote
management
Do you even know what STP is and what it does? Yes, it's helpful in *physically* large networks where you may not know you've created a loop. If you cannot avoid creating loops in a network spanning two racks without spanning-tree, you need to put down the crimp tool, and go home; you're done.
You appear to be confusing "remote management" with "link redundancy." Spanning-tree allows one to have multiple physical paths between switches while maintaining a single logical path between any two points. In effect, STP breaks loops: A connected to B connected to C connected back to A. And it allows for parallel links: A connected to B via 2 or more ports. STP does this by *blocking* one or more ports.
STP does not speed up the network by load sharing on parallel links. (Cisco calls that EtherChannel.) STP does not reduce latency. It's not OSPF for layer 2. STP is engineered to prevent loops; it does not find the best path between any two points -- 'tho one can spend months tuning port costs to drive STP to prefer specific links (I wouldn't do that unless I was really bored and tired of playing video games :-))
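For the curious, the heart of STP is dead simple: every bridge has an ID (a 16-bit priority prepended to its MAC), the lowest ID wins the root election, and every non-root bridge then blocks all but its cheapest path toward the root. The election itself is just a minimum (sketch with hypothetical switches, not a full BPDU implementation):

```python
def elect_root(bridges):
    """STP root election: the lowest (priority, MAC) bridge ID wins.

    Ties on priority fall through to the numerically lowest MAC, which
    is why an untuned network tends to elect its *oldest* switch root.
    """
    return min(bridges, key=lambda b: (b["priority"], b["mac"]))

bridges = [                       # hypothetical switches
    {"name": "sw-a", "priority": 32768, "mac": "00:e0:1e:aa:00:01"},
    {"name": "sw-b", "priority": 32768, "mac": "00:02:4b:00:00:01"},
    {"name": "sw-c", "priority": 4096,  "mac": "00:d0:ba:00:00:01"},
]
print(elect_root(bridges)["name"])   # sw-c: lowest priority wins outright
```

That's all the "intelligence" you're paying for -- useful in a campus, overkill across two racks.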
- VLANs and 802.1Q support
The current switches are VLAN capable.
- Diagnostic information from the switch's console - port descriptions,
port statistics, port status, mac address information, vlan assignments, error rates, etc
All (or almost all) of that is available from the web management interface.
- Syslog logging, so we notice what's going on
Such as? Switches generate almost no syslog traffic -- even when the network is coming apart. Very nasty things have to be happening for the switch to begin complaining. (aside from link up/down messages which are of some, but limited, value.)
- centralized administration, so we don't have to manually copy
everything to each and every switch
??? Each switch is independent. They get configured by someone typing in the config. Switches don't automatically clone themselves because they're next to another switch *grin* What you are suggesting is a management system *cough*CiscoWorks*cough* that generally costs as much as the switches it manages. As a network admin who's managed networks with dozens of Cisco switches: no one uses the management system to manage the switches; we use the CLI -- I even have scripts to send the same commands to a list of switches (and routers, and NetBlazers, anything with a telnet interface). (And yes, I have shell scripts for SNMP archival of the running configs.)
- upgradeable firmware with long term support
Anything with a management interface is upgradable. And define "long term"? I see EOL/EOS notices from Cisco all the time. As I recall, both the 4000 and 6000 lines are EOL. The 5000/5500 line is nearing the end of hardware and software support -- having been EOL and EOS years ago.
- Port trunking/aggregation, for high bandwidth or redundancy needs
Note: this is almost *always* proprietary and restrictive. Yes, there are standards for this sort of thing, but cross-vendor trunking is still muddy water. And for some *cough*Cabletron*cough*, the parallel links must be nearly the same distance (within 1m as I recall) -- yep, it's that damned sensitive.
- IGMP/multicast support, could be helpful on a large network too
multicast is 99.999% useless to Wikimedia.
Wikimedia also needs SOME layer 3 and layer 4 features, but these are less important, and generally MUCH more expensive, so I don't think we can really justify to do this using switch hardware:
...
We could build the network out of a nice, decent core switch (possibly two for redundancy), and multiple, relatively cheap access switches to connect servers (for example, Cisco 2948G-GE-TX).
"Relatively cheap"? 100$ is a cheap switch. 400$ is a relatively cheap switch. 5000$ is not cheap - period. Yes, a 2948 is cheap_er_ than a fully loaded 6500, but it's still not cheap.
What are the functional differences between the 2948 and the existing Netgear switches... 48 vs. 24 ports. And that's about it. Both have remote management via web interfaces. The Cisco also has a CLI (telnet, ssh, and serial). The Cisco has SNMP manageability and syslog capability, which is nice, but not necessary and not worth the extra $4,000.
could build a large virtual switch by stacking multiple smaller ones
We're already "stacking" switches. That stacking is not what you think it is -- what it used to be: a backplane extension. Today's stack is merely a set of switches hung off another switch -- which is cascading. The 3000 series stack you point out is, in fact, a Cisco proprietary firewire interlink of the switches. (Thus locking in Cisco hardware forevermore.)
Redundancy is something we need to think about. Of course we can buy one big and expensive switch, but what if it breaks? With multiple cheaper switches, it's more feasible to have one or two on spare.
If you buy a "bis ass switch", you buy a "bug ass support contract" to go with it. It breaks; they fix it, *quickly*. (I've never seen an entire catalyst switch die. I've seen or heard of every component short of the backplane failing... and everything except the backplane is easily and quickly replacable.)
This needs more discussion...
this gets discussed all the time...
Large vendors (Cisco, et al.) are much more likely to donate gear to tax-deductible charities. Wiki isn't one, yet. I have suggested talking to Cisco about getting some hardware donated -- Cisco has a lot of reclaimed hardware (from trade-ups) and refurbished goodies. I don't think anyone would balk at paying for a support contract ($2-3k) for a donated $100k switch. It's a good tradeoff.
--Ricky