2“The Janitor Pulled the Plug…” Why was he allowed near the equipment?Why was the problem noticed only afterwards?Why did it take 6 weeks to determine the problem?Why wasn’t there redundant power?Why wasn’t there network redundancy?
3Network Design and Architecture… … is of critical importance… contributes directly to the success of the network… contributes directly to the failure of the network“No amount of magic knobs will save a sloppily designed network”Paul Ferguson—Consulting Engineer,Cisco Systems
4What is a Well-Designed Network? A network that takes into consideration these important factors:Physical infrastructureTopological/protocol hierarchyScaling and RedundancyAddressing aggregation (IGP and BGP)Policy implementation (core/edge)Management/maintenance/operationsCost
5The Three-legged Stool Designing the network with resiliency in mindUsing technology to identify and eliminate single points of failureHaving processes in place to reduce the risk of human errorAll of these elements are necessary, and all interact with each otherOne missing leg results in a stool which will not standDesignTechnologyProcess
6New World vs. Old World Internet/L3 networks DesignNew World vs. Old WorldInternet/L3 networksBuild the redundancy into the systemTelco Voice and L2 networksPut all the redundancy into a boxInternet Networkvs.
7Internet Infrastructure DesignNew World vs. Old WorldDespite the change in the Customer Provider dynamic, the fundamentals of building networks have not changedISP Geeks can learn from Telco Bell Heads the lessons learned from 100 years of experienceTelco Bell Heads can learn from ISP Geeks the hard experience of scaling at +100% per yearTelco InfrastructureInternet Infrastructure
8DesignHow Do We Get There?“In the Internet era, reliability is becoming something you have to build, not something you buy. That is hard work, and it requires intelligence, skills and budget. Reliability is not part of the basic package.”Joel Snyder – Network World Test Alliance 1/10/2000“Reliability: Something you build, not buy”
11Modular/Structured Design Other ISPsOrganize the network into separate and repeatable modulesBackbonePoPHosting servicesISP ServicesSupport/NOCISP Services (DNS, Mail, News,FTP, WWW)Hosted ServicesBackbone Linkto Another PoPBackbone linkto Another PoPNetworkCoreConsumerDIAL AccessConsumer Cableand xDSL AccessNx64 CustomerAggregation LayerNxT1/E1 CustomerAggregation LayerNetworkOperationsCentreChannelised T1/E1 CircuitsNx64 Leased Line Circuit DeliveryT1/E1 Leased Line Circuit DeliveryChannellized T3/E3 Circuits
12Modular/Structured Design Modularity makes it easy to scale a networkDesign smaller units of the network that are then plugged into each otherEach module can be built for a specific function in the networkUpgrade paths are built around the modules, not the entire network
13Functional Design One Box cannot do everything (no matter how hard people have tried in the past)Each router/switch in a network has a well-defined set of functionsThe various boxes interact with each otherEquipment can be selected and functionally placed in a network around its strengthsISP Networks are a systems approach to designFunctions interlink and interact to form a network solution.
14Tiered/Hierarchical Design Flat meshed topologies do not scaleHierarchy is used in designs to scale the networkGood conceptual guideline, but the lines blur when it comes to implementation.Access LayerDistributionLayerOtherRegionsCore
15Multiple Levels of Redundancy DesignMultiple Levels of RedundancyTriple layered PoP redundancyLower-level failures are betterLower-level failures may trigger higher-level failuresL2: Two of everythingL3: IGP and BGP provide redundancy and load balancingL4: TCP re-transmissions recover during the fail-overIntra-POP InterconnectBorderBackboneAccessPoP Intraconnect
16Multiple Levels of Redundancy DesignMultiple Levels of RedundancyMultiple levels also mean that one must go deep – for example:Outside Cable plant – circuits on the same bundle – backhoe failuresRedundant power to the rack – circuit over load and technician tripMIT (maintenance injected trouble) is one of the key causes of ISP outage.
17Multiple Levels of Redundancy DesignMultiple Levels of RedundancyObjectives –As little user visibility of a fault as possibleMinimize the impact of any fault in any part of the networkNetwork needs to handle L2, L3, L4, and router failureBackbonePeer NetworksPoPLocationAccessResidential Access
18Multiple Levels of Redundancy DesignMultiple Levels of RedundancyCustomer’s IGPAccess 1Access 2NAS 1NAS 2NetFlow Collector and Syslog ServerSW 1OSPF Area 0 and iBGPNeighboring POPCore 1Core 2SW 2OSPF Area 200OSPF Originate-Default into POPOSPF Originate-Default into PoPDial-upDedicated AccessPOP Interconnect MediumPoP Service and ApplicationsCore Backbone Router
20The Basics: Platform Redundant Power Redundant Cooling DesignThe Basics: PlatformRedundant PowerTwo power suppliesRedundant CoolingWhat happens if one of the fans fail?Redundant route processorsConsideration also, but less importantPartner router device is betterRedundant interfacesRedundant link to partner device is better
21The Basics: Environment DesignThe Basics: EnvironmentRedundant PowerUPS source – protects against grid failure“Dirty” source – protects against UPS failureRedundant cablingCable break inside facility can be quickly patched by using “spare” cablesFacility should have two diversely routed external cable pathsRedundant CoolingFacility has air-conditioning backup…or some other cooling system?
23Bad Architecture (1) A single point of failure Single collision domain DesignBad Architecture (1)A single point of failureSingle collision domainSingle security domainSpanning tree convergenceNo backupCentral switch performanceHSRPSwitchDial NetworkServer FarmISP Office LAN
24Customer Hosted Services DesignBad Architecture (2)A central routerSimple to buildResilience is the “vendor’s problem”More expensiveNo router is resilient against bugs or restartsYou always need a bigger routerUpstream ISPDial NetworkRouterCustomer Hosted ServicesCustomer linksISP Office LANServer farm
25Even Worse!! Avoid Highly Meshed, Non-Deterministic Large Scale L2 DesignEven Worse!!Avoid Highly Meshed, Non-Deterministic Large Scale L2Building 3Building 4Building 1Building 2Where Should Root Go?What Happens when Something Breaks?How Long to Converge?Many Blocking LinksLarge Failure Domain!Broadcast FloodingMulticast FloodingLoops within LoopsSpanning Tree Convergence TimeTimes 100 VLANs?
26Typical (Better) Backbone DesignTypical (Better) BackboneAccess L2Client BlocksDistribution L3Still a Potential for Spanning Tree Problems, but Now the Problems Can Be Approached Systematically, and the Failure domain Is LimitedBackbone Ethernet orATM Layer 2Distribution L3Server BlockAccess L2Server Farm
27The best architecture Client Core L3 Server farm multiple subnetworks DesignThe best architectureAccess L2ClientDistribution L3multiple subnetworksHighly hierarchicalControlled Broadcast and MulticastCore L3Distribution L3Server farmAccess L2
30Multi-homed Servers L3 (router) Core L3 (router) Distribution L2 TechnologyMulti-homed ServersUsing Adaptive Fault Tolerant” Drivers and NICsNIC Has a Single IP/MAC Address (Active on one NIC at a Time)When Faulty Link Repaired, Does Not Fail Back to Avoid FlappingFault-tolerant Drivers Available from Many Vendors: Intel, Compaq, HP, SunMany Vendors also Have Drivers that also Support etherchannelL3 (router)CoreL3 (router)DistributionL2Switch1Dual-homed Server—Primary NIC Recovery (Time 1–2 Seconds)Server Farm
31HSRP – Hot Standby Router Protocol TechnologyHSRP – Hot Standby Router Protocol00:10:7B:04:88:CC00:10:7B:04:88:BB00:00:0C:07:AC:01default-gw =Transparent failover of default router“Phantom” router createdOne router is active, responds to phantom L2 and L3 addressesOthers monitor and take over phantom addresses
32TechnologyHSRP – RFC 2281Router Group #1HSR multicasts hellos every 3 sec with a default priority of 100HSR will assume control if it has the highest priority and preempt configured after delay (default=0) secondsHSR will deduct 10 from its priority if the tracked interface goes downPrimaryStandbyStandbyPrimaryStandbyRouter Group #2
33HSRP Internet or ISP Backbone Server Systems Technology Router1: interface ethernet 0/0ip addressstandby 10 ipInternet or ISPBackboneRouter2:interface ethernet 0/0ip addressstandby 10 priority 150 pre-empt delay 10standby 10 ipstandby 10 track serial 0 60Router 1Router 2Server Systems
35DesignCircuit DiversityHaving backup PVCs through the same physical port accomplishes little or nothingPort is more likely to fail than any individual PVCUse separate portsHaving backup connections on the same router doesn’t give router independenceUse separate routersUse different circuit provider (if available)Problems in one provider network won’t mean a problem for your network
36DesignCircuit DiversityEnsure that facility has diverse circuit paths to telco provider or providersMake sure your backup path terminates into separate equipment at the service providerMake sure that your lines are not trunked into the same paths as they traverse the networkTry and write this into your Service Level Agreement with providers
37Service Provider Network TechnologyCircuit DiversityCustomerTHIS is better than….CustomerTHIS, which is better than….Service Provider NetworkCustomerTHISWhoops. You’ve been trunked!
38Circuit Bundling – MUX Use hardware MUX TechnologyCircuit Bundling – MUXUse hardware MUXHardware MUXes can bundle multiple circuits, providing L1 redundancyNeed a similar MUX on other end of linkRouter sees circuits as one linkFailures are taken care of by the MUXUsing redundant routers helpsWANMUXMUX
39Circuit Bundling – MLPPP TechnologyCircuit Bundling – MLPPPinterface Multilink1ip addressppp multilinkmultilink-group 1!interface Serial1/0no ip addressencapsulation pppinterface Serial1/1Multi-link PPP with proper circuit diversity, can provide redundancy.Router based rather than dedicated hardware MUXMLPPP Bundle
40DesignLoad SharingLoad sharing occurs when a router has two (or more) equal cost paths to the same destinationEIGRP also allows unequal-cost load sharingLoad sharing can be on a per-packet or per-destination basis (default: per-destination)Load sharing can be a powerful redundancy technique, since it provides an alternate path should a router/path fail
41Load Sharing OSPF will load share on equal-cost paths by default TechnologyLoad SharingOSPF will load share on equal-cost paths by defaultEIGRP will load share on equal-cost paths by default, and can be configured to load share on unequal-cost paths:Unequal-cost load-sharing is discouraged; Can create too many obscure timing problems and retransmissionsrouter eigrp 111networkvariance 2
42TechnologyPolicy-based RoutingIf you have unequal cost paths, and you don’t want to use unequal-cost load sharing (you don’t!), you can use PBR to send lower priority traffic down the slower path! Policy map that directs FTP-Data! out the Frame Relay port. Could! use set ip next-hop insteadroute-map FTP_POLICY permit 10 match ip address 6 set interface Serial1.1!! Identify FTP-Data trafficaccess-list 6 permit tcp any eq 20 any! Policy maps are applied against! inbound interfacesinterface ethernet 0 ip policy route-map FTP_POLICYFTP ServerFrame Relay128KATM 2M
43DesignConvergenceThe convergence time of the routing protocol chosen will affect overall availability of your WANMain area to examine is L2 design impact on L3 efficiency
45OSPF – Hierarchical Structure DesignOSPF – Hierarchical StructureBackboneArea #0Area #1Area #2Area #3ABRTopology of an area is invisible from outside of the areaLSA flooding is bounded by areaSPF calculation is performed separately for each area19
46Factors Assisting Protocol Convergence DesignFactors Assisting Protocol ConvergenceKeep number of routing devices in each topology area small (15 – 20 or so)Reduces convergence time requiredAvoid complex meshing between devices in an areaTwo links are usually all that are necessaryKeep prefix count in interior routing protocols smallLarge numbers means longer time to compute shortest pathUse vendor defaults for routing protocol unless you understand the impact of “twiddling the knobs”Knobs are there to improve performance in certain conditions only
48PoP Design One router cannot do it all Redundancy redundancy redundancyMost successful ISPs build two of everythingTwo smaller devices in place of one larger device:Two routers for one functionTwo switches for one functionTwo links for one function
49PoP Design Two of everything does not mean complexity Avoid complex highly meshed network designsHard to runHard to debugHard to scaleUsually demonstrate poor performance
50PoP Design – Wrong Design External BGP Peering Neighboring PoP Big RouterDedicated AccessBig SWBig NASBig ServerPSTN/ISDNWeb Services
52Hubs vs. Switches Hubs These are obsolete TechnologyHubs vs. SwitchesHubsThese are obsoleteSwitches cost little moreTraffic on hub is visible on all portsIt’s really a replacement for coax ethernetSecurity!?Performance is very low10Mbps shared between all devices on LANHigh traffic from one device impacts all the othersUsually non-existent management
53Hubs vs. Switches Switches Each port is masked from the other TechnologyHubs vs. SwitchesSwitchesEach port is masked from the otherHigh performance10/100/1000Mbps per portTraffic load on one port does not impact other ports10/100/1000 switches are commonplace and cheapChoose non-blocking switches in corePacket doesn’t have to wait for switchManagement capability (SNMP via IP, CLI)Redundant power supplies are useful to have
54Beware Static IP Dial Problems Solutions Does NOT scale DesignBeware Static IP DialProblemsDoes NOT scaleCustomer /32 routes in IGP – IGP won’t scaleMore customers, slower IGP convergenceSupport becomes expensiveSolutionsRoute “Static Dial” customers to same RAS or RAS group behind distribution routerUse contiguous address blockMake it very expensive – it costs you money to implement and support
56Network Operations Centre ProcessNetwork Operations CentreNOC is necessary for a small ISPIt may be just a PC called NOC, on UPS, in equipment room.Provides last resort access to the networkCaptures log information from the networkHas remote access from outsideDialup, SSH,…Train staff to operate itScale up the PC and support as the business grows
57Operations A NOC is essential for all ISPs ProcessOperationsA NOC is essential for all ISPsOperational Procedures are necessaryMonitor fixed circuits, access devices, serversIf something fails, someone has to be toldEscalation path is necessaryIgnoring a problem won’t help fixing it.Decide on time-to-fix, escalate up reporting chain until someone can fix it
58Operations Modifications to network ProcessOperationsModifications to networkA well designed network only runs as well as those who operate itDecide and publish maintenance schedulesAnd then STICK TO THEMDon’t make changes outside the maintenance period, no matter how trivial they may appear
59In SummaryDesignImplementing a highly resilient IP network requires a combination of the proper process, design and technology“and now abideth design, technology and process, these three; but the greatest of these is process”And don’t forget to KISS!Keep It Simple & Stupid!TechnologyProcess
60AcknowledgementsThe materials and Illustrations are based on the Cisco Networkers’ PresentationsPhilip Smith of Cisco SystemsBrian Longwe of Inhand .Ke