Published by Gyles Perry. Modified over 9 years ago.
1
Protocol implementation
Next-hop resolution
Reliability and graceful restart
2
What is a next hop
The destination of the packets I am sending
  – Not the same as the interface
  – An Ethernet interface will have many nodes behind it
  – A directly connected next hop is 1 hop away
E.g. RSVP sends a PATH message to the next downstream node
  – Next hop may be directly connected (strict ERO)
  – Or not (loose ERO)
OSPF sends an LS update to the other end of a link or to a neighbor on an Ethernet
  – Always directly connected
BGP has an iBGP next hop for each of its paths
  – Not directly connected
3
Next-hop
If the next hop is not directly connected, the way to reach it depends on the IGP
  – May change when IGP routing changes
  – Will have to use a different interface to reach it
  – Need to keep track of these changes
Next-hop resolution
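A minimal sketch of what resolving a non-directly-connected next hop could look like: a longest-prefix match of the next-hop address against the IGP routes, repeated after a routing change. The function name, the route table layout and the interface names are assumptions made for illustration, not taken from any particular router implementation.

```python
import ipaddress

def resolve_next_hop(next_hop, igp_routes):
    """Return (prefix, interface) of the longest matching IGP route, or None.

    igp_routes is a hypothetical table: prefix string -> outgoing interface.
    """
    addr = ipaddress.ip_address(next_hop)
    best = None
    for prefix, interface in igp_routes.items():
        net = ipaddress.ip_network(prefix)
        # Keep the most specific route that covers the next hop.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    return None if best is None else (str(best[0]), best[1])

# Illustrative IGP table; addresses and interface names are made up.
routes = {"10.0.0.0/8": "eth0", "10.1.0.0/16": "eth1"}
before = resolve_next_hop("10.1.2.3", routes)

# An IGP change withdraws the more specific route, so the same next hop
# must now be reached through a different interface.
del routes["10.1.0.0/16"]
after = resolve_next_hop("10.1.2.3", routes)
```

The same next hop resolves to a different interface after the change, which is why the result has to be re-checked whenever IGP routing changes.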
4
Periodic resolution
  – May take a bit more time
      But next hops will not be too many
      Or will they? Tunnels, VLANs …
  – Quagga uses this approach
      Through the IPV4_LOOKUP_NEXTHOP command
Registration/notification
  – RSVP would tell zebra which next hops it is interested in
  – Zebra will notify RSVP when something changes in the IGP path to them
      Better scaling for RSVP
      Difficult to ensure good scaling inside zebra
  – Various protocols may register 1000s of next hops
      More complex code in zebra
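The registration/notification approach can be sketched as a small registry: a client registers a callback for each next hop it cares about, and is called only when the resolved path to that next hop changes. This is an illustration under assumed names (`NextHopRegistry`, `register`, `igp_update`); it is not zebra's actual interface.

```python
from collections import defaultdict

class NextHopRegistry:
    """Hypothetical stand-in for the routing manager's next-hop tracking."""

    def __init__(self):
        self._watchers = defaultdict(list)  # next hop -> list of callbacks
        self._resolved = {}                 # next hop -> current interface

    def register(self, next_hop, callback):
        # A protocol declares interest in one next hop.
        self._watchers[next_hop].append(callback)

    def igp_update(self, next_hop, interface):
        # Notify watchers only when the resolution really changed.
        if self._resolved.get(next_hop) == interface:
            return
        self._resolved[next_hop] = interface
        for callback in self._watchers.get(next_hop, []):
            callback(next_hop, interface)

events = []
registry = NextHopRegistry()
registry.register("192.0.2.1", lambda nh, ifc: events.append(("rsvp", nh, ifc)))

registry.igp_update("192.0.2.1", "eth0")     # first resolution: notified
registry.igp_update("192.0.2.1", "eth0")     # no change: silent
registry.igp_update("192.0.2.1", "eth1")     # path moved: notified
registry.igp_update("198.51.100.1", "eth2")  # nobody registered: silent
```

The client does no polling, but the registry has to hold one entry per registered next hop, which is where the scaling concern inside zebra comes from.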
5
Network Reliability
Availability: how many nines?
  – 99.999% is 5.26 min of downtime per year
  – 99.9999% is 31.5 sec of downtime per year
Telephone networks are between 5 and 6 nines
  – Internet will have to get there
  – Currently at 4 nines? (vendors claim 5)
  – Very important with the new types of traffic
      VoIP, IPTV
What can go wrong (% of failures for US telephone network ca. 1992):
  – Hardware failures (19%)
  – Software failures (14%)
  – Human errors (49%)
  – Vandalism/Terrorism
  – Acts of nature (11%)
  – Overload (6% but had the largest impact on customers)
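The downtime figures follow directly from the availability percentage. A short check, assuming a 365-day year (the function name is only illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds, ignoring leap years

def downtime_seconds_per_year(availability_percent):
    """Seconds per year a system is down at the given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR

five_nines_minutes = downtime_seconds_per_year(99.999) / 60  # about 5.26 min
six_nines_seconds = downtime_seconds_per_year(99.9999)       # about 31.5 sec
four_nines_minutes = downtime_seconds_per_year(99.99) / 60   # about 52.6 min
```

Each extra nine cuts the allowed downtime by a factor of ten, so going from 4 to 5 nines means going from under an hour per year to about five minutes.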
6
Hardware failures
Link failures
  – Protocols can cope with that
      Re-route, may be slow
      More aggressive repair methods
        – We will see them later
Router failures
  – Cannot do much, just add redundancy
      Power supplies, fans, disks, etc.
  – Line-card failure is similar to a link failure
  – Control processor failure is more serious
      Always have two of them
      Primary and backup
7
Modern router architectures
Dual controllers
  – For running the control plane
Multiple line cards
  – Can operate without the controllers
  – Router can forward traffic even when the control plane crashes
  – Called non-stop forwarding or head-less operation
8
Software failures
When the primary fails, start using the backup
  – Switchover
Must be as fast as possible
  – Things in the network change in the meantime
  – Need to minimize this window
What happens with the control software
  – Need to keep primary and backup instances in sync
  – How tight is this synchronization?
9
Tight synchronization
Both primary and backup are active; keep them in sync by:
Sending them both the same input (i.e. duplicate control packets)
  – Fastest possible switchover
  – Expensive, may need to duplicate packets
  – Does not work for TCP-based protocols
Having the primary keep sending state updates to the backup
  – May need to send too many messages
Being totally in sync is not easy
  – Needs transactional communication
10
Loose synchronization
Backup is idle
  – But we keep configuration up to date
  – Each configuration change on the primary is mirrored on the backup
Backup instance is started when the primary fails
  – Switchover will take longer
Much, much simpler
  – Configuration changes are much less frequent
Variation:
  – Keep only the RIB process in sync in both primary and backup
11
Non-stop forwarding
Key concept
  – Forwarding happens in the line cards
  – Even if the control processor fails, forwarding can continue
  – Non-stop forwarding, head-less operation
Old common sense: when router s/w crashes, do not use the router
  – But with head-less operation it is OK to continue using routers whose s/w has crashed
  – Assuming their s/w will be operational again soon
12
Special case
Planned restart
  – For s/w upgrade
      These are a significant percentage of downtime
  – For refresh
      Memory is leaking but s/w is still operational
      Restart to get a clean start
I can use graceful restart
13
Graceful restart
Other routers in the network will keep using a neighbor router
  – Even if it looks like its control plane has failed
  – Assuming it will come back soon
Needs coordination
  – The failed router needs to do some special processing when it comes back
  – It has to tell its neighbors first that it supports graceful restart
Zero impact on the network
  – The failed router will have the chance to restart its s/w and come back
  – Nobody in the rest of the network will know that something happened
14
How does it work
Used for all protocols by now
  – OSPF, BGP, RSVP-TE…
The neighbor will discover that the router is dead or has restarted
  – HELLO timeout, different information in the HELLOs, etc.
  – But will ignore it for a certain time period
If the failed router comes back within this period
  – It will re-sync its state (database exchange for OSPF, resend all the LSPs for RSVP, …)
  – And all is back to normal
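The neighbor's side of this can be sketched as a three-state machine: up, restarting (inside the grace period) and down. The class, its method names and the 120-second grace period are assumptions chosen for illustration; real protocols negotiate the period and define their own message exchange.

```python
class NeighborState:
    """Hypothetical view a helper router keeps of one restarting neighbor."""

    def __init__(self, grace_period):
        self.grace_period = grace_period  # seconds to tolerate the outage
        self.status = "up"
        self.lost_at = None

    def hello_timeout(self, now):
        # HELLOs stopped: do not tear anything down yet, start the clock.
        if self.status == "up":
            self.status = "restarting"
            self.lost_at = now

    def tick(self, now):
        # Grace period expired: treat the neighbor as really dead.
        if self.status == "restarting" and now - self.lost_at > self.grace_period:
            self.status = "down"

    def neighbor_returned(self, now):
        # True if the neighbor came back in time and state can be re-synced.
        self.tick(now)
        if self.status == "restarting":
            self.status = "up"
            self.lost_at = None
            return True
        return False  # too late: a normal, full re-establishment is needed

    def keep_forwarding(self):
        return self.status in ("up", "restarting")

fast = NeighborState(grace_period=120)
fast.hello_timeout(now=0)
forwarding_during_restart = fast.keep_forwarding()
came_back_in_time = fast.neighbor_returned(now=60)

slow = NeighborState(grace_period=120)
slow.hello_timeout(now=0)
slow.tick(now=121)
forwarding_after_expiry = slow.keep_forwarding()
```

Time is passed in explicitly so the behavior is easy to follow; a real implementation would drive `tick` from a timer.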
15
Example: RSVP
Use HELLOs
Special recovery label messages
Restarting router needs to remember the labels it allocated before the crash
  – Where?
      Shared memory
      Recover them from the forwarding plane
  – Why?
      Must use the same labels again
      Must make sure it does not use an allocated label for some other LSP
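Why the old labels must be remembered can be shown with a toy allocator: labels recovered after the restart are re-claimed for their original LSPs first, so a new LSP can never be handed a label that is still installed in the forwarding plane. The class, the label range and the LSP names are hypothetical.

```python
class LabelAllocator:
    """Illustrative label pool for a restarting router."""

    def __init__(self, first, last):
        self._free = set(range(first, last + 1))
        self.owner = {}  # label -> LSP identifier

    def recover(self, label, lsp_id):
        # Re-claim a label that was in use before the restart.
        if label not in self._free:
            raise ValueError(f"label {label} is not available")
        self._free.remove(label)
        self.owner[label] = lsp_id

    def allocate(self, lsp_id):
        # Hand out the lowest free label (raises ValueError if none is left).
        label = min(self._free)
        self._free.remove(label)
        self.owner[label] = lsp_id
        return label

# Assumed state read back from the forwarding plane after the restart.
surviving_entries = {16: "lsp-a", 18: "lsp-b"}

allocator = LabelAllocator(first=16, last=19)
for label, lsp in surviving_entries.items():
    allocator.recover(label, lsp)

new_label = allocator.allocate("lsp-c")  # skips 16 and 18
```

Without the recovery step the first allocation would return 16, which the line cards are still using for a different LSP.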
16
Example: OSPF
Trick is to re-establish the adjacencies after a failure
Remember the set of neighbors
  – Shared memory or in the backup controller
After restart do not originate any LSAs
Just re-establish adjacencies and re-sync the database
17
Graceful restart catches
All routers in the network should implement this for it to work
Mostly for planned restarts:
  – S/w upgrades
  – Refreshes (if a router runs low on memory)
  – But it is possible to use it for crashes too!
It cannot work if something changes in the network while the restart is going on
  – There may be routing loops
18
Router self-monitoring
Automatically restart failed or stuck processes
A separate monitor process
  – Keeps an eye on other processes
  – If there is a failure, the failed process is restarted
      Of course it may fail again
  – Heartbeats to determine liveness
  – Failure may not necessarily be a crash
      Could be a software bug that causes an infinite loop or very, very slow processing
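A heartbeat-based monitor can be sketched in a few lines. A process that stops sending heartbeats is restarted whether it crashed or is merely stuck in a loop, since both look the same from outside. The class, the timeout value and the process names are assumptions for illustration, and the restart action is just a callback here.

```python
class Monitor:
    """Hypothetical watchdog that restarts processes with stale heartbeats."""

    def __init__(self, timeout, restart):
        self.timeout = timeout  # seconds of silence tolerated
        self.restart = restart  # callback taking the process name
        self.last_beat = {}     # process name -> time of last heartbeat

    def heartbeat(self, name, now):
        self.last_beat[name] = now

    def check(self, now):
        restarted = []
        for name, seen in self.last_beat.items():
            if now - seen > self.timeout:
                self.restart(name)
                # Give the fresh instance a full timeout before judging it.
                self.last_beat[name] = now
                restarted.append(name)
        return restarted

restarts = []
monitor = Monitor(timeout=3, restart=restarts.append)

monitor.heartbeat("ospfd", now=0)
monitor.heartbeat("bgpd", now=0)
monitor.heartbeat("bgpd", now=4)   # bgpd is alive; ospfd has gone silent

restarted_now = monitor.check(now=5)
```

A production monitor would also limit how often the same process is restarted, since a process that fails again immediately would otherwise be restarted forever.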
19
Why is it important
Remember the PoP structure
  – Need dual routers for reliability
  – If I had a single router that was extra reliable I could save a lot of money
20
Issues
Strict isolation
  – VMs
  – Other methods
Global resource coordination
  – For example memory