Network Survivability, Reliability and Availability: Protection & Restoration Zilong Ye, Ph.D. zye5@calstatela.edu.


Reliability
- Reliability is the probability that a system or component will operate without any service-affecting failure for a period of time t.
- Reliability is a monotonically decreasing probability function of time, R(t).
- A specific reliability number always implies an assumed duration of time.
- Reliability tells us how soon the next failure, and hence the next repair expense, might be incurred.
- Reliability by itself does not consider the repeated cycles of failure, repair time, and return to service, which determine the availability of an ongoing service.

Availability
- Availability is the probability that a system will be found in the operating state at a random time in the future.
- Availability inherently reflects a statistical equilibrium between failure processes, measured by mean time to (or between) failure (MTTF/MTBF), and repair processes, measured by mean time to repair (MTTR), in maintained repairable systems that are returned to the operating state following any failure.
- Availability = MTTF / (MTTF + MTTR)
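As a quick illustration of the availability ratio, here is a minimal sketch. The function name and the use of hours as the unit are assumptions made for the example; any consistent time unit works.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR).

    Both arguments must use the same time unit (hours are assumed here).
    """
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical component: fails on average every 9,999 hours, repaired in 1 hour.
example = availability(9_999.0, 1.0)
```

With these hypothetical numbers the component is up 9,999 hours out of every 10,000, i.e. it reaches 4-nines.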

Quantification of Availability
Percent Availability   N-Nines   Downtime (Minutes/Year, approx.)
99%                    2-Nines   5,000 Min/Yr
99.9%                  3-Nines   500 Min/Yr
99.99%                 4-Nines   50 Min/Yr
99.999%                5-Nines   5 Min/Yr
99.9999%               6-Nines   0.5 Min/Yr
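The downtime figures for each N-nines level can be recomputed directly. The sketch below assumes a 365-day year (525,600 minutes); the helper name is made up for this example. It gives slightly larger values than the rounded ones quoted for the slide (about 5,256 min/yr for 2-nines rather than 5,000).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Expected yearly downtime for an availability of N nines (e.g. 3 -> 99.9%)."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    unavailability = 10.0 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


for n in range(2, 7):
    print(f"{n}-Nines: {downtime_minutes_per_year(n):.2f} min/yr")
```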

Survivability
- Survivability of a network as a whole is the average fraction of failed working capacity that can be restored by a specified mechanism within the spare capacity provided in the network.
- A link may be fully survivable (100% of its capacity restored) or partially survivable (less than 100% restored).
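One way to turn this definition into a number is sketched below. It assumes an unweighted average over single-link failure scenarios, which is only one reading of "average fraction"; a capacity-weighted average would be an equally plausible choice. The function name and input layout are hypothetical.

```python
def network_survivability(failures):
    """Average restorable fraction over a set of link-failure scenarios.

    failures: iterable of (failed_working_capacity, restored_capacity) pairs,
    one pair per failed link. Restored capacity is capped at the failed amount.
    """
    fractions = []
    for failed, restored in failures:
        if failed <= 0:
            continue  # a link carrying no working capacity has nothing to restore
        fractions.append(min(restored, failed) / failed)
    # With nothing to restore, the network is trivially fully survivable.
    return sum(fractions) / len(fractions) if fractions else 1.0
```

For example, if one link is fully restored (10 of 10 units) and another only half restored (5 of 10 units), the network survivability under this reading is 0.75.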

Market Drivers for Survivability
- Customer Relations
- Competitive Advantage
- Revenue
  - Negative: Tariff Rebates
  - Positive: Premium Services (Business Customers, Medical Institutions, Government Agencies)
- Impact on Operations
- Minimize Liability

Failure Types & Other Motivations
Types of failure:
- Components: links, nodes, channels in WDM, active components, software…
- Human error: backhoe fiber cut
  - Fiber inside oil/gas pipelines is less likely to be cut
- Systems: entire COs can fail due to catastrophic events or deliberate attacks

Network Survivability: drivers
- Availability: 99.999% (5 nines) means roughly 5 min of downtime per year
- Since a network is made up of several components, the ONLY way to reach 5-nines is to add survivability in the face of failures…
  - Survivability = continued service in the presence of failures
- Protection switching or restoration: mechanisms used to ensure survivability
  - Add redundant capacity, detect faults and automatically re-route traffic around the failure
- Protection: fast time-scale: 10s-100s of ms…
  - implemented in a distributed manner to ensure fast restoration
- Restoration: related term, but slower time-scale
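The claim that a multi-component network cannot reach 5-nines without redundancy can be checked with a small calculation. The sketch below assumes components fail independently, which real networks only approximate; the function names and the 0.9999 per-component figure are illustrative, not taken from the slides.

```python
from math import prod


def series_availability(parts):
    """All components must be up: availabilities multiply."""
    return prod(parts)


def parallel_availability(parts):
    """Service is up if at least one redundant branch is up."""
    return 1.0 - prod(1.0 - a for a in parts)


# Hypothetical path: ten components in series, each 4-nines available.
single_path = series_availability([0.9999] * 10)

# Two such paths in parallel, as with a working path and a disjoint backup.
protected = parallel_availability([single_path, single_path])
```

Under these assumptions the unprotected path drops to about 3-nines, while the protected pair is back above 5-nines.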

Types of Fault-Recovery Mechanisms
Protection
- Backup resources (routes and wavelengths) are pre-computed and reserved in advance (before a failure occurs): simple, but 50% overhead
- Faster recovery time
- What if the pre-reserved resources also fail?
Restoration
- Routes and wavelengths are discovered dynamically after a failure is detected
- Resource allocation is based on current network state information
- More resource efficient
- Can recover as long as there are redundant resources
- Slower recovery time

Restoration
Path Restoration
- Route can be computed after failure
Link Restoration
- Path is discovered at the end nodes of the failed link
- More practical than path restoration
Advantages & Disadvantages of Restoration
- Can usually recover from multiple element faults
- More efficient usage of resources
- Complex
- Slow: requires extra processing time to set up the path and reserve resources

Comparison between Protection & Restoration
Characteristic: Protection -- resources are reserved before the failure and may go unused; Restoration -- resources are reserved and used after the failure
Route: Protection -- predetermined; Restoration -- can be dynamically computed
Resource Efficiency: Protection -- Low; Restoration -- High

Comparison between Protection & Restoration (Cont’)
Time used: Protection -- Short; Restoration -- Long
Reliability: Protection -- mainly for single fault; Restoration -- can survive under multiple faults
Implementation: Protection -- Simple; Restoration -- Complex

Network Survivability Architectures
[Figure: taxonomy of network survivability architectures. Labels: Restoration; Protection; Self-healing Network; Re-Configurable Network; Protection Switching; Linear Protection Architectures; Ring Protection Architectures; Mesh Restoration Architectures; Path-based; Link-based; Segment-based]

Restoration in Mesh Networks
[Figure: two mesh restoration architectures built from DCS nodes. Labels: Self Healing (distributed) Restoration Architecture; Reconfigurable (or Rerouting) Restoration Architecture (centralized); Central Controller; Probing after restoration]
DC = Distributed Controller

Protection Switching Terminology
1+1 architectures
- permanent bridge at the source, select at the sink
m:n architectures
- m entities provide protection for n working entities, where m is less than or equal to n
- allows unprotected extra traffic on the m entities
- most common: SONET linear 1:1 and 1:n
Coordination protocol
- provides coordination between controllers in source and sink
- required for all m:n architectures
- not required for 1+1 architectures

Basic Ideas: Working and Backup Paths

Protection Switching: Terminology
- Dedicated vs shared: a working connection is assigned dedicated or shared protection bandwidth
  - 1+1 is dedicated, 1:n is shared
- Revertive vs non-revertive: after the failure is fixed, traffic is switched back to the working path automatically (revertive) or only manually (non-revertive)
  - Shared protection schemes are usually revertive

Different protection and restoration schemes
[Figure: tree of protection/restoration schemes. Labels: Protection (Ring Protection, Mesh Protection, Ring/Mesh Protection; Link Protection, Path Protection); Restoration (Link Restoration, Path Restoration)]

Path vs Link Protection
[Figure: a working path and a protection path through DCS nodes, illustrating line (link) protection]
- Control: Centralized or Distributed
- Route Calculation: Preplanned or Dynamic
- Type of Alternate Routing: Line or Path

Path Protection
- Two link- (or node-) disjoint paths: a primary (working) path and a backup (protection) path
- Traffic is rerouted through the link-disjoint backup route once a link failure occurs on the working path
- Usually less resource is required (using shorter routes)
- Lower end-to-end propagation delay for the recovered route
- Backup path is pre-reserved or pre-set up
- Backup paths of different connections may or may not share common wavelengths on common links

Dedicated Path Protection
- Does not allow sharing among backup paths (resources)
- Backup paths are pre-configured
- No switch configuration is necessary along the backup path when a failure occurs
- Fast recovery time
- Resources are not efficiently utilized (100% redundancy)

Shared Path Protection
- Allows sharing among backup paths subject to certain constraints
  - If the primary/active paths (AP) are link-disjoint, their backup paths (BP) may share a common link and wavelength
- Backup paths are configured only when a failure occurs: since backup paths may be shared, resources cannot be committed to a particular primary in advance
- Slower recovery time
- Much better resource utilization
- More signaling required to recover from the failure

Shared Path Protection
- If and only if two APs are link-disjoint can their BPs share backup bandwidth (BBW) on a common link (i.e., total backup bandwidth = max{w1, w2}).
[Figure: AP1 (bandwidth w1) from S1 to D1 and AP2 (bandwidth w2) from S2 to D2; BP1 and BP2 share link L, which reserves max{w1, w2}]
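The sharing rule can be generalized to any number of connections whose backup paths cross the same link. The sketch below assumes a single-link-failure model: the link must reserve enough backup bandwidth for the worst single failure, i.e. the largest total bandwidth of the active paths that any one link failure takes down together. The function name and data layout are hypothetical.

```python
def backup_bandwidth_on_link(active_paths):
    """Backup bandwidth a shared link must reserve under single-link failures.

    active_paths: list of (links, bandwidth) pairs, one per connection whose
    backup path crosses the shared link; `links` is the set of links on that
    connection's active path.
    """
    load_per_failure = {}
    for links, bandwidth in active_paths:
        for link in links:
            # If `link` fails, this connection switches to its backup path.
            load_per_failure[link] = load_per_failure.get(link, 0) + bandwidth
    return max(load_per_failure.values(), default=0)


# Two link-disjoint active paths: the backups can share, so max{w1, w2}.
disjoint = backup_bandwidth_on_link([({"a", "b"}, 3), ({"c"}, 5)])

# Active paths sharing link "b": one failure hits both, so w1 + w2.
overlapping = backup_bandwidth_on_link([({"a", "b"}, 3), ({"b", "c"}, 5)])
```

With w1 = 3 and w2 = 5, the disjoint case reserves 5 units and the overlapping case 8 units.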

Link Protection
- A lightpath is set up on a primary path
- For each link on the primary path, a backup detour is reserved around the link
- No sharing: dedicated-link protection
  - The wavelength used on the backup loop is dedicated to the specific link being protected
- Shared-link protection
- Note: different connections on the same link might have different backup detours for that link

Solution 1: Active-Path First
- Find an active path (AP) first
- Then find a disjoint backup path (BP)
- How? Remove the physical links and resources that the active path traverses, and then re-run the routing algorithm to find the backup path.
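The two-step procedure can be sketched as follows. This is a minimal illustration, assuming an undirected graph stored as a dict of dicts with positive link costs and Dijkstra as the routing algorithm; wavelength assignment is left out, and the function names are made up for the example.

```python
import heapq


def shortest_path(graph, src, dst, banned=frozenset()):
    """Dijkstra on an undirected graph {node: {neighbor: cost}}.

    `banned` holds undirected links as frozensets of two nodes.
    Returns the node list of a shortest path, or None if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph[u].items():
            if frozenset((u, v)) in banned:
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def active_path_first(graph, src, dst):
    """Find the AP, remove its links, then re-run routing to find the BP."""
    ap = shortest_path(graph, src, dst)
    if ap is None:
        return None, None
    used = {frozenset(pair) for pair in zip(ap, ap[1:])}
    bp = shortest_path(graph, src, dst, banned=used)
    return ap, bp  # bp is None when no link-disjoint backup remains
```

Note that the second call can return None even when a link-disjoint pair exists, because the shortest AP may use exactly the links a backup would need.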

Solution 2: Joint Path Selection
- Select the active path and backup path jointly.
- Joint optimization performs better than active-path-first schemes in terms of the amount of network resources required
- How? Use Suurballe’s algorithm to compute two link-disjoint paths between (s, d) simultaneously

Suurballe’s Algorithm
Given a graph G=(V, E), find a pair of edge-disjoint paths from s to t such that the total edge cost of the two paths is minimal among all such path pairs
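The sketch below computes such a minimum-total-cost edge-disjoint pair. It is not Suurballe's exact procedure (which reweights edges so that Dijkstra can be used twice); it uses the closely related negative-arc formulation usually attributed to Bhandari, with Bellman-Ford for the second search, which yields the same optimal pair. The graph layout (undirected, dict of dicts, positive costs) and all function names are assumptions for this example.

```python
def bellman_ford_path(arcs, nodes, src, dst):
    """Shortest path over directed arcs {(u, v): cost}; negative costs allowed."""
    dist = {n: float("inf") for n in nodes}
    dist[src] = 0
    prev = {}
    for _ in range(len(nodes) - 1):
        changed = False
        for (u, v), cost in arcs.items():
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                prev[v] = u
                changed = True
        if not changed:
            break
    if dist[dst] == float("inf"):
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def path_cost(graph, path):
    return sum(graph[u][v] for u, v in zip(path, path[1:]))


def disjoint_pair(graph, src, dst):
    """Two edge-disjoint src-dst paths of minimum total cost, or None."""
    nodes = list(graph)
    arcs = {(u, v): c for u in graph for v, c in graph[u].items()}
    p1 = bellman_ford_path(arcs, nodes, src, dst)
    if p1 is None:
        return None
    # On each link of p1, drop the forward arc and negate the reverse arc.
    for u, v in zip(p1, p1[1:]):
        cost = arcs.pop((u, v))
        arcs[(v, u)] = -cost
    p2 = bellman_ford_path(arcs, nodes, src, dst)
    if p2 is None:
        return None
    # Cancel links that p2 traverses against the direction of p1.
    used = set(zip(p1, p1[1:]))
    for u, v in zip(p2, p2[1:]):
        if (v, u) in used:
            used.remove((v, u))
        else:
            used.add((u, v))
    # The remaining arcs form two edge-disjoint paths; walk them from src.
    succ = {}
    for u, v in used:
        succ.setdefault(u, []).append(v)
    paths = []
    for _ in range(2):
        path = [src]
        while path[-1] != dst:
            path.append(succ[path[-1]].pop())
        paths.append(path)
    return sorted(paths, key=lambda p: path_cost(graph, p))
```

On a four-node graph with links s-a, a-b, b-t of cost 1 and s-b, a-t of cost 3, the single shortest path s-a-b-t blocks every disjoint backup, whereas this joint search returns s-a-t and s-b-t with total cost 8.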

Dedicated-path protection – Heuristic Algorithms
- Remove links that do not have free wavelengths
- Apply Suurballe’s algorithm to find a pair of paths
- Choose the shorter path as the primary path and the longer path as the backup
- Assign a wavelength to each path using First-Fit
- Guarantees the minimum total bandwidth (TBW) = active BW (ABW) + backup BW (BBW) for this request

Shared-Path Protection
Heuristic 1
- Use Suurballe’s algorithm to generate two routes
- Assign wavelengths while trying to share the wavelengths on the backup paths as much as possible
- Does not perform well in backup path sharing, since routing considers neither wavelength information nor backup bandwidth sharing potential

A Fast and Efficient Heuristic 3
Challenges
- Jointly optimizing an AP/BP pair with shared path protection is NP-hard
  - Using ILP is notoriously time consuming
  - Also, it only guarantees minimal TBW for each request, not minimal TBW over all requests
- Heuristics such as active path first (APF) can only achieve sub-optimal results: APF does not consider the yet-to-be-incurred backup cost along the BP when selecting a (shortest) AP

Potential Backup Cost (PBC)
- Uses a shortest-path algorithm to find the AP first
- But, in selecting the AP, each capable link e (Re ≥ w) is assigned a cost of w plus a second term, the potential backup cost (PBC) of link e for bandwidth w
- Then finds a shortest BP
- Combines the best of the ILP- and APF-based approaches
- See Xu et al., Journal of Lightwave Technology, vol. 25, no. 8, pp. 2251-2259, 2007

Protection in SRLG networks

Shared Risk Link Group (SRLG)
- Widely recognized as an important concept in survivable optical networks
- A group of network links that share a common physical resource (cable, conduit, etc.)
- Arises from the layered structure:
  - Physical layer: fiber spans (cable, conduit, etc.)
  - Optical layer: optical links and nodes (a subset of the nodes in the physical layer)

Layered Architecture of Optical Network
[Figure: (a) Optical Layer: nodes 1-5 with optical links labeled e1-e5 and e7. (b) Physical Layer: the same node labels with fiber spans labeled g1-g9.]

Protection in SRLG networks
- Finding an SRLG-disjoint path pair is more complicated than finding a link/node-disjoint path pair. In fact, the former is an NP-complete problem.
- If backup bandwidth (BBW) sharing is considered, the SRLG protection problem becomes even more complicated.

APF and Trap
- Active Path First (APF), followed by an SRLG-disjoint BP, is an attractive alternative (policy-based routing, optimal AP)
- But it may fail to find such a BP more frequently (up to 30% of the time statistically) when considering SRLGs
- Trap: can’t find an SRLG-disjoint BP
  - Real traps: unavoidable, topology-induced
  - Avoidable traps: algorithm-induced
- Only a few APF algorithms so far deal with avoidable traps.

Other APF-based Heuristic
K Shortest Paths (KSP)
- Finds the first K shortest paths between the source and destination as candidate APs, and then tests them in increasing order of cost until an SRLG-disjoint BP is found or all of them have been tested.

Proposed Trap Avoidance (TA) Algorithm
- Similar to KSP: iteratively tests candidate APs to find one that has an SRLG-disjoint BP
- But TA constructs one AP at a time, and modifies it into a new AP for testing only if necessary
- TA uses a more intelligent method to avoid the most “risky” link when modifying the AP
  - KSP is oblivious/blind to “bad” links
- See Xu et al., IEEE/OSA Journal of Lightwave Technology (JLT), Special Issue on Optical Networks, Vol. 21, No. 11, 2003

Find a Candidate BP
- All the directed links along the AP are assigned a cost of infinity to prevent them from being used by the BP.
- All the remaining links that share at least one SRLG with any link on the AP (including the links along the reversed AP) are assigned a large value M as cost.
  - This discourages any shortest-path algorithm from using such “M” links for the candidate BP, but does not forbid it.
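The cost assignment for the candidate BP search can be sketched as below. The data layout (directed links as node pairs, SRLG membership as sets of integer ids), the function name, and the default value of M are all assumptions for illustration; the source does not say how M is chosen.

```python
def backup_link_costs(link_cost, link_srlgs, active_path, big_m=1_000_000):
    """Link costs to use when searching for a candidate BP.

    link_cost:   {(u, v): cost} for every directed link
    link_srlgs:  {(u, v): set of SRLG ids the link belongs to}
    active_path: list of nodes along the AP
    big_m:       large penalty M (hypothetical default)
    """
    ap_links = set(zip(active_path, active_path[1:]))
    ap_srlgs = set()
    for link in ap_links:
        ap_srlgs |= link_srlgs.get(link, set())

    costs = {}
    for link, cost in link_cost.items():
        reverse = (link[1], link[0])
        if link in ap_links:
            costs[link] = float("inf")  # never reuse the AP's own links
        elif reverse in ap_links or link_srlgs.get(link, set()) & ap_srlgs:
            costs[link] = big_m  # risky: discouraged but still allowed
        else:
            costs[link] = cost
    return costs
```

Feeding these costs to any shortest-path routine gives a candidate BP; a candidate whose total cost reaches M has used at least one risky link and is therefore not SRLG-disjoint from the AP.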

PROtection using MultIple SEgments (PROMISE)

Basic idea of PROMISE
- Logically “divide” an AP into several, possibly overlapping sub-paths called active segments (AS’s), and then protect each AS with a backup segment (BS)
[Figure: an AP divided into AS 1 and AS 2, protected by BS 1 and BS 2]

Applications for PROMISE
- First proposed for non-SRLG networks
- Particularly effective in dealing with either real or avoidable traps
  - Recall that traps are more likely in SRLG networks

Most Bandwidth Efficient: I
- Reason: the flexibility it offers in choosing the appropriate AS’s and corresponding BS’s.
- AP1 and AP2 are not link-disjoint, so their BPs cannot share backup bandwidth
- But in PROMISE, BS1,1 and BS2,1 can share backup bandwidth
[Figure: AP1 divided into AS 1,1 and AS 1,2 with BS 1,1; AP2 with BS 2,1]

Most Bandwidth Efficient: II
- Inter-sharing: sharing between BS’s of different connections
- Intra-sharing: sharing between BS’s of the same connection, e.g. BS1 and BS2 share backup bandwidth on link c
[Figure: AS1, AS2, AS3 protected by BS1, BS2, BS3; nodes labeled 1-6 and links labeled a-g]

Faster Recovery and More Resilient
- Faster recovery: protects each AS using a shorter BS instead of protecting the entire AP using a longer BP (as in path protection)
- More resilient/robust: tolerates more multiple-failure scenarios than path protection (with the same or lower bandwidth consumption)
Overall Reliability
- Probability that each link is up: x = y = p = q = 0.8
- Path protection: 1 - (1 - 0.8^2)^2 ≈ 0.87
- PROMISE: (1 - (1 - 0.8)^2)^2 ≈ 0.92
[Figure: AP from s to d over links x and y; BS1 uses link p and BS2 uses link q]
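The two reliability figures can be reproduced with a few lines. The sketch treats 0.8 as the probability that each link is up, which is what the slide's formulas imply, and assumes independent link failures, a two-link AP, a two-link BP for path protection, and a one-link BS per segment for PROMISE. The function names are made up for this example.

```python
def path_protection_reliability(link_up: float, hops: int = 2) -> float:
    """AP and BP are each `hops` links in series; service needs either path."""
    path_up = link_up ** hops
    return 1.0 - (1.0 - path_up) ** 2


def promise_reliability(link_up: float, segments: int = 2) -> float:
    """Each one-link AS has a one-link BS in parallel; segments are in series."""
    segment_up = 1.0 - (1.0 - link_up) ** 2
    return segment_up ** segments


print(round(path_protection_reliability(0.8), 4))  # 0.8704
print(round(promise_reliability(0.8), 4))          # 0.9216
```

The gain comes from protecting each link locally: PROMISE survives one failure in each segment, while path protection fails as soon as the AP and the BP each lose any link.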

Other Benefits of PROMISE
Can Succeed When Other Approaches Fail
- Under routing policies, QoS constraints (e.g., hop limit on the AP and BP), or just APF
- Real/avoidable traps in SRLG networks
Can readily be applied to MPLS networks by extending the existing protocols for local repair/recovery in MPLS networks

Key Challenges in PROMISE
- Joint optimization of AP selection and the set of protecting BS's is extremely complex
- Even if the AP is found first, as in an APF-based heuristic, how should the AP be optimally divided into AS’s (and then corresponding BS's)?
- Harder than modeling the general multi-commodity flow problem: the number of BS’s, and the source and destination of each BS, are not known beforehand.
