Slide 1: CSE4884 Network Design and Management
Lecture 8: Reliability, Availability and Maintainability (RAM), also known as RMA
Lecturer: Ken Fletcher
Copyright Ken Fletcher 2004, Australian Computer Security Pty Ltd. Prepared for Monash University, Subj: CSE4884 Network Design & Management.
Key questions: When will it fail? How will it fail? How long will it be out of service?

Slide 2: References
- Web references: search for "MTBF".
- Tool vendors:
  - www.t-cubed.com (contains good information and demo tools)
  - www.relexsoftware.com (a good site)
- MIL-HDBK-217, the US military standard for this topic (search the web for it, or see http://www.t-cubed.com/ for more details).

Slide 3: Thinking Time
- Write down:
  - the average time between power supply failures (e.g. how many seconds, days, weeks or months between failures, on average); and
  - the average duration of an outage (e.g. how many seconds, minutes, hours or days, on average).
- Calculate your estimate of the availability of the power supply (a worked sketch follows below):
  Approximate availability = 1 - (average outage time / average time between failures)
- Is this a good or bad figure? We will review this figure later.
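
A minimal Python sketch of this arithmetic, using hypothetical estimates (roughly one two-hour outage every six months); the numbers are illustrative, not measurements:

    # Hypothetical estimates; substitute your own observations.
    hours_between_failures = 6 * 30 * 24    # roughly one outage every six months
    average_outage_hours = 2.0              # each outage lasts about two hours

    # The slide's approximation: 1 - (average outage time / average time between failures)
    approx_availability = 1 - (average_outage_hours / hours_between_failures)
    print(round(approx_availability, 5))    # about 0.99954, i.e. roughly 99.95% available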

Slide 4: RAM Issues
- RAM is concerned with failures and recovery of:
  - lines or circuits;
  - node equipment (switches, routers etc); and
  - terminal equipment;
  i.e. failures and recovery of all aspects of the network.
- This equipment and these circuits may fail due to:
  - environmental failures, e.g. electricity supply failures, floods, air conditioning failures;
  - hardware errors;
  - software errors; and
  - operator errors.
- Specifically excluded are:
  - user errors; and
  - deliberate attacks, e.g. data flooding, vandalism and deliberate crime.

Slide 5: Network RAM
- RAM is a through-life management issue, not simply a maintenance issue.
- RAM engineering:
  - starts with the user specification;
  - needs to be included in the system design; and
  - continues through to end-of-life, decommissioning and disposal of the system.
- Some points:
  - the term 'logistic support' is often used in this area;
  - 'maintenance' is a major activity within RAM engineering; and
  - proper 'network management' is crucial.

Slide 6: Aspects for 'Good' RAM?
What are the aspects leading to good levels of reliability, availability and maintainability of a 'system'?

Slide 7: Aspects of System-Level RAM
- Avoiding failures is a DESIGN issue:
  - use reliable components and sub-systems, i.e. ensure that each major subsystem or component is adequate.
- Minimise the impact of failures:
  - use maintainable and serviceable systems, i.e. ensure that failed or faulty equipment can be rapidly diagnosed and repaired or replaced;
  - ensure spare parts are available in a reasonable time frame;
  - maintenance personnel: well trained and available when needed;
  - well-planned procedures, e.g. load shedding and system restoration.

Slide 8: Avoid Failures (1)
- Use reliable components:
  - components with high 'Mean Time Between Failures' (MTBF) figures;
  - usually 'simple' components have very high MTBF;
  - various grades of component are available: domestic, commercial, light industrial, heavy industrial and Mil Spec (military specification).
- Operate systems and equipment within the environment assumed by the design, especially:
  - temperature, humidity and vibration limits; and
  - planned maintenance schedules.

Slide 9: Reliable Components
- Which components can be trusted to be reliable?
  - Proven brands, e.g. IBM, Hewlett Packard, CISCO etc, are safer to use than 'no-name' brands, simply because they are known (good reputation) and can be traced.
  - Reputable brands often publish MTBF figures for their components and systems.
- In general, the more trusted components are also more expensive to buy; the savings occur because they are less trouble operationally.

Slide 10: Graduated Reliability
[Table of relative reliability by component grade; not reproduced in this transcript.]
* The relative reliability indicators in the table are subjective.

Slide 11: Graduated Reliability (2)
Companies such as DY4 manufacture equipment covering specified ranges of reliability (see handouts).

Slide 12: Software Reliability
- It is hard enough to define hardware failures and reliability; defining software reliability is almost impossible.
- There is a lack of good measurements of software characteristics; the only repeatable, universal metrics seem to be complexity and function points, and these are not good!
- 'Proven by use' is a reasonable guide:
  - newly developed software is very 'buggy';
  - software which has been in commercial use by many, many sites for a considerable time is probably 'fit for purpose'.
- 'Formally proven' software can be obtained, but the cost is extreme ($1,500 to $2,000 per SLOC).

Slide 13: Minimise the Impact of Failures
- Network failures will occur; it is a matter of 'when' failures will occur, rather than 'if'.
- We need to minimise the impact of failures.
- Areas generally considered:
  - application design;
  - network design;
  - network management; and
  - network operations.

Slide 14: Application Design
- Applications can be designed to be sympathetic to the network, e.g.:
  - edit input commands and data for validity;
  - hold most or all data locally, with synchronisation to master databases taking place occasionally in an 'offline', non-critical mode;
  - even if data must be held remotely, application designs can often be made less sensitive to network failures by holding critical data locally and non-critical data remotely;
  - applications can assist by staggering massive file transfers (e.g. database backups and synchronisations) so that these are spread over time;
  - applications may be able to prioritise data transfer requirements (especially in emergency situations).

Slide 15: Network Design
- Avoid single points of failure:
  - communications work generally concentrates traffic into node switches and inter-node circuits;
  - duplication of equipment and circuits is very expensive.
- Make sensible use of redundant equipment:
  - much redundant equipment is not appropriate today, e.g. assumptions about power supply reliability are often based on the gut feel of 30 years ago;
  - consider hot, warm or cold standby equipment;
  - redundant equipment may be very expensive and create more problems than it fixes.
- Consider partial duplication: identify the most critical aspects and protect these by installing redundant equipment.

Slide 16: Network Management
- Planning, planning and more planning is the key!
- Have plans developed, approved and printed, and people trained, before incidents occur:
  - it is too late to do this when an incident occurs;
  - the situation becomes too confused to make sensible decisions;
  - MOST CRITICAL: someone must be delegated to take charge of the situation.
- Plans needed:
  - load-shedding plans, for partial failures;
  - a network restoration plan, with priorities for restoration after a failure (who or what is restored first or last? who has the authority to make decisions?);
  - maintenance plans and procedures, including fast response to maintenance call-outs for failures.

Slide 17: Network Operations
- Back up software and critical data.
- Active oversight and inspection of equipment and traffic.
- Active supervision of external maintenance personnel.
- Record keeping of:
  - fault reports and fault corrective actions;
  - system and network reconfiguration actions;
  - visitors to computer rooms, network hubs etc;
  - external events and incidents, even if the network is not directly affected, e.g.:
    - power supply changes, transients and outages;
    - air conditioning outages and maintenance activity.

Slide 18: Network Operations (2)
- Operations should monitor and perform 'trend analysis' on:
  - normal operations (so that 'abnormal' becomes recognisable);
  - faults: type and frequency;
  - maintenance call response times;
  - maintenance service times.
- Resources needed:
  - trained operations staff who can diagnose problems, know who to call (and have the authority to do so), and can oversee the maintenance technicians;
  - good operational diagnosis and management tools;
  - trained maintainers from the maintenance organisation;
  - good documentation and guides;
  - good procedures (see the next slide).

Slide 19: Procedures
Good procedures are needed for:
- normal operations;
- maintenance call-outs;
- maintenance activity;
- recovery and restoration of services;
- configuration management (change control);
- auditing the system configuration;
- adding new starters;
- modifying privileges of existing personnel;
- cleaning up when someone leaves, either in benign circumstances (easy) or in difficult circumstances (i.e. someone leaves in disgrace).

Slide 20: Issues for Consideration
- Mirrored disks and RAID configurations only address hardware errors.
- Automatic restart after a power failure is not necessarily feasible for large systems.
- Software and application restart times may be 30 to 120+ minutes.
- Hot standby is expensive to implement; checking heartbeats and maintaining synchronisation across multiple systems is difficult.
- How to define failure:
  - difficult to define in large systems;
  - e.g. if N terminals are connected and one fails, is this a 'system failure'? Probably yes if N = 1, but not if N = 10,000.
- The issue really comes down to balancing the cost of an outage against the cost of buying and maintaining additional equipment.

Slide 21: Early Space Shuttle Experience
- A computer was required to control re-entry into the atmosphere, but life is tough up there and a single computer may fail.
- So, install two computers. But if they disagree, which is correct?
- So, install three computers with 'voting' logic (more complexity). The above addressed hardware failures only.
- But three computers are three times as likely to fail as one computer, therefore a fourth was installed to track the operational computers and automatically switch in as needed (yet more complexity).
- A fifth computer was then fitted, with software developed to the same specifications by another company; #5 was to be switched in manually if needed.
- The first launch was aborted 20 minutes before takeoff because the five computers did not synchronise (later analysis showed a 1-in-32 chance of not synchronising; the problem had occurred once in testing but was ignored because it did not recur).

Slide 22: Analytical Approaches
Some simple calculations follow.

Slide 23: Terminology (Not So Common)
- MTBF, MTTR and Availability are the figures commonly used.
- MTBF is Mean Time Between Failures:
  - the mathematical (mythical?) average time between failures;
  - typical range is 1,000 hours (complex systems) to 300,000+ hours (a simple system or component built to MIL-STD requirements);
  - MTBF figures are generally sourced from vendors.
- MTTR is the Mean Time To fix a failure; 'fix' has many definitions: replace, repair or restore.
  - 'Repair' and 'Replace' are the terms used by vendors: the time taken to fix the component once the technician and spare parts are available on site.
  - 'Restore' is more meaningful to the user: the time to restore the service to an operational state.
- Availability is the percentage of time that a component or system is operationally usable, i.e. (time available for operations) / (total time).

Slide 24: Failure Rates over the Life of Equipment
- Most components exhibit a 'bathtub' characteristic curve: a high failure rate initially, which then settles down, until age catches up near the end of their life.
- 'Burning in' refers to running a component for the first few hours until the relatively flat section of the curve is reached. Reputable manufacturers usually perform burn-in as part of testing.
[Bathtub-curve figure: failures per unit time versus component age in years of operation; not reproduced in this transcript.]

Slide 25: Failure Rates (2)
- Over the relatively flat portion of the bathtub curve, failures are random and generally show a Poisson (exponential) characteristic.
- If the average time between failures is M hours (usually called the Mean Time Between Failures, or MTBF), then the probability that a component will operate for a period greater than t hours is (see the sketch below):
  P(>t) = e^(-t/M)
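
A minimal Python sketch of this formula; the 10,000-hour MTBF is an illustrative assumption, not a figure from the slides:

    import math

    def prob_surviving(t_hours, mtbf_hours):
        # Probability that a component operates for longer than t_hours,
        # assuming a constant (exponential) failure rate with the given MTBF.
        return math.exp(-t_hours / mtbf_hours)

    mtbf = 10_000    # hours; illustrative assumption
    for multiple in (1, 2, 3):
        print(multiple, round(prob_surviving(multiple * mtbf, mtbf), 3))
    # Prints roughly 0.368, 0.135 and 0.050, close to the ~38%, ~12% and ~5%
    # figures quoted on slide 27.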

Slide 26: Probability of a Component Operating
[Graph of P(>t) = e^(-t/M) against t; not reproduced in this transcript.]

Slide 27: Probability of a Component Operating (2)
- Note from the graph:
  - the probability of operating for longer than M is about 38%;
  - this means that about 62% of components fail within the period M, i.e. almost two-thirds fail within the MTBF;
  - most components fail before the MTBF figure is reached.
- Approximately:
  - 62% fail before the MTBF is reached;
  - 38% of components exceed the MTBF before they fail;
  - 12% exceed 2 x MTBF, i.e. go on for at least twice the MTBF;
  - 5% exceed 3 x MTBF; owners of these are very happy people!

Slide 28: MTBF and MTTR Usage
- MTBF is Mean Time Between Failures:
  - the mathematical (mythical?) average time between failures;
  - typical range is 1,000 hours (complex systems) to 300,000+ hours (a simple system or component built to MIL-STD requirements);
  - MTBF figures are generally sourced from vendors;
  - MTBF is commonly applied, even if the term is not well understood.
- MTTR is the Mean Time To fix a failure; 'fix' has many definitions: replace, repair or restore.
  - 'Repair' and 'Replace' are the terms used by vendors: the time taken to fix the component once the technician and spare parts are available on site.
  - 'Restore' is more meaningful to the user: the time to restore the service to an operational state.
  - Mean Time To Restore is often referred to as Mean Down Time (MDT).

Slide 29: MTTR (where R = Restore Service)
- The time to restore a service has many components.
- Generally take MTTR figures as the vendor's MTTR (repair time) plus the other delays:

  Action                                                Typical times
  Detect fault and lodge call-out                       10 to 30 minutes
  Response time by technician or service organisation   120 to 240 minutes (2 to 4 hours)
  Repair time by technician (if parts are available)    30 minutes
  Reboot system and restart application                 15 to 120 minutes
  Total time to 'fix'                                   3 to 7+ hours

Slide 30: Availability
- Availability is defined as the probability of a system (or component etc) being available when needed, usually expressed as a percentage of total time.
- Mathematically (see the sketch below):
  Total time = MTBF + MTTR
  Availability = (time operational) / (total time)
               = MTBF / (MTBF + MTTR)
               = 1 - (MTTR / (MTBF + MTTR))
- E.g. with an MTBF of 1,000 hours and an MTTR of 1 hour:
  Availability = 1,000 / (1,000 + 1) = 99.9% (approx)
- Availability is always less than 100%.
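
A small Python sketch applying the formula, using the MTBF and MTTR figures from this slide:

    def availability(mtbf_hours, mttr_hours):
        # Fraction of total time the item is operational: MTBF / (MTBF + MTTR).
        return mtbf_hours / (mtbf_hours + mttr_hours)

    print(availability(1000, 1))    # 0.999000999..., i.e. roughly 99.9%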

Slide 31: Thinking Time Revisited
- Earlier you were asked to consider your perceptions of the average time between failures of the power supply, and the average duration of an outage.
- We will all have different perceptions.
- The reality is that the availability of the power supply here is probably about 99.999%. This means that outages are typically measured in tens of seconds of total outage time per year.

Slide 32: System Availability
- Components in a system may be grouped as:
  - logically serial (also known as cascade), where all components must be operational or the group has failed (e.g. a car, which requires engine, gearbox and wheels); group availability = product of the independent component availabilities;
  - logically parallel, where several components operate 'in parallel' but not all are required for the group to be operational (e.g. a diesel generator for when the mains power fails); group availability is calculated via the (1 - unavailability) approach;
  - some combination of serial and parallel (e.g. any complex machine exhibits this arrangement).
- The following slides cover these in more detail.

Slide 33: Logically Serial Components (1)
- All components must be operational for the system to work.
- Diagram: three components in series, all needed: A (server), B (router) and C (cables), with availabilities of 0.9, 0.7 and 0.8.
- Availability as a group requires that all are operational (see the sketch below):
  Avail(Group) = product of the component availabilities
               = Avail(A) * Avail(B) * Avail(C)
               = 0.9 * 0.7 * 0.8
               = 0.504 = 50.4%
- Simple formula for a group of N identical components:
  Avail(group of identical components) = Avail(one component)^N
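
A Python sketch of the serial (cascade) rule; the first call uses the availabilities from this slide, the second uses an illustrative 0.99 figure for the identical-component formula:

    from math import prod

    def serial_availability(availabilities):
        # All components must be up, so multiply the individual availabilities.
        return prod(availabilities)

    print(serial_availability([0.9, 0.7, 0.8]))   # 0.504, i.e. 50.4% (server, router, cables)

    # N identical components in series: Avail(group) = Avail(one) ** N
    print(round(0.99 ** 10, 3))                   # ~0.904 for ten components of 0.99 each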

Slide 34: Logically Serial Components (2)
- Implications of logically serial components:
  - the group availability must be less than the availability of the least reliable component;
  - the more units involved, the lower the probability that the group is operational; with too many components the group becomes too unreliable to consider, unless each component is extremely reliable;
  - complex systems are inherently less reliable than simple systems, unless work is undertaken to improve the situation.

Slide 35: Logically Parallel Components (1)
- Only some components from the group are required to be operational.
- Avail(Group) depends on which components must be operational.
- Diagram: three power sources in parallel, A (mains power), B (battery) and C (generator), with availabilities (P) of 0.9, 0.7 and 0.8 and hence unavailabilities (Q) of 0.1, 0.3 and 0.2.
- If any one of A, B or C is sufficient, then (see the sketch below):
  Avail(Group) = 1 - (combined unavailability of the group)
               = 1 - ((1 - 0.9) * (1 - 0.7) * (1 - 0.8))
               = 1 - (0.1 * 0.3 * 0.2)
               = 1 - 0.006 = 0.994
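
A Python sketch of the '1 minus combined unavailability' rule for the case where any single component keeps the group up, using this slide's figures:

    from math import prod

    def any_one_of_availability(availabilities):
        # The group is up if at least one component is up:
        # 1 minus the product of the individual unavailabilities.
        return 1 - prod(1 - a for a in availabilities)

    print(any_one_of_availability([0.9, 0.7, 0.8]))   # 0.994 (mains, battery, generator)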

Slide 36: Logically Parallel Components (2)
- Implications of logically parallel components, where not all are required for the group to be operational:
  - the group availability is greater than the availability of the most reliable component;
  - the more components 'in parallel' that are allowed to be in a failed state without declaring the group failed, the higher the probability that the group is operational, e.g. a 'one needed out of three' arrangement is better than 'two out of three' (but more expensive);
  - many configurations are possible;
  - the calculations can be laborious!

Slide 37: Identical Components in Parallel (1)
- Consider the case of three identical components A, B and C, each with availability (P) 0.9 and hence unavailability (Q) = 1 - 0.9 = 0.1.
- Avail(Group) depends on which components must be operational.
- If only one of A, B or C is required, then:
  Avail(Group) = 1 - (combined unavailability of the group)
               = 1 - (0.1 * 0.1 * 0.1)
               = 1 - 0.001 = 0.999
  or, equivalently, 1 - Q^3 = 1 - 0.1^3 = 1 - 0.001 = 0.999

Slide 38: Identical Components in Parallel (2)
- The following two slides show a set of diagrams of common configurations of multiple components in parallel, and the formula corresponding to each when the components are identical.
- Space constraints prevent all terms in some formulas from being shown, e.g. '+ (3 terms)'. In these cases, look at the preceding terms, determine the pattern of P and Q usage, and repeat it for the missing terms.
- The spreadsheet Reliabli.xls (from the web page) calculates these formulas.

Slide 39: Identical Components in Parallel (3)
[Diagrams of configurations 0 to 7: basic, 1 out of 2, 1 out of 3, 2 out of 3, 3 out of 4, 4 out of 5, 3 out of 5 and 2 out of 4, with their redundancy percentages; not reproduced in this transcript.]

Slide 40: Identical Components in Parallel (4)
The formula for each configuration, where Pn is the availability and Qn the unavailability of component n (a binomial sketch follows below):

  #  Configuration (redundancy)      Avail(Group)
  0  Basic concept                   P1
  1  1 out of 2 (100% redundancy)    1 - Q1Q2
  2  1 out of 3 (200% redundancy)    1 - Q1Q2Q3
  3  2 out of 3 (50% redundancy)     P1P2P3 + P1P2Q3 + P1Q2P3 + Q1P2P3
  4  3 out of 4 (33% redundancy)     P1P2P3P4 + P1P2P3Q4 + P1P2Q3P4 + P1Q2P3P4 + Q1P2P3P4
  5  4 out of 5 (25% redundancy)     P1P2P3P4P5 + P1P2P3P4Q5 + P1P2P3Q4P5 + (2 terms) + Q1P2P3P4P5
  6  3 out of 5 (66% redundancy)     P1P2P3P4P5 + P1P2P3P4Q5 + (3 terms) + Q1P2P3P4P5 + P1P2P3Q4Q5 + (8 terms) + Q1Q2P3P4P5
  7  2 out of 4 (100% redundancy)    P1P2P3P4 + P1P2P3Q4 + (2 terms) + Q1P2P3P4 + P1P2Q3Q4 + (4 terms) + Q1Q2P3P4

NOTE: the spreadsheet Reliabili.xls on the web page calculates these.
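
For identical components, the same k-out-of-n results can be obtained from a binomial sum rather than enumerating the terms; a Python sketch, using the 0.9 availability figure from slide 37:

    from math import comb

    def k_out_of_n_availability(k, n, p):
        # Availability of a group needing at least k of n identical components,
        # each with availability p (binomial sum over the number of working components).
        q = 1 - p
        return sum(comb(n, i) * p**i * q**(n - i) for i in range(k, n + 1))

    print(round(k_out_of_n_availability(1, 3, 0.9), 3))   # 0.999, matches 1 - Q^3
    print(round(k_out_of_n_availability(2, 3, 0.9), 3))   # 0.972 for a 2-out-of-3 group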

Slide 41: Spreadsheet Relibabli.xls
[Screenshot of the spreadsheet; not reproduced in this transcript.]
Note: calculations for 'serial components' are lower down the sheet.

Slide 42: Hot and Cold Standby
- The spreadsheet shows hot standby and cold standby MTBF figures for groups in various configurations.
- The MTBF and MTTR values are set to show the effects; real systems have much higher MTBF than this.
  - Hot standby assumes all units are operating, and hence are likely to fail even if they are not being used for operations.
  - Cold standby assumes units are not operating until required, and hence are not likely to fail while not operational. Cold standby also assumes instant changeover.

Slide 43: Typical Availability Diagrams (1)
- Real systems have both serial and parallel groupings.
- E.g. a car requires engine, gearbox and wheels (forget steering, heater, doors, lights etc), but the wheels are a group where four out of five are required (four road wheels plus the spare).
- The problem is difficult, but it can be simplified.
[Diagram: Engine (1/1) -> Gearbox (1/1) -> Wheels (4/5).]

Slide 44: Typical Availability Diagrams (2)
- The car diagram needs to be simplified by converting the five 'wheel' components into a single 'wheel group', i.e. solve the wheels (four out of five) problem first.
- The Avail(CAR) problem then becomes: CAR = Engine, Gearbox and Wheels (as a group) in series.
- The problem is now a simple serial system with three components. From this we can calculate, at system level (i.e. for the whole car):
  - availability;
  - MTBF; and
  - MTTR.

Slide 45: Availability Example
- A car requires an engine, gearbox and wheels to be operational (CAR = Engine, Gearbox and Wheels-as-a-group in series).

                             Engine          Gearbox         Wheels (group)
  Given MTBF                 600 hours       2,000 hours     500 hours
  Given MTTR                 2 hours         5 hours         1 hour
  Calculated availability    A(e) = 0.9967   A(g) = 0.9975   A(w) = 0.9980

- Availability of the group (CAR) = A(e) * A(g) * A(w) = 0.9967 * 0.9975 * 0.9980 = 0.9922 (see the sketch below).
- I.e. the availability of the CAR is less than the availability of any of its component items; the more items, the lower the availability.
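
A Python sketch reproducing this slide's figures by combining the availability formula with the serial rule (the dictionary keys are just labels chosen here):

    from math import prod

    def availability(mtbf, mttr):
        return mtbf / (mtbf + mttr)

    parts = {"engine": (600, 2), "gearbox": (2000, 5), "wheels_group": (500, 1)}
    avails = {name: availability(mtbf, mttr) for name, (mtbf, mttr) in parts.items()}
    print({name: round(a, 4) for name, a in avails.items()})   # 0.9967, 0.9975, 0.998
    print(round(prod(avails.values()), 4))                     # 0.9922 for the whole car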

Slide 46: MTBF/MTTR for System 'CAR'
- Select a convenient period over which to calculate (the lowest common multiple is good); let us say 12,000 hours.
- In this time the CAR will require approximately:
  - 12,000 / 600 = 20 engine repairs or services at 2 hours each = 40 hours;
  - 12,000 / 2,000 = 6 gearbox repairs or services at 5 hours each = 30 hours;
  - 12,000 / 500 = 24 wheel services at 1 hour each = 24 hours.
- Assuming independent failures, the total is 50 outages in 12,000 hours, taking up (40 + 30 + 24) = 94 hours.
- MTBF(CAR) = 12,000 / 50 = 240 hours
- MTTR(CAR) = 94 / 50 = 1.88 hours
(A generalised sketch of this aggregation follows below.)
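
A Python sketch of the same aggregation, generalised to any set of serial components with independent failures, using this slide's figures:

    def system_mtbf_mttr(components, period_hours=12_000):
        # Aggregate MTBF and MTTR for serial components with independent failures,
        # by counting the expected outages over a convenient calculation period.
        failures = {name: period_hours / mtbf for name, (mtbf, _) in components.items()}
        total_failures = sum(failures.values())
        total_downtime = sum(failures[name] * mttr for name, (_, mttr) in components.items())
        return period_hours / total_failures, total_downtime / total_failures

    car = {"engine": (600, 2), "gearbox": (2000, 5), "wheels_group": (500, 1)}
    print(system_mtbf_mttr(car))   # (240.0, 1.88) hours, matching the slide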

Slide 47: Availability and Redundant Equipment
- With redundant equipment installed, there are several possible states:
  - fully operational: all equipment, including the redundant equipment, is ready;
  - operating at risk: some equipment has failed, but the full load is being carried, e.g. a car just after replacing a flat tyre with the spare wheel;
  - degraded mode: only partial, not full, operations are being conducted due to various equipment outages;
  - not operational: the system is degraded so badly that it is defined as 'not operational'.
- We need to determine the availability for the group of equipment and, from that, an aggregate MTBF.

Slide 48: Summary
- The general term for this topic is 'Integrated Logistics Support' (ILS).
- ILS is a specialised branch of engineering.
- This lecture covered only some of the basic concepts; call in a specialist for large or mission-critical systems.
- Having a system with good RAM is more than calculations: it requires through-life management of design, maintenance, operations and change control (configuration management).

