
Slide 1: CSE4884 Network Design and Management
Lecturer: Ken Fletcher
Lecture 12 - Revision Set 2. This set consists of extracts from Lectures 6 through 8.
(Copyright Ken Fletcher 2004, Australian Computer Security Pty Ltd. Prepared for Monash University, Subj: CSE4884 Network Design & Management.)

Slide 2: Lecture 6 - Loss Systems Theory
When a system cannot or does not queue traffic items. For example: non-computerised telephone exchanges.

Slide 3: Circuit Switching Fundamentals
- Loss Systems Theory has traditionally been considered a telephony problem, but it is becoming more important in computer communications design.
- A call may result in LOSS (or WAIT) if:
  - All destinations are busy (known as 'called party busy'); or
  - All switching capacity is taken (known as 'network congestion').
(Diagram: Calling Party -> Switched Network -> Called Party)

Slide 4: Examples of Loss Systems
- Traditional examples of loss-based systems:
  - Telephone networks (how many circuits or trunks are needed?)
  - PABX design (how many incoming/outgoing circuits?)
- Computer examples (how many items are needed?):
  - Dial-up ports for input data
  - Dial-up fax outlets
  - Virtual circuits on X.25 networks
  - Switch capacity (sizes of tables)
  - Operating system tables, such as the numbers of tasks or I/O devices permitted (eg how many entries are required in the tables?)

Slide 5: Holding Times
- In circuit switching, the concerns are:
  a. Arrival rate of calls
  b. Holding times (equivalent to service time)
- For random holding times with mean h units, the probability of a call exceeding t units is:
  P(>t) = e^(-t/h)
- File 'Callhold.xls' on the web pages computes a table from this formula using a wide range of average holding times. However, a graph is easier...

Slide 6: Probability of Exceeding Holding Time
Consider a system with an average holding time of 3 minutes. What percentage of calls exceed 3 minutes?
- Ratio of t to mean holding time = 3/3 = 1
- The percentage exceeding this ratio is e^(-1) = 36.8%
- Therefore 63.2% are shorter than the average time
In fact: 13.5% exceed 2 x the average, and 5% exceed 3 x the average.

Slide 7: Measures of Traffic Loading - Erlangs
- The key factor is the concept of OCCUPIED CAPACITY, eg 2 x 10-minute calls are equivalent to 1 x 20-minute call as far as 'occupation' is concerned.
- Traffic loading is measured in ERLANGS (abbreviation 'E'):
  Erlangs of traffic (or traffic load) = (calls per unit time) * (average holding time per call)
  eg 20 calls in 60 minutes, average holding time = 3 minutes: (20/60) * 3 = 1 Erlang = 1E
- If all the traffic items, neatly stacked end-to-end, totally occupied the link for the period considered, then there is one Erlang of traffic.
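As a minimal sketch of this definition in Python (the function name is my own, not from the course spreadsheets):

```python
def offered_load_erlangs(calls, period_minutes, avg_holding_minutes):
    """Traffic load in Erlangs = arrival rate * average holding time."""
    arrival_rate = calls / period_minutes      # calls per minute
    return arrival_rate * avg_holding_minutes  # dimensionless (Erlangs)

print(offered_load_erlangs(20, 60, 3))   # 1.0 E
print(offered_load_erlangs(30, 120, 4))  # 1.0 E
print(offered_load_erlangs(20, 60, 9))   # 3.0 E
```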

Slide 8: Erlang Calculation Examples
- 30 calls in 120 minutes, average holding time = 4 minutes: (30/120) * 4 = 1 E
- 20 calls in 60 minutes, average holding time = 9 minutes: (20/60) * 9 = 3 E
- 5 calls in 30 minutes, average holding time = 3 minutes: (5/30) * 3 = 0.5 E
- 4 Erlangs of traffic is measured over a period of 3 hours. There were 120 calls during that time. What was the average holding time? (6 minutes)

Slide 9: Blocking
- When a call is presented to a switching system that is busy, it can be handled in one of three ways:
  - Lost (busy tone); or
  - Delayed (queued, with a message or music); or
  - It can pre-empt an existing call (chop off the pre-existing call) - a rare situation.
- An important simplification is made by assuming that the 'busy-ness' of the system and the arrival rate of calls are independent. The assumptions are that:
  a. calls are drawn from an infinite population; and
  b. the period of interest represents a consistent probability of activity (that is, the average probability of arrival of traffic over the 'busy period' is constant, and is not influenced by short-term highs and lows of activity).

Slide 10: Degree of Congestion (DoC)
- The important issue is the probability of a call being serviced or lost: the 'Degree of Congestion' (formerly called 'Grade of Service').
- Defined as the percentage of calls which fail (during the 'busy period').
- May differ for different applications, eg:
  - 0.5% (1:200) for outgoing voice
  - 0.1% (1:1000) for incoming voice/fax
  - 5% (1:20) for an outgoing fax line with auto-dialling
  - 2% (1:50) for an intersite tie line
  - 5% to 20% for terminals in a library or 'terminal room'

Slide 11: Calculations for Loss Systems

Slide 12: Erlang-B Formula
- If blocked calls are lost, then for an offered load of A Erlangs the probability of k servers busy out of M is given by the Erlang-B formula:
  P(k) = (A^k / k!) / (sum from i = 0 to M of A^i / i!)
- Putting k = M gives the probability of all servers busy, that is, a blocked situation.
- Spreadsheet model 'ERLANGB.XLS' implements this formula.
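As an illustration (not the ERLANGB.XLS model itself), the blocked-call case k = M can be computed with the standard numerically stable recurrence B(0) = 1, B(m) = A*B(m-1) / (m + A*B(m-1)), which avoids large factorials. A Python sketch with my own function name:

```python
def erlang_b(servers, offered_load):
    """Probability that all `servers` are busy (a call is blocked)
    for `offered_load` Erlangs, via the standard recurrence."""
    b = 1.0
    for m in range(1, servers + 1):
        b = offered_load * b / (m + offered_load * b)
    return b

print(erlang_b(3, 1.3))  # ~0.10, matching example 1.1 on Slide 15
```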

Slide 13: Erlang B Graphs
- These problems are normally solved by reference to graphs, to calculators available from the Internet (eg http://www.erlang.com), or to the spreadsheet model 'ERLANGB.XLS' on the subject web pages.
- Graphs of the Erlang B function were handed out during lectures. As well, the spreadsheet model 'ERLANGB.XLS' may be used to plot tables or graphs as you wish. This model is available via 'Tools and Toys' on the subject web page.
- The next slide shows the curves which apply to relatively low traffic loads, that is, loads from 0.2E to 2.4E. (A later slide shows graphs for higher rates of activity.)

Slide 14: Erlang B Graph - Low Offered Traffic
(Graph: Erlang B blocking probability curves for 1 to 8 servers.)

Slide 15: Erlang B - Examples using Graphs (1)
Using the curves on the previous slide, answer the following questions:
1.1 How many servers are required to provide 10% DoC with an offered load of 1.3E? (3)
1.2 What is the DoC for 3 servers handling 0.6E? (2%)
1.3 How much offered traffic causes 5% congestion with 2 servers? (0.4E)
1.4 What is the DoC which allows 90% of offered traffic to pass? (10%)
1.5 How many lines are required for outgoing voice which offers 0.7E of traffic? (4)

Slide 16: Erlang B - Examples using Graphs (2)
2.1 How many servers would you recommend to handle 0.5E at 1% DoC? (3)
2.2 How many servers are required to handle 1.2E at 1% DoC? (5)
2.3 You have two lines for customers to call your shop to place orders. Customers are complaining of poor service for their calls to your shop. A traffic survey shows that you receive 12 calls in the 'busy hour', and each call averages 4 minutes.
- Do your customers really experience excessive congestion? (YES - approx 15% can't get through first time)
- What would be the effect of adding another line? (approx 4% would still experience congestion - not good)
- What would be the effect of adding two lines? (approx 0.75% would experience congestion - acceptable)
- Is there anything else that you could do? (Reduce service time - for example, reducing service time to 3 minutes without adding any lines would reduce the congestion from 15% to 10%.)
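These answers can be checked against the hypothetical erlang_b sketch given after Slide 12; the offered load is 12 calls per hour at 4 minutes each, ie (12/60) * 4 = 0.8E:

```python
load = (12 / 60) * 4                       # 0.8 E offered by callers
for lines in (2, 3, 4):
    print(lines, erlang_b(lines, load))    # ~0.15, ~0.039, ~0.0077
print(2, erlang_b(2, (12 / 60) * 3))       # 3-minute calls: ~0.10
```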

Slide 17: Erlang B - Examples using Graphs (3)
3.1 You need to establish an Internet link into your organisation. However, because of security constraints, the Internet terminals will not be connected to your main network, but will form a separate network with several connected terminals. These terminals may be located in a single area, or may be dispersed around the building. Staff wanting to use the Internet will leave their normal desks and walk over to use the Internet terminal 'in the corner'. The staff will be frustrated if they cannot immediately get a terminal. Management have decided that an 80% probability of finding an available terminal is desirable.
- There are 40 staff, who each want to connect to the Internet for two hours per week.
- The office is open five days per week, from 8:00am (0800) to 6:00pm (1800), with most people wanting to access the Internet between 9:00am (0900) and 5:00pm (1700).
- Estimate how many Internet terminals should be installed. (3)
- Is it best to locate these in the same area, or to spread them around the building? Briefly give reasons for your answer. (Same area, because this allows a larger 'pool' of units to use. Also, staff will use a piece of paper or a board to write up the 'queue' of people waiting to use the terminals. This changes it from a loss system to a queueing system, which makes more efficient use of terminals and people.)

Slide 18: Erlang B Graph - Larger Volumes
(Graph: blocking probability curves for 30 to 58 servers; the number of servers is shown beside each line.)

Slide 19: Erlang B - Examples using Graphs (4)
- A new PABX was installed recently with 50 incoming and 50 outgoing exchange lines. Analysis of the figures for the busiest 15-minute period each day shows the average activity during these busy periods as follows:

                                    Incoming   Outgoing
  Number of calls                        102        130
  Average time per call (seconds)        220        280

- Incoming lines may be converted to outgoing lines at no cost. Similarly, outgoing lines may be converted to incoming lines at no cost.
- Acquisition costs for new lines are a $500 installation charge. There is no refund for terminating (abandoning) a line.
- The annual costs (rental and maintenance) per line and port are: incoming $450, outgoing $375.
- Assuming the figures are representative, what changes would you recommend? (40 inwards, 54 outwards)

Slide 20: Erlang-B Approximations
- For Erlang-B with offered loads E in the range of 5 to 50 Erlangs, the approximate number of ports or servers needed is given by these formulas:
  - For 1% DoC: number of servers needed = 5.5 + 1.17E
  - For 0.1% DoC: number of servers needed = 7.8 + 1.28E
- Other formulae are available for other DoC values.

Slide 21: Erlang-B - Examples by Estimation
- Dial-up ports on a computer system:
  - Large population
  - 100 calls in the busy hour
  - Average call time is 15 minutes
  How many ports are required to provide 1% and 0.1% DoC?
- Offered traffic = (100/60) * 15 = 25 Erlangs
  - For 1% DoC: 5.5 + (1.17 * 25) = 34.75. Use 35.
  - For 0.1% DoC: 7.8 + (1.28 * 25) = 39.8. Use 40.
- Now check against the graphs. Are there discrepancies? Which is correct? (Remember, these formulas are approximations only.)
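The approximation can be compared against the full recurrence in code (again using the hypothetical erlang_b helper from Slide 12; servers_approx_1pct is my own name):

```python
def servers_approx_1pct(load_erlangs):
    # Linear approximation for 1% DoC, valid roughly for 5E to 50E
    return 5.5 + 1.17 * load_erlangs

load = (100 / 60) * 15                    # 25 E
n = round(servers_approx_1pct(load))      # 35 servers
print(n, erlang_b(n, load))               # blocking close to the 1% target
```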

Slide 22: Some Thoughts
- We use graphs, mathematical models and simulation models to help us understand and predict the behaviour of 'real systems'.
  - Queueing theory, loss systems theory, Erlang B and C calculators, graphs, Gamma tables etc are only aids to help us understand and predict the 'real world'.
- These are not the real system - they are representations that we create given the resources (raw data, software tools, knowledge etc) and time available.
- Expect that the 'real system' will differ from the predicted behaviour - and remember that 'garbage in means garbage out'.
  - The better you can model the 'real world' (ie raw traffic flows and tools used), the more accurately you can predict the future.

Slide 23: Some Web Pointers
- Good Erlang calculators are available on the web. Try:
  - http://64.78.47.103/AspectCalc/Index.asp for the Aspect Erlang C calculator. This requires a reboot when installing it.
  - http://www.erlang.com/ for the Westbay calculators (eight calculators available): http://www.erlang.com/calculator/erlb/ for Erlang B, and http://www.erlang.com/calculator/erlc/ for Erlang C.
  - http://www.kooltoolz.com (NOTE the 'z' on 'kooltoolz'). The 'light' tool is freeware, but it takes 12.5MB to download and must be installed.

Slide 24: Lecture 7 - Effects of Line Errors
Line errors occur on all communications circuits. Their impacts depend on many factors.

Slide 25: References
- Reference: James Martin, "Systems Analysis for Data Transmission", Chapter 35. (Copies were handed out during the lectures.) The examples in the handout date from the late 1960s, but the concepts are unchanged.

Slide 26: Line Errors
- Also known as 'hits' on the line.
- Show as discrepancies between transmitted and received data, ie received data is not the same as transmitted data.
- Caused by:
  - Atmospheric anomalies (particularly radio links):
    - Heavy rain attenuates microwave and satellite signals
    - Lightning flashes introduce spurious data
    - Ionospheric non-conformities cause HF radio problems
  - Random electronic noise on wireline and fibre-optic circuits, eg amplifier noise, thermal noise etc
  - Loose connections or poor-quality links causing 'static', eg inconsistent electrical or optical paths
  - Loose cable connections causing line errors when someone walks nearby

Slide 27: BER and 'Errored Seconds'
- On localised digital circuits (fibre optics, local wirelines, and 'local' radio work such as UHF, microwave and infra-red), the tendency is for errors to occur in sporadic bursts:
  - ie long periods of operation at a very low error rate, interspersed with short periods of error conditions
  - Usually expressed as 'error-free seconds' or 'severely errored seconds'
- On long-distance circuits and analog circuits (HF/VHF radio, satellites, undersea cables etc), the tendency is for errors to be random and sporadic.
- As always in practical situations, the truth is somewhere in between, as both types of errors occur in both circumstances.
- Analytical work concentrates on Bit Error Rate (BER):
  - BER assumes that errors occur randomly (spontaneously)
  - Assume that 'severely errored seconds' means 'no data' for that period

Slide 28: What Happens with Line Errors?
- Errors can be:
  - Ignored
    - Sometimes valid when other mechanisms exist to resolve any problems, eg the human brain can cope with very poor conditions for textual work or voice transmissions
  - Detected and re-transmitted
    - Requires a detection mechanism which causes a request for re-transmission, eg a CRC or redundant data adequate to detect that there is an error (OSI layers 1 to 3), or application-level data editing to indicate 'data out of acceptable ranges' (eg an encoded meteorological report of snow in Singapore could be considered an error)
  - Detected and corrected without retransmission
    - Requires a detection mechanism as above, AND sufficient redundant data to correct the problem

Slide 29: "Some circuits are error free"
- Maybe true - in which case you could ignore the issues. More likely false - it's a matter of 'when', not 'if'.
- Errors will always occur sometime. If your task is not 'mission-critical', you might simply fix the problem when it occurs (eg re-boot the system etc).
- Some modems/systems promise 'error free' or 'error corrected' traffic, eg many radio modems:
  - True - but that means that the error handling is performed in the firmware/hardware (OSI layers 1 and 2 and below)
  - The errors and their handling are still degrading circuit throughput
  - Many devices simply reduce the line speed and try again (and again...), eg HF radio modems, domestic V.90 56kbps dial-up modems
  - Very high error rates exceed the capability to correct errors, and the residual errors are then passed through to a higher OSI level to be corrected - ie eventually it becomes an errored link again

Slide 30: Error Detection Techniques
- All layer 1 to 3 error detection systems rely on the transmission of additional information:
  - Packet trailers usually contain 'redundancy check bits': additional data which enables detection of errors. These are often called Cyclic Redundancy Check (CRC) bits; other check-bit codes, checksums etc are also used.
  - Error-detection check-bit systems usually use checksums between 1 and 10% of the data length, ie their transmission overhead is that the message is 1 to 10% longer.
  - Old systems sometimes repeated the entire transaction (often associated with the initial transmission, in anticipation of errors). Sometimes partial repeats were used, eg figures or critical key words (eg "not") were often repeated - a poor approach, as much more bandwidth-efficient mechanisms are available, eg CRC checksums.

Slide 31: Error Correction Techniques
- AKA 'Forward Error Correction' or FEC.
- These rely on a much larger checksum than simple detection:
  - Because they need to correct, not just detect, an error
  - Can correct multi-bit errors in one transmission block
  - Use checksums approx 3 to 20% of the data length
  - Can require high compute power to handle multi-bit errors correctly
- Increasing error rates eventually exceed the capacity to be handled, and errored blocks are passed up to higher layers.
- However, FEC can allow an actual high BER to be apparently reduced, but at a cost in throughput rate:
  - Eg BER reduces from 1 in 50,000 (true) to 1 in 1,000,000 (apparent) at a cost to apparent bandwidth (say 10% error-correction bits, reducing throughput by 10%)
  - "There ain't no such thing as a free meal"

Slide 32: Simple BER Calculations
- Assuming:
  - errors are random, and
  - no other lower-level error-recovery system exists, ie we are working at the lowest level,
- then for a BER of 1 error in 10^6 bits:
  Probability (any single bit in error) = P_B = 10^-6

Slide 33: Probability of an Error in a Packet
- Let the probability of a bit being in error = P_B.
  - Therefore the probability of any single bit being correct is (1 - P_B),
  - and hence the probability of 2 independent bits in a data stream both being correct is (1 - P_B)^2.
- Then for a packet of N bits, the probability of the packet being received error free is the probability that each and every bit is correct:
  P(Good) = (1 - P_B)^N
- Therefore the probability that there is an error in the packet is:
  P(Error) = 1 - P(Good) = 1 - (1 - P_B)^N
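A minimal sketch of these two formulas in Python (function names are my own, not from the course spreadsheets):

```python
def p_good(ber, packet_bits):
    """Probability a packet of `packet_bits` bits arrives error free,
    assuming independent random bit errors at rate `ber`."""
    return (1 - ber) ** packet_bits

def p_error(ber, packet_bits):
    return 1 - p_good(ber, packet_bits)

print(p_good(1e-5, 10_000))   # ~0.905, as in the example on Slide 35
print(p_error(1e-5, 10_000))  # ~0.095
```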

Slide 34: Transmission Impact
- Assume that packets with errors ("errored packets") are detected instantaneously and automatically retransmitted by the sender. (Forget queuing delays for the moment.)
- The average time for a packet to be transmitted and received error free = service time = (packet size) / (line speed).
- But if the packet contains an error, it must be retransmitted, and the retransmission itself may contain an error, etc...
- Therefore errored packets are transmitted at least twice. (Each re-transmission has the same probability of error as the original; fortunately this is very small in practice - less than 1% in most cases.)
- Therefore the average packet service time becomes:
  Service time = (Size / Speed) * (1 / P(Good))

Slide 35: Example
- Consider a 10,000-bit packet (10^4 bits)
- BER = 1 in 100,000 (P_B = 10^-5)
- Line speed 9600 bps
- For this example, we will use N = 10^4 and P_B = 10^-5:
  P(Good) = (1 - 10^-5)^(10^4) = 0.905, and
  P(Error) = 1 - P(Good) = 1 - 0.905 = 0.095
  - ie a 9.5% probability of a packet being received with at least one error in it
- Average time to transmit the packet (the 1/P(Good) factor accounts for retransmissions) = 10,000/9600 * (1/0.905) = 1.151 seconds, giving an effective line speed of 10,000/1.151 = 8688 bps
- For comparison, the nominal service time assuming no errors is 10,000/9600 = 1.042 seconds
- IE the line speed has dropped approx 10% because of the 1/100,000 BER!
- The real impact of BER is many times the expected impact, due to the size of the packet or block of data.

Slide 36: Example Discussions
- The probability of an error depends on the length of the transmission (or packet). For a constant BER:
  - Shorter packets have a lower probability of errors, but the packetisation and turnaround overheads are a higher percentage of the time
  - Longer packets have a greater probability of errors, but a lower percentage of overheads
  - More on the effects of overheads later
- The following graph shows the effects of size and BER. Note the logarithmic scale on the horizontal axis.

Slide 37: Spreadsheet BERCALCS.xls
- Spreadsheet BERCALCS.XLS plots this degradation graph. See 'Tools and Toys' on the web pages.
(Graph: degradation curves for BERs of 1 bit error per 1,000, per 10,000 and per 100,000 bits. Red lines indicate that a 10,000-bit packet with a BER of 1:100,000 has degradation of approx 10%.)

Slide 38: Summary of BER Impacts
- The effect of BER is many times greater than intuitively expected, due to the size of blocks or packets:
  - This is because an error in any bit (or multiple bits) of a packet causes retransmission of the entire packet.
  - There is a probability of further re-transmissions being required if the first re-transmission fails - generally less than 1%, but it depends on BER and size.
- Moral of the story:
  - Short packets are more likely to be transmitted without errors
  - Long packets are more likely to contain errors, causing retransmission and hence degradation of effective line speed
  - A balance is needed between packet size and the probability of errored packets requiring re-transmission
  - May need to consider Forward Error Correction (FEC) in critical applications
- It is generally easier to work with 'effective line speed' rather than % degradation, as effective line speed can be used directly in other calculations.

Slide 39: An Easy Approximation
- A crude but effective approximation, valid where the packet size is less than 1/5 of the mean number of bits between errors (1/BER), is:
  Effective speed = Nominal speed * (1 - (BER * packet size in bits))
- Example (same figures as before):
  - BER = 1 in 100,000 = 1/100,000
  - 10,000-bit packet (ie 1/10 of the mean bits between errors)
  - Line speed 9600 bps
  - For this example, we will use N = 10^4 and P_B = 10^-5
- Effective speed approx = 9600 * (1 - (1/100,000 * 10,000)) = 9600 * (1 - 0.1) = 9600 * 0.9 = 8640 bps (compared to 8688 bps when calculated fully)
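Both the full calculation and the approximation are easy to check in code (a sketch under the same assumptions; helper names are mine):

```python
def effective_speed_exact(speed_bps, ber, packet_bits):
    # Full calculation: each packet takes on average 1/P(Good) transmissions
    return speed_bps * (1 - ber) ** packet_bits

def effective_speed_approx(speed_bps, ber, packet_bits):
    # Valid roughly while packet_bits < (1/ber) / 5
    return speed_bps * (1 - ber * packet_bits)

print(effective_speed_exact(9600, 1e-5, 10_000))   # ~8686 bps (the slide's
                                                   # 8688 rounds P(Good) to 0.905)
print(effective_speed_approx(9600, 1e-5, 10_000))  # 8640 bps
```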

Slide 40: Effect of Other Factors
- The previous discussion assumed instantaneous turnaround of positive or negative acknowledgments, no polling, etc.
- In real life, a finite time is taken to determine whether a packet is errored and to send back the negative acknowledgement which triggers retransmission.
- As well, the smaller the packet, the higher the percentage of packetisation overheads.
- Simulation may be the best way to form an estimate of complex situations with polling, or with a BER which varies significantly.
- James Martin, Chapter 35, has some interesting graphs (the figures are obsolete, but the concepts are valid).

Slide 41: Real Impacts of Errors
- The real performance characteristic of any link, for some particular set of nominal line speed, turnaround time and BER, peaks at an intermediate block size.
- IE for every situation there is some optimal packet size, which is the best compromise between BER, packet size, and turnaround time.
(Graph: effective line speed versus packet or block size.)

Slide 42: Optimal Packet or Block Size
- Spreadsheet Errorbps.xls can be used to find an optimal packet size by trial-and-error approaches. Use this to get an approximation of the situation for your system, and then experiment on-line.
- Simulation techniques can also be used.
- Another spreadsheet which can be used is Packets4.xls.

Slide 43: Lecture 8 - Reliability, Availability & Maintainability (RAM)
Your client needs a reliable network - but how reliable, and at what cost?

Slide 44: References
- Web references: search for "MTBF".
- Tool vendors:
  - www.t-cubed.com - contains good information and demo tools
  - www.relexsoftware.com - good site
- MIL-HDBK-217, the USA military standard for this topic (search the web for it).

Slide 45: RAM Issues
- RAM is concerned with failures and recovery of:
  - lines or circuits;
  - node equipment (switches, routers etc); and
  - terminal equipment
  ie failures and recovery of all aspects of the network.
- This equipment and these circuits may fail due to:
  - environmental failures, eg electricity supply failure, floods, air-conditioning failures etc;
  - hardware errors;
  - software errors; and
  - operator errors.
- Specifically excluded are:
  - user errors; and
  - deliberate attacks, eg data flooding, vandalism, and deliberate crime.

Slide 46: Network RAM
- RAM is a through-life management issue, not simply a maintenance issue.
- RAM engineering:
  - starts with the user specification;
  - needs to be included in system design;
  - continues through to end-of-life; and
  - covers decommissioning and disposal of the system.
- Some points:
  - The term 'Logistic Support' is often used in this area
  - 'Maintenance' is a major activity within RAM engineering
  - Proper 'Network Management' is crucial

Slide 47: Aspects for 'Good' RAM?
What are the aspects leading to good levels of Reliability, Availability, and Maintainability of a 'system'?

Slide 48: Aspects of System-level RAM
- Avoiding failures - a DESIGN issue:
  - Reliable components/sub-systems, eg ensure that each major subsystem or component is adequate
- Minimise the impact of failures:
  - Use of maintainable/serviceable systems, eg ensure that failed or faulty equipment can be rapidly diagnosed and repaired or replaced
  - Ensure spare parts are available in a reasonable time frame
  - Maintenance personnel: well trained, available when needed
  - Well-planned procedures, eg:
    - load shedding when the system is under stress, eg partial failures
    - system restoration - usually neglected

Slide 49: Avoid Failures (1)
- Use reliable components:
  - Components with high "Mean Time Between Failures" (MTBF) figures
  - Usually 'simple' components have very high MTBF
  - Various grades of components are available: Domestic, Commercial, Light Industrial, Heavy Industrial, Mil-Spec (military specifications)
- Operate systems and equipment within the environment assumed by the design, especially:
  - temperature, humidity and vibration limits, and
  - planned maintenance schedules

Slide 50: Reliable Components
- Which components can be trusted to be reliable?
  - Proven brands, eg IBM, Hewlett-Packard, Cisco etc: safer to use than 'no-name' brands, simply because they are known (eg good reputation) and can be traced
  - Reputable brands often publish MTBF figures for their components and systems
- In general, the more trusted components are also more expensive to buy; savings occur because they are less trouble operationally.

Slide 51: Graduated Reliability
(Table of component grades and relative reliability indicators, not reproduced here. * The relative reliability indicators in the table are subjective.)

Slide 52: Software Reliability
- It is hard enough to define hardware failures and reliability; defining software reliability is almost impossible.
- There is a lack of good measurements of software characteristics: the only repeatable universal metrics seem to be Complexity and Function Points - and these are not good!
- "Proven by use" is a reasonable guide:
  - Newly developed software is very 'buggy'
  - Software which has been in commercial use by many, many sites for a considerable time is probably 'fit for purpose'
- You can get 'formally proven' software, but the cost is extreme ($1500-$2000 per SLOC).

Slide 53: Minimising the Impact of Failures
- Network failures will occur - it is a matter of 'when' failures will occur, rather than 'if'.
- Given that we cannot totally avoid failures, we need to minimise their impact.
- Areas generally considered:
  - Application design
  - Network design
  - Network management
  - Network operations
- We will look at these areas one by one.

Slide 54: Application Design
- Applications can be designed to be sympathetic to the network, eg:
  - Edit input commands and data for validity
  - Hold most/all data locally, with synchronisation to master databases taking place occasionally in 'offline' non-critical mode
  - Even if data must be held remotely, application designs can often be made less sensitive to network failures by holding critical data locally and non-critical data remotely
  - Applications can assist by staggering massive file transfers (eg database backups/synchronisations) so that these are spread over time
  - Applications may be able to prioritise data transfer requirements (especially in emergency situations)

Slide 55: Network Design
- Avoid single points of failure:
  - Communications work generally causes concentration of traffic into node switches and inter-node circuits
  - Duplication of equipment and circuits is very expensive
- Sensible use of redundant equipment:
  - Much redundant equipment is not appropriate today, eg power supply redundancy based on the gut feel of 30 years ago
  - Hot / warm / cold standby equipment
  - Redundant equipment may be very expensive and create more problems than it fixes
  - Consider partial duplication, ie identify the most critical aspects and protect these by installing redundant equipment
  - More on this later

Slide 56: Network Management
- Planning, planning and more planning is the key!
- Have plans developed, approved, printed and distributed, and people trained in their application, before incidents occur:
  - It is too late to do this when an incident occurs
  - The situation becomes too confused to make sensible decisions
  - MOST CRITICAL issue: have someone delegated to take charge of incidents
- Plans needed include:
  - Load-shedding plans, for partial failures
  - A network restoration plan with priorities for restoration after a failure: who/what will be restored first, then second, ... down to last; define who has the authority to make decisions when an incident occurs
  - Maintenance plans and procedures: fast response to maintenance call-outs for failures; a ready supply of spare parts or alternative equipment/services

Slide 57: Network Operations
- Back-up of software and critical data/parameters
- Good records of design and changes/modifications, including parameter changes to equipment
- Active oversight and inspection of equipment and traffic
- Active supervision of external maintenance personnel
- Record keeping of:
  - fault reports and fault corrective actions
  - system and network reconfiguration actions
  - visitors to computer rooms, network hubs etc
  - external events and incidents, even if the network has not been directly affected: power supply changes/transients/outages, air-conditioning outages and maintenance activity

Slide 58: Network Operations (2)
- Operations should monitor and perform 'trend analysis' on:
  - normal operations (so that 'abnormal' becomes known)
  - faults - type and frequency
  - maintenance call response times
  - maintenance service times
- Resources needed:
  - Trained operations staff who can diagnose problems, know who to call (and have the authority to do so), and can oversee the maintenance technicians
  - Good operational diagnosis and management tools
  - Trained maintainers from the maintenance organisation
  - Good documentation and guides
  - Good procedures - see the next slide

Slide 59: Procedures
- Good procedures are needed for:
  - Normal operations
  - Maintenance call-outs
  - Maintenance activity
  - Recovery and restoration of services
  - Configuration management (change control)
  - Auditing the system configuration
  - Adding new starters
  - Modifying privileges of existing personnel
  - Cleaning up when someone leaves, either in benign circumstances (easy) or in difficult circumstances (ie someone leaves in disgrace)

Slide 60: Issues for Consideration
- Mirror disks and RAID configurations only address hardware errors.
- Auto-restart after power failure is not necessarily feasible for large systems.
- Software and application restart times may be 30 to 120+ minutes.
- Hot standby is expensive to implement: checking the heartbeat of 'the other systems' and maintaining synchronisation etc of multiple systems is difficult.
- How to define failure:
  - Difficult to define in large systems
  - EG if 'N' terminals are connected and one fails, is this considered 'system failure'? Probably if N = 1, BUT not if N = 10,000
- The issue really comes down to balancing the cost impact of an outage against the cost of buying and maintaining additional equipment.

Slide 61: Early Space Shuttle Computers (1981?)
- A computer was required for controlling re-entry into the atmosphere:
  - But a single computer may fail.
  - So, install two computers - but if they disagree, which is correct?
  - Install three computers, with 'voting' logic (complexity?).
  - But three computers are three times as likely to fail as one computer, therefore install a fourth to track the operational computers (yet more complexity).
  - This addressed hardware failures only, therefore a fifth computer was fitted, with software developed to the same specifications by another company; #5 was to be switched online manually if needed.
- The first launch was aborted 20 minutes before takeoff because the five computers did not synchronise (there was a 1 in 32 chance of not synchronising; this showed up once in many tests but was forgotten).

Slide 62: Analytical Approaches
Some simple calculations. These will help you with the concepts, and enable you to handle simple situations. Call in an expert for large mission-critical networks.

Slide 63: Terminology (not so common)
- MTBF, MTTR and Availability are the common figures used.
- MTBF is Mean Time Between Failures:
  - The mathematical (mythical?) average time between failures
  - Typical range is 1,000 hours (complex systems) to 300,000+ hours (a simple system or component built to MIL-STD requirements)
  - Generally source MTBF figures from vendors
- MTTR is the Mean Time To fix a failure. 'Fix' has many definitions: Replace, Repair, or Restore.
  - The terms 'Repair' and 'Replace' are used by vendors: the time taken to 'fix' the component once the technician and spare parts are available on site.
  - The term 'Restore' is most meaningful to the user: the time to restore the service to an operational state.
- Availability is the percentage of time that a component or system is operationally usable, ie (time available for operations) / (total time).

Slide 64: Failure Rates over the Life of Equipment
- Most components exhibit a 'bathtub' characteristic curve: a high failure rate initially, which then settles down, until age catches up near the end of their life.
- 'Burning in' refers to the first few hours of operation, until the relatively flat section of the curve is reached. Reputable manufacturers usually perform 'burn-in' as part of testing.
(Graph: failures per unit time versus component age in years of operation.)

Slide 65: Failure Rates (2)
- Over the relatively flat portion of the bathtub curve, failures are random, and generally show a Poisson or exponential characteristic.
- If the average time between failures is M hours (usually called Mean Time Between Failures or MTBF), then the probability that the component will operate for a period greater than time t hours is:
  P(>t) = e^(-t/M)

Slide 66: Probability of a Component Operating
(Graph: P(>t) = e^(-t/M) plotted against t/M.)

Slide 67: Probability of a Component Operating (2)
- Note from the graph:
  - The probability of operating longer than M is about 38%
  - This means that about 62% of components fail within the period M, ie almost 2/3 fail within the MTBF
  - Most components fail before the MTBF figure is reached
- Approximately:
  - 62% fail before MTBF is reached
  - 38% of components exceed MTBF before they fail
  - 13.5% exceed 2*MTBF, ie go on for at least twice the MTBF
  - 5% exceed 3*MTBF - owners of these are very happy people!
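These percentages follow directly from P(>t) = e^(-t/M); a quick check in Python (a sketch, not course material):

```python
import math

def p_survive(t_hours, mtbf_hours):
    """P(component still operating after t) = e^(-t/MTBF)."""
    return math.exp(-t_hours / mtbf_hours)

for multiple in (1, 2, 3):
    # Survival past 1x, 2x and 3x MTBF: ~0.368, ~0.135, ~0.050
    print(multiple, p_survive(multiple, 1))
```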

Slide 68: MTBF and MTTR Usage
- MTBF is Mean Time Between Failures:
  - The mathematical (mythical?) average time between failures
  - Typical range is 1,000 hours (complex systems) to 300,000+ hours (a simple system or component built to MIL-STD requirements)
  - Generally source MTBF figures from vendors
  - MTBF is commonly quoted, even if the term is not well understood
- MTTR is the Mean Time To fix a failure. 'Fix' has many definitions: Replace, Repair, or Restore.
  - The terms 'Repair' and 'Replace' are used by vendors: the time taken to 'fix' the component once the technician and spare parts are available on site.
  - The term 'Restore' is more meaningful to the user: the time to restore the service to an operational state.

Slide 69: MTTR (where R = Restore service)
- The time to restore service has many components:

  Action                                          Typical times
  Detect fault and lodge call-out                 10 - 30 minutes
  Response time by technician or service org      120 - 240 minutes (2 to 4 hours)
  Repair time by technician                       30 minutes
  Reboot system and application restart           15 - 120 minutes
  Total time to 'fix'                             3 to 7+ hours

- Generally take MTTR figures as the vendor's MTTR (repair time), plus the other delays.

Slide 70: Availability
- Availability is defined as the probability of a system (or component etc) being available when needed, usually expressed as a percentage of total time.
- Mathematically:
  Total time = MTBF + MTTR
  Availability = (time operational) / (total time)
               = MTBF / (MTBF + MTTR)
               = 1 - (MTTR / (MTBF + MTTR))
- EG MTBF of 1,000 hours, MTTR of 1 hour:
  Availability = 1,000 / (1,000 + 1) = 99.9% (approx)
- Availability is always less than 100%.
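A one-line helper for this definition (my own naming, consistent with the slide's formula):

```python
def availability(mtbf_hours, mttr_hours):
    # Fraction of total time the component is operational
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(availability(1000, 1))  # 0.999000999... ie ~99.9%
```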

Slide 71: System Availability
- Components in a system may be grouped as:
  - Logically serial (also known as cascade), where all components must be operational or the group has failed. EG a car, which requires engine, gearbox and wheels. Group availability = product of the independent component availabilities.
  - Logically parallel, where several components are operated 'in parallel' but not all are required to be operational for the group to be operational. EG a diesel generator for when the mains power fails. Group availability is found via the (1 - unavailability) approach.
  - Some combination of both serial and parallel. EG any complex machine exhibits this arrangement.
- The following slides cover these in more detail.

Slide 72: Logically Serial Components (1)
- All components must be operational for the system to work.
- Diagram: three components A, B, C in series, where all are needed, with availabilities A = 0.9, B = 0.7, C = 0.8.
- Availability as a group requires that all are operational:
  Avail(Group) = product of the components' availabilities
               = Avail(A) * Avail(B) * Avail(C)
               = 0.9 * 0.7 * 0.8 = 0.504 = 50.4%
- Simple formula for a group of N identical components:
  Avail(group of identical components) = Avail(one component)^N
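A sketch of the serial rule (hypothetical helper name):

```python
from math import prod

def serial_availability(avails):
    # All components must be up: multiply the availabilities
    return prod(avails)

print(serial_availability([0.9, 0.7, 0.8]))  # 0.504
print(0.99 ** 10)  # ten identical 99% components in series: ~0.904
```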

Slide 73: Logically Serial Components (2)
- Implications of logically serial components:
  - Availability(Group) must be less than the availability of the least reliable component
  - The more units involved, the lower the probability that the group is operational: with too many components the group becomes too unreliable to consider, unless each component is extremely reliable
  - Complex systems are inherently less reliable than simple systems, unless work is undertaken to improve this situation

Slide 74: Logically Parallel Components (1)
- Only some components from the group are required to be operational.
- Avail(Group) depends on which components must be operational.
- Consider three parallel components with availabilities P: A = 0.9, B = 0.7, C = 0.8, and hence unavailabilities Q: A = (1-0.9) = 0.1, B = (1-0.7) = 0.3, C = (1-0.8) = 0.2.
- If any one of A or B or C suffices, then:
  Avail(Group) = Avail(A) or Avail(B) or Avail(C)
               = 1 - (combined unavailability of the group)
               = 1 - ((1-0.9) * (1-0.7) * (1-0.8))
               = 1 - (0.1 * 0.3 * 0.2) = 1 - 0.006 = 0.994

Slide 75: Logically Parallel Components (2)
- Implications of logically parallel components, where not all are required for the group to be operational:
  - Availability(Group) is greater than the availability of the most reliable component
  - The more components 'in parallel' which are allowed to be in a failed state without declaring the group failed, the higher the probability that the group is operational: eg a 'one needed out of three' arrangement is better than a 'two out of three' (but more expensive)
  - Many configurations are possible
  - Calculations can be laborious!

Slide 76: Identical Components in Parallel (1)
- Consider the case of three identical components A, B, C, each with availability P = 0.9 and unavailability Q = (1-0.9) = 0.1.
- Avail(Group) depends on which components must be operational.
- If only one of A or B or C is required, then:
  Avail(Group) = Avail(A) or Avail(B) or Avail(C)
               = 1 - (combined unavailability of the group)
               = 1 - (0.1 * 0.1 * 0.1) = 1 - 0.001 = 0.999
  OR (1 - Q^3) = (1 - 0.1^3) = 1 - 0.001 = 0.999

Slide 77: Identical Components in Parallel (2)
- The following two slides show a set of diagrams of common configurations of multiple components in parallel, and the formula corresponding to each when the components are identical.
- Space constraints prevent all terms in some formulas from being shown, eg "+ (3 terms)". In these cases, look at the preceding terms, determine the pattern of P and Q usage, then repeat it for the missing terms.
- Spreadsheet Reliabli.xls (from the web page) calculates these formulas.

Slide 78: Identical Components in Parallel (3)
(Diagrams 0 to 7 of the configurations listed on the next slide: basic, 1 out of 2, 1 out of 3, 2 out of 3, 3 out of 4, 4 out of 5, 3 out of 5 and 2 out of 4, with redundancy levels from 25% to 200%.)

Slide 79: Identical Components in Parallel (4)

  #   Configuration                  Avail(Group)
  0   Basic concept                  P1
  1   1 out of 2, 100% redundancy    1 - Q1Q2
  2   1 out of 3, 200% redundancy    1 - Q1Q2Q3
  3   2 out of 3, 50% redundancy     P1P2P3 + P1P2Q3 + P1Q2P3 + Q1P2P3
  4   3 out of 4, 33% redundancy     P1P2P3P4 + P1P2P3Q4 + P1P2Q3P4 + P1Q2P3P4 + Q1P2P3P4
  5   4 out of 5, 25% redundancy     P1P2P3P4P5 + P1P2P3P4Q5 + P1P2P3Q4P5 + (2 terms) + Q1P2P3P4P5
  6   3 out of 5, 66% redundancy     P1P2P3P4P5 + P1P2P3P4Q5 + (3 terms) + Q1P2P3P4P5 + P1P2P3Q4Q5 + (8 terms) + Q1Q2P3P4P5
  7   2 out of 4, 100% redundancy    P1P2P3P4 + P1P2P3Q4 + (2 terms) + Q1P2P3P4 + P1P2Q3Q4 + (4 terms) + Q1Q2P3P4

NOTE: Spreadsheet Reliabli.xls on the web page calculates these.
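For identical components these term-by-term enumerations collapse to a binomial sum; a generic sketch (my own helper, not the Reliabli.xls spreadsheet):

```python
from math import comb

def k_out_of_n_availability(k, n, p):
    """Probability that at least k of n identical components
    (each with availability p) are operational."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(k_out_of_n_availability(1, 3, 0.9))  # 0.999 (1 out of 3, as on Slide 76)
print(k_out_of_n_availability(2, 3, 0.9))  # 0.972 (2 out of 3)
```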

Slide 80: Spreadsheet Reliabli.xls
(Screenshot of the spreadsheet.) Note: the calculations for 'serial components' are lower down the sheet.

Slide 81: Hot and Cold Standby
- The spreadsheet shows hot-standby and cold-standby MTBF figures for groups in various configurations.
- The MTBF and MTTR values are set to show the effects; real systems have much higher MTBF than this.
  - Hot standby assumes all units are operating, and hence liable to fail even when they are not being used for operations.
  - Cold standby assumes units are not operating until required, and hence are not liable to fail when not operational. Cold standby also assumes instant changeover.

Slide 82: Typical Availability Diagrams (1)
- Real systems have both serial and parallel groupings. Eg a small network requires a server, a router and 4-out-of-5 terminals (forget cables, power, buildings, lights etc).
- The terminals are a group where only four out of five are required for the system to be considered 'operational'.
- The problem is difficult, but it can be simplified.
(Diagram: Server (1/1) - Router (1/1) - Terminals (4/5).)

Slide 83: Typical Availability Diagrams (2)
- The network diagram needs to be simplified by converting the five components "Terminals" to a single "Terminal Group", ie solve the terminals (four out of five) problem first; the Avail(Network) problem then becomes:
  Network = Server - Router - Terminals (as a group)
- The problem is now a simple serial system with three components. From this we can calculate, at the system level (ie for the whole network):
  - Availability,
  - MTBF, and
  - MTTR.

Slide 84: Availability Example
- The network requires a server, a router and 4-out-of-5 terminals to be operational:

                             Server        Router        Terminals (group)
  Given MTBF                 600 hours     2000 hours    500 hours
  Given MTTR                 2 hours       5 hours       1 hour
  Calculated availability    A(s) = 0.9967 A(r) = 0.9975 A(t) = 0.9980

- Availability of the group (network) = A(s) * A(r) * A(t) = 0.9967 * 0.9975 * 0.9980 = 0.9922
- IE the availability of the network is less than the availability of any of the component items; the more items, the lower the availability.

Slide 85: MTBF/MTTR for the System "Network"
- Select a convenient period over which to calculate. (The lowest common multiple is good; in this case LCM(600, 2000, 500) = 6,000.)
- Let us say 6,000 hours. In this time the 'network' will require approximately:
  - 6,000 / 600 = 10 server repairs or services @ 2 hours each = 20 hours
  - 6,000 / 2,000 = 3 router repairs/services @ 5 hours each = 15 hours
  - 6,000 / 500 = 12 terminal services @ 1 hour each = 12 hours
- Assuming independent failures, total outages = 25 outages in 6,000 hours, taking up (20 + 15 + 12) = 47 hours:
  MTBF(Network) = 6,000 / 25 = 240 hours
  MTTR(Network) = 47 / 25 = 1.88 hours
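The same aggregation can be sketched for any serial system (hypothetical code; like the slide, it assumes independent failures and whole-number failure counts over the chosen period):

```python
from math import lcm

# (MTBF hours, MTTR hours) per component, from Slide 84
components = {"server": (600, 2), "router": (2000, 5), "terminals": (500, 1)}

period = lcm(*(mtbf for mtbf, _ in components.values()))  # 6000 hours
failures = sum(period // mtbf for mtbf, _ in components.values())  # 25
downtime = sum((period // mtbf) * mttr
               for mtbf, mttr in components.values())     # 47 hours

print(period / failures)    # system MTBF: 240.0 hours
print(downtime / failures)  # system MTTR: 1.88 hours
```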

Slide 86: Availability and Redundant Equipment
- With redundant equipment installed, there are several possible states:
  - Fully operational: all equipment, including redundant equipment, is ready
  - Operating 'at risk' (ie some equipment has failed, but the full load is being carried), eg a car just after replacing a flat tyre with the spare wheel, or a network needing 4-of-5 terminals when one has failed and is not yet fixed
  - Degraded mode: only partial, not full, operations being conducted due to various equipment outages, eg say several terminals failed
  - Not operational: the system is degraded so badly that it is defined as 'not operational'
- Need to determine availability for the group of equipment, and from that an aggregate MTBF (in similar fashion to the example).

Slide 87: Summary
- The general term for this topic is "Integrated Logistics Support" (ILS).
- ILS is a specialised branch of engineering.
- This lecture covered only some of the basic concepts; call in a specialist for large or mission-critical systems.
- Having a system with good RAM is more than calculations: it requires through-life management of design, maintenance, operations and change control (configuration management).

