FAULT MANAGEMENT
Definition Fault management can be defined as the real-time or near-real-time monitoring of the elements of a computer communications network, with attendant resolution of related problems. This function addresses both the hardware and software activities that are part of system operation. The monitored elements are both physical and functional, and can include connectivity errors, equipment failures, performance bottlenecks, and performance discontinuities. Fault management may obtain some of its inputs from other functions, such as configuration management, but inputs are evaluated the same way regardless of their source.
Description The fault management function interfaces with the various system components sometimes directly, as with network elements, and sometimes indirectly via subnetwork managers, such as LAN managers. In the OSI scheme for network management, the fault management function, like the other management functions, interfaces with the network through the Systems Management Application Entity (SMAE) located at each node, via the Common Management Information Protocol (CMIP) gateway. A node may be anything from a single piece of routing equipment to a complex array of data processors. The key feature is that the node must somehow terminate the incoming data channels and reconnect them at the outgoing side.
Fig 2-2 illustrates the basic issues involved in signal detection. The crosshatched area labeled P(D) represents the probability of detection of a fault, given that a fault has actually occurred. Sometimes a fault is declared when, in fact, none exists; this is represented as the probability of a false alarm, P(FA). The curve labeled noise includes all non-fault or false-alarm reports. In some cases, signals that pass the criteria for declaration as actual fault events are, in fact, not faults at all. Such events are false alarms. Where the detection threshold is placed determines the likelihood of experiencing more or fewer such false alarms. As the detection threshold is shifted to the left, there is a higher probability of experiencing false alarms; the positive aspect of moving the threshold to the left is that there is also a higher probability of detecting all actual faults. The trick is to separate the noise and event curves as much as possible in order to maximize detection capability while, at the same time, minimizing the risk of false alarms.
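The threshold tradeoff described above can be illustrated with a small Monte Carlo sketch. The Gaussian models for the noise and fault distributions, and the specific means and threshold values, are illustrative assumptions and are not taken from the figure:

```python
import random

# Hypothetical model of Fig 2-2: "noise" readings and "fault" readings
# drawn from two overlapping Gaussians; all parameters are assumptions.
random.seed(0)

NOISE_MEAN, FAULT_MEAN, SIGMA = 0.0, 3.0, 1.0
noise = [random.gauss(NOISE_MEAN, SIGMA) for _ in range(100_000)]
faults = [random.gauss(FAULT_MEAN, SIGMA) for _ in range(100_000)]

def rates(threshold):
    """Return (P(D), P(FA)) for a given detection threshold."""
    p_d = sum(x > threshold for x in faults) / len(faults)
    p_fa = sum(x > threshold for x in noise) / len(noise)
    return p_d, p_fa

for t in (1.0, 1.5, 2.0):
    p_d, p_fa = rates(t)
    print(f"threshold={t:.1f}  P(D)={p_d:.3f}  P(FA)={p_fa:.3f}")
```

Running this shows the tradeoff directly: lowering the threshold (moving it left) raises both P(D) and P(FA); raising it lowers both.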
Fig 2-3 shows the conceptual causes of ambiguity. In some systems, status information may come over separate channels, or may be interleaved with the operational data. In Case A, the accumulation of information about the status of the system elements is spread over n elements, so confusion may arise as to which elements are at fault when all status reports are processed. Even though status reports are usually identified with specific equipment, a fault in one element may ripple through the data stream to affect equipment in other parts of the system. In Case B, possible simultaneous failures, or single failures that affect all channels at the same time, may confuse the deciphering of status messages. Ambiguity may arise where there is decoupling between the data traffic configuration and the incoming status messages.
Fault management must satisfy one or more of the following ten tasks:
1. Spontaneous Error Reporting - An SMAE, at each network management nodal interface of the network, can send and receive timely error reports between itself and another SMAE.
2. Cumulative Error Gathering - A designated SMAE can gather error information on behalf of another SMAE within the system. The designated SMAE can poll error counters within other SMAEs on a periodic basis and can reset each counter as it is polled.
3. Error Threshold Alarm - Any SMAE can be configured to send threshold reports to another SMAE when previously set error thresholds are crossed, and current thresholds can be determined. Finally, the counters used in threshold comparisons can be reset.
4. Event Logging - Any SMAE can send all event reports to another SMAE, providing for the initialization and termination of event logging.
5. Confidence and Diagnostic Testing - Any SMAE may request any other SMAE to perform testing and to report the results back to the requestor.
6. Repair Action Reporting - An SMAE may request from another SMAE the status of any resource that has previously been reported as faulty.
7. Trace Communication Path - Cooperating SMAEs can test interconnecting communications paths and report results back to a requesting SMAE.
8. Resource Reinitialization - An SMAE can request another SMAE to set the initial state(s) of some resource to known parameter(s).
9. Event Tracing - One SMAE can request another SMAE to start or stop logging specific events locally, and to report back the status of this exercise.
10. Fault Management Information Gathering - This facility provides for one SMAE to collect, dump, and analyze local information so as to support other SMAEs making such requests.
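The counter mechanics behind tasks 2 and 3 (cumulative error gathering and threshold alarms) can be sketched as follows. The class and method names here are hypothetical illustrations, not part of any CMIP or SMAE interface:

```python
# Sketch of an agent-side error counter: a managing SMAE would poll it
# periodically (task 2, resetting on each poll) and be alerted when the
# preset threshold is crossed (task 3). All names are assumptions.

class ErrorCounter:
    def __init__(self, threshold):
        self.threshold = threshold  # previously set error threshold
        self.count = 0

    def record_error(self):
        """Called once per detected error event."""
        self.count += 1

    def threshold_exceeded(self):
        """True when a threshold report should be sent (task 3)."""
        return self.count >= self.threshold

    def poll_and_reset(self):
        """Cumulative error gathering (task 2): read, then reset."""
        count, self.count = self.count, 0
        return count

counter = ErrorCounter(threshold=3)
for _ in range(4):
    counter.record_error()
print("alarm" if counter.threshold_exceeded() else "ok")
print(counter.poll_and_reset())  # reports accumulated count, then resets
print(counter.count)
```

The reset-on-poll design mirrors task 2's wording: the designated SMAE resets each counter as it is polled, so every poll reports only errors accumulated since the previous poll.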
Faults and Failure Mechanisms
1. Single Stuck-at Faults - These faults occur due to an inability to make a transition between logic levels in some circuit. This is usually caused by alternate failures within the circuitry or some type of deposition of material during the manufacturing process.
2. Multiple Stuck-at Faults - These faults may be the result of manufacturing defects caused by over-etching, an incorrect PC layer, etc. For non-interactive faults in this category, tests can be devised to expose each separately, while interactive faults may not be revealed with standard tests.
3. Bridging Faults - These faults can be thought of as variations of the single and multiple stuck-at faults. Foreign material touching two or more component parts or circuit traces is the most common form of bridging fault. A not-so-elegant approach to such problems has been to implement trace cuts as a fix.
4. Intermittent Faults - Any fault that occurs and then disappears without intervening corrective action is an intermittent fault. Such faults are usually discovered by examining the physical implementation of the unit in question. Vibration is a likely cause of intermittent faults, as is thermal stress.
5. Memory Faults - Chip miniaturization, with its constraints upon separations, is a major cause of such faults. The tighter and tighter constraints brought about by the transition from LSI to VLSI to VHSIC technology, where the separation between traces is 0.5 microns or less, cause such faults to become more common.
6. Time-Dependent Faults - Such faults are usually linked to physical rather than electronic roots. Varying memory refresh times of a display monitor is an example of a faulty condition that can be attributed to physical conditions.