Presentation on theme: "Measuring System Behaviour in the Field. Brendan Murphy, Microsoft Research, Cambridge."— Presentation transcript:

1 Measuring System Behaviour in the Field. Brendan Murphy, Microsoft Research, Cambridge. BMurphy@Microsoft.com

2 Agenda. History of monitoring systems in the field. Characterizing the behaviour of individual systems. Characterizing the behaviour of multiple systems and applications. Problems and opportunities.

3 Background to field measurements. Computer manufacturers (mid-80s): hardware failure rates improving; a gap between theoretical and actual reliability; software reliability becoming a bigger driver of overall system reliability; a changing customer profile, and therefore changing expectations.

4 Initial observations for analysing system behaviour. Hardware reliability could be measured. Software reliability was more difficult to measure: the crash rate could be measured but was hard to interpret (was the crash due to a defect or an operator error?), and the software life cycle impacts its failure rate. Operator errors started to become more important; see Jim Gray's paper from the early 90s. Still unclear how to use metrics as a measurement of "goodness".

5 Does the following represent goodness? Failure breakdown, by service company, of systems in Microsoft.

6 Measuring system reliability: the need for filtering. Reliability calculations are distorted by clusters of crashes (NT data collected from dot-com sites).
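To illustrate the kind of filter the slide calls for, here is a minimal Python sketch (mine, not from the talk) that collapses a burst of crashes into a single failure event; the 24-hour window and the list-of-timestamps input are assumptions.

```python
from datetime import datetime, timedelta

def filter_crash_clusters(crash_times, window=timedelta(hours=24)):
    """Collapse bursts of crashes into single failure events.

    A run of crashes in which each crash falls within `window` of the
    previous one is counted as one underlying failure, so a single bad
    day does not dominate the reliability estimate.
    """
    events = []
    for t in sorted(crash_times):
        if not events or t - events[-1][-1] > window:
            events.append([t])      # start a new failure event
        else:
            events[-1].append(t)    # same cluster of crashes
    return events

# Example: five raw crashes collapse into two failure events.
crashes = [datetime(1999, 3, 1, 2), datetime(1999, 3, 1, 5),
           datetime(1999, 3, 1, 9), datetime(1999, 3, 20, 14),
           datetime(1999, 3, 20, 15)]
print(len(filter_crash_clusters(crashes)))  # -> 2
```

Without such a filter, the three crashes on 1 March would count as three independent failures and dominate the crash rate, even though they likely share one underlying cause.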

7 Reliability measurement: which events to measure. System crashes/panics/bluescreens. Good points: each event represents a defect. Bad points: does not include hangs, and is more a measure of fault management. System reboots. Good points: captures all defects. Bad points: captures all system management activity, and can only be applied to servers.

8 The definition of a system event is operating-system dependent. A crash is an action taken by the system fault management that shuts down the system gracefully and writes the cause to a dump file and an event log. Note that the event logs for UNIX & NT are derived from the VMS event log. A system outage is captured by a reboot event occurring in the event log. A hang can sometimes be recognized by a lack of outage information.

9 System availability measurement. Using data in the event log: if a shutdown and a reboot event are both captured, availability is easy to calculate; if only reboot events exist, use timestamps, or use the last event logged prior to the shutdown. Tools to monitor availability: pinging the system (dependent upon network availability), or a background process continually logging timestamps.
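To make the easy case concrete, the following is a minimal sketch (an assumption-laden illustration, not the talk's tooling) that computes availability from paired shutdown/reboot records; where only reboot events exist, the slide's fallback, the timestamp of the last event logged before the outage, would stand in for the shutdown time.

```python
from datetime import datetime

def availability(events, period_start, period_end):
    """Fraction of the period the system was up, assuming the system
    is down between each 'shutdown' event and the next 'reboot'.
    `events` is a list of (timestamp, kind) tuples."""
    downtime = 0.0
    shutdown_at = None
    for ts, kind in sorted(events):
        if kind == "shutdown":
            shutdown_at = ts
        elif kind == "reboot" and shutdown_at is not None:
            downtime += (ts - shutdown_at).total_seconds()
            shutdown_at = None   # outage closed by this reboot
    total = (period_end - period_start).total_seconds()
    return 1.0 - downtime / total

events = [(datetime(1999, 6, 1, 22, 0), "shutdown"),
          (datetime(1999, 6, 1, 23, 30), "reboot")]
print(availability(events, datetime(1999, 6, 1), datetime(1999, 6, 2)))
# -> 0.9375 (1.5 hours down out of 24)
```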

10 Interpreting system availability: VAX 6000. The problems start. This level of availability implies the systems are unlikely to be in a production environment! ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

11 Measuring availability. Ignore workstations/clients. "Intelligently" filter out long outages. "Intelligently" filter out non-production systems. Differentiate between system maintenance outages and those due to 'reliability': capture the cause of each outage from the system managers (beware: they do not always tell the truth!), and assume usage based on the time of the event.

12 Assuming usage based on day of event. [Chart: distribution of system outages and system crashes by day of week, Monday through Sunday (0-20%), for VAX 6000 systems and Windows 2000 systems.] ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

13 Distribution of system outages based on time of event (measured Monday-Friday). [Charts: Windows 2000 and VAX 6000.] ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

14 Generic rules for measuring system behaviour. All reliability analysis requires a filter. Workstation/client behaviour can only be characterized by its crash rate. The reliability of servers can be measured using either the crash rate or the system outage rate. Availability is most accurately measured during peak usage periods. Site availability is calculated using the median availability, or by removing outliers.
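The last rule lends itself to a short sketch. Both summaries below are illustrative: the median is as stated on the slide, while the specific outlier cutoff (1.5 interquartile ranges below the lower quartile) is my assumption, since the slide does not say how outliers are removed.

```python
import statistics

def site_availability(per_system_avail, use_median=True):
    """Summarize a site's availability robustly: either take the
    median across systems, or drop low outliers (here: anything more
    than 1.5 IQRs below the lower quartile, an assumed cutoff) and
    average the rest."""
    xs = sorted(per_system_avail)
    if use_median:
        return statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    cutoff = q1 - 1.5 * (q3 - q1)
    return statistics.fmean([x for x in xs if x >= cutoff])

avail = [0.999, 0.998, 0.997, 0.997, 0.996,
         0.995, 0.995, 0.62]                 # one pathological system
print(site_availability(avail))                    # median -> 0.9965
print(site_availability(avail, use_median=False))  # 0.62 dropped
```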

15 The hard part: interpretation! Availability and reliability are seasonal. The reliability of large servers is affected by their life cycle. Software reliability is affected by its time since installation and by its life cycle. Comparisons between different products are very difficult.

16 Reliability of servers over their life cycle. ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

17 Impact of installation on (VMS) operating system behaviour. ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

18 Impact of the (VMS) software life cycle on its reliability. Operating system behaviour improves with age? Few new patches are produced six months after the release of any version of the operating system. ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

19 Comparing (VMS) system reliability using data collected from the same point in the life cycle. Only includes systems that installed the OS within six months of release. ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

20 Overall rules for characterizing system behaviour. Hardware failure rates can be fully characterized. Software behaviour is characterized by its reliability, availability, and instability. As software matures, crash rates are no longer the definitive measure of reliability. User perception is not necessarily reality. Comparisons between versions of software can be performed, with care. Comparisons between products are difficult, requiring knowledge of the products and their usage characteristics.

21 Monitoring applications running across distributed systems. Three perspectives on system behaviour: application behaviour on individual systems; total application behaviour from the system manager's perspective; total application behaviour from the user's perspective. Four measurements: reliability, availability, instability, degradation.

22 Problems associated with 'solution' analysis. The relative importance of the metrics varies between the users of the metrics. The users do not have a consistent set of requirements for the application. The configuration of the distributed solution changes over time. Very little research has been performed into the behaviour of solutions on customer sites, i.e. big opportunities.

23 Example of the analysis of a 'distributed' solution: VMS clusters. The behaviour of individual systems was captured. The configuration of the cluster was captured. Correlating the data gives the cluster behaviour. [Diagram: outage timelines for Node A, Node B, and Node C, asking when the cluster as a whole is down.]
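A minimal sketch of the correlation step, under the assumption that "cluster down" means every node is down simultaneously (which is what the Node A/B/C diagram appears to ask): intersect the per-node outage intervals.

```python
def cluster_outages(node_outages):
    """The cluster is down only while every node is down at once:
    intersect the per-node outage intervals pairwise.
    `node_outages` maps node name -> non-overlapping (start, end)
    intervals (numbers or datetimes)."""
    def intersect(a, b):
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
            if start < end:
                out.append((start, end))
            if a[i][1] < b[j][1]:   # advance whichever ends first
                i += 1
            else:
                j += 1
        return out

    per_node = [sorted(iv) for iv in node_outages.values()]
    result = per_node[0]
    for iv in per_node[1:]:
        result = intersect(result, iv)
    return result

node_outages = {"A": [(2, 5), (10, 12)], "B": [(4, 11)],
                "C": [(0, 6), (9, 13)]}
print(cluster_outages(node_outages))  # -> [(4, 5), (10, 11)]
```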

24 Characterizing VMS cluster behaviour. [Chart: OpenVMS VAX cluster behaviour, plotting annual rate of outages (0-60) and average downtime (0-3.5) against the number of servers in the cluster (1-6); series: cluster reliability, cluster downtime.] ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

25 Characterizing VMS cluster behaviour: characterizing instability (recoverability). [Chart: OpenVMS cluster behaviour using a one-week filter, plotting annual rate of outages (0-8) and periods of instability (0-100) against the number of servers in the cluster (1-6); series: cluster reliability, periods of instability.] ©FTSC 1999, Madison, Murphy, Davies, Compaq Corporation.

26 Opportunities for research into characterizing solution behaviour. Developing metrics to characterize solution behaviour. Understanding the relationships between the metrics, e.g. identifying network availability as the difference between end-user availability and system availability. Correlating the relationship between configuration and end-user behaviour. The difficulty is monitoring production sites.
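As a toy reading of the network-availability example (my arithmetic, not the talk's): if the system itself delivered 99.5% availability but end users observed only 98.8%, the 0.7-point gap is attributed to the network.

```python
def implied_network_unavailability(end_user_avail, system_avail):
    """Attribute the gap between what the system delivered and what
    the end user observed to the network, per the slide's example.
    Availabilities are fractions in [0, 1]; assumes the end user
    never sees more availability than the system provides."""
    return max(0.0, system_avail - end_user_avail)

print(f"{implied_network_unavailability(0.988, 0.995):.3f}")  # -> 0.007
```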

