Evaluating Provider Reliability in Risk-aware Grid Brokering Iain Gourlay

2 Outline - AssessGrid background - Problem Statement - Basic Reliability - Analysis of behaviour - Stationarity Problem - Weighted Reliability - Simulations and Results - What if a provider is unreliable? - Alternative: Bayesian Inference - Summary and Conclusions

3 AssessGrid Background AssessGrid addresses Risk Management in the Grid. This is a necessity in the drive towards commercialisation of Grid technology… - The goal is to move beyond best-effort, using SLAs to specify an agreed-upon level of service. However, - For resource providers, offering an SLA with service guarantees and penalties is a business risk! - For end-users, agreeing to an SLA is a business risk! A large part of AssessGrid is concerned with supporting providers with tools and methods to: - Monitor and collect useful data. - Assess the risk associated with accepting an SLA request, based on this data.

4 What is risk? Risk is "hazard, danger, exposure to mischance or peril" (Oxford English Dictionary). Risk Management is a discipline that addresses the possibility that future events may cause adverse effects. - Economics, Operations Research, Engineering, Gambling, … In Risk Management, risk is quantified with two parameters: Risk = Probability of Occurrence × Impact. In Grid computing, the event is SLA failure!

5 Scenario

6 Role of the Broker Key role: Finding/Negotiating with providers on behalf of end-users. Broker can also act as an independent party: - Providers may have motivation to lie! - Providers may have unidentified problems in their infrastructure. Here, we assume the broker is independent and honest. Broker can give a second opinion on risk assessments. Broker can agree its own SLAs (virtual provider).

7 Problem statement: What do we mean by reliability? A provider makes an SLA offer: - It includes an estimate of the Probability of Failure (PoF). Each time an offer is accepted, the details are stored in a database, including: - Final status (Success/Fail) - Offered PoF The problem is: given a provider's past data, can their risk assessments be considered reliable?

8 What is reliable? Considering only systematic errors! Assume s SLAs in the database for the same provider. - Offered PoFs, Assume number of fails ~ We define a reliable provider as one that does not systematically underestimate or overestimate the PoF, so that:

9 Is it normal?

10 Is it normal? (2)

11 Basic Reliability: Identifying Systematic Errors Using the provider's offered PoFs: The evaluation is based on the following measure:

12 Basic Reliability: Identifying Systematic Errors (2)

13 Basic Reliability: Identifying Systematic Errors(3) We note that and recall the condition, leading to

14 Analysis: How does the measure behave? Simple Example: m SLAs in database. Offered PoF is constant, p. There is a systematic overestimation/underestimation of the PoF, such that:

15 Analysis (2)

16 Stationarity Problem Conditions are not static! - Example: 60 red balls in a bag. 40 blue balls in the same bag. You try to estimate the number of red balls by taking a ball out and replacing it, repeating this 50 times. Someone is secretly removing a red ball and replacing it with a blue one after every sample. Expected number of reds drawn: E(red) ≈ 17.5 out of 50, which suggests about 35 reds in the bag. Actual number of reds left at the end: 10!
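The arithmetic behind the ball example can be checked with a short sketch. It assumes the swap happens after each draw, so the j-th draw (counting from 0) sees 60 - j reds among 100 balls; the function name and parameters are illustrative only.

```python
def expected_reds_drawn(reds=60, total=100, draws=50):
    """Expected number of red draws when one red ball is swapped
    for a blue one after every draw (sampling with replacement)."""
    return sum((reds - j) / total for j in range(draws))


expected = expected_reds_drawn()          # about 17.75 under this reading
naive_estimate = expected / 50 * 100      # reds implied by the sample, about 35.5
reds_left = 60 - 50                       # reds actually in the bag at the end: 10
```

Under this reading the expectation is 17.75; if the swap happens before each draw instead, it is 17.25. The slide's 17.5 sits between the two. Either way, the sample suggests roughly 35 reds while only 10 remain.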

17 Stationarity Problem (2) A provider's behaviour could change as a consequence of a variety of factors, e.g. - A provider's infrastructure is updated. - A provider's risk assessment methodology or model parameterisation may change. - A provider's policy may change, for example due to economic considerations.

18 Weighted Reliability Use a weighted average, ensuring more recent SLAs have a larger influence. A total of mk SLAs is split into k categories, with the k-th consisting of the most recent SLAs. Here, the per-category term is the basic measure R computed over the i-th category.

19 Simulations A database of SLAs is generated: - Each SLA object has an offered PoF, true PoF and final status. Reliability is computed. The process is repeated 10000 times for each scenario. Simple case considered here: - Offered PoF is fixed and true PoF is fixed.
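The simulation loop described above can be sketched as follows. The reliability measure itself did not survive in this text, so the code uses an assumed stand-in: the standardised excess of observed failures over the number expected from the offered PoFs. All function names are hypothetical, and the offered PoF must lie strictly between 0 and 1 for the stand-in to be defined.

```python
import math
import random


def simulate_database(n_slas, offered_pof, true_pof, rng):
    """Generate SLA records (offered PoF, true PoF, failed?) for one provider."""
    return [(offered_pof, true_pof, rng.random() < true_pof)
            for _ in range(n_slas)]


def standardised_excess_failures(records):
    """ASSUMED stand-in for the reliability measure, not the slides' formula:
    (observed fails - expected fails) / std dev, both under the offered PoFs."""
    expected = sum(offered for offered, _, _ in records)
    variance = sum(offered * (1 - offered) for offered, _, _ in records)
    observed = sum(1 for _, _, failed in records if failed)
    return (observed - expected) / math.sqrt(variance)


def mean_measure(n_slas, offered_pof, true_pof, repeats=10000, seed=0):
    """Average the measure over many independently generated databases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        records = simulate_database(n_slas, offered_pof, true_pof, rng)
        total += standardised_excess_failures(records)
    return total / repeats
```

With this stand-in, a provider whose offered PoF equals the true PoF averages close to zero, while one that offers 0.1 when the true PoF is 0.2 averages well above zero (systematic underestimation).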

20 Results

21 Results (2)

22 Results (3)

23 Results (4)

24 Results (5)

25 What if the provider is unreliable? Discrete approximation: When an SLA offer is received with an offered PoF of p, estimate the PoF by looking at the failure rate for all SLAs with an offered PoF of ~p. Then: If (|reliability measure| < threshold) believe the provider. Else PoF estimate = numFails(PoF~p)/numSLAs(PoF~p). Use all SLAs with an offered PoF within x% of the offered PoF in the current SLA.
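A minimal sketch of this rule, assuming the history is a list of (offered PoF, failed?) pairs and that "within x%" means a relative band around p. The function name, the `tolerance` parameter and the `None` return for an empty band are illustrative choices, not part of the original method.

```python
def estimate_pof(offered_pof, history, reliability, threshold, tolerance=0.10):
    """Return the PoF the broker should use for a new SLA offer.

    history     -- list of (offered_pof, failed) pairs from past SLAs
    reliability -- value of the provider's reliability measure
    tolerance   -- the "x%" band, as a fraction (assumed relative to p)
    """
    # Provider looks reliable: accept the offered PoF as it stands.
    if abs(reliability) < threshold:
        return offered_pof

    # Otherwise use the observed failure rate among comparable past SLAs.
    low = offered_pof * (1 - tolerance)
    high = offered_pof * (1 + tolerance)
    similar = [failed for p, failed in history if low <= p <= high]
    if not similar:
        return None  # no comparable SLAs; the caller must decide what to do
    return sum(similar) / len(similar)
```

For example, with four past SLAs offered at about 0.1 of which one failed, an unreliable provider's new offer of 0.1 would be re-estimated as 0.25.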

26 Weighted Average risk assessment Split km SLAs into k categories. Compute the estimated PoF for each category, i=0,…,k-1.
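The per-category estimates can then be combined so that recent categories count for more. The weighting formula is not given in this text, so the sketch below assumes linearly increasing weights (i + 1 for category i, oldest first) purely for illustration.

```python
def weighted_pof_estimate(outcomes, k):
    """Weighted-average PoF from SLA outcomes ordered oldest to newest.

    outcomes -- list of booleans (True = SLA failed); length must be k * m
    k        -- number of categories; category k-1 holds the newest SLAs
    """
    if k <= 0 or len(outcomes) == 0 or len(outcomes) % k != 0:
        raise ValueError("need a non-empty history whose length is a multiple of k")
    m = len(outcomes) // k

    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(k):
        category = outcomes[i * m:(i + 1) * m]
        pof_i = sum(category) / m   # failure rate within category i
        weight = i + 1              # ASSUMED weights: newer categories count more
        weighted_sum += weight * pof_i
        weight_total += weight
    return weighted_sum / weight_total
```

With two categories where the older one has no failures and the newer one fails every time, the estimate is 2/3 rather than the unweighted 1/2.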

27 Never Trust Doctors You are tested for a disease, which 2% of the population has. The test never gives a false negative. If you are clear, there is still a 5% chance of a false positive. You test positive. What is the probability you have the disease?
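The answer follows from Bayes' theorem and can be checked with a few lines. The function and parameter names are illustrative; "never gives a false negative" is taken to mean a sensitivity of 1.

```python
def posterior_disease(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' theorem."""
    # Total probability of a positive test: true positives + false positives.
    p_positive = (prevalence * sensitivity
                  + (1 - prevalence) * false_positive_rate)
    return prevalence * sensitivity / p_positive


# 2% prevalence, no false negatives, 5% false positives.
p = posterior_disease(0.02, 1.0, 0.05)   # 0.02 / 0.069, about 0.29
```

So a positive result means roughly a 29% chance of having the disease, because false positives from the healthy 98% outnumber the true positives.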

28 Alternative Approach: Bayesian Inference The provider offers a linguistic risk assessment, e.g. the failure probability is: - extremely low: <1% - very low: 1-5% - low: 5-10% - medium: 10-20% - high: 20-30% - very high: 30-50% - extremely high: >50% If the broker or end-user requests the exact PoF value, it can be provided.

29 Alternative Approach: Bayesian Inference (2) The broker does not consider the provider's reliability directly. Instead it takes the following approach: - Having received a linguistic risk assessment for a new SLA, the broker first computes a prior distribution for the PoF, given the linguistic category, by considering data across all other providers. - The broker computes a posterior distribution, based on the failure rate observed in past SLAs from the same provider with the same linguistic risk assessment. - The broker returns an object which contains: (PoF_broker, confidence)
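One way to sketch the prior-to-posterior step is a conjugate Beta-Binomial update. This is an assumption: the text does not state which distributions are used or how confidence is defined. Here the prior is a Beta(alpha, beta) summarising other providers in the same linguistic category, and the posterior standard deviation stands in for confidence (smaller means more confident).

```python
import math


def beta_posterior(prior_alpha, prior_beta, fails, successes):
    """Update an ASSUMED Beta prior on the PoF with one provider's outcomes.

    Returns (posterior mean, posterior standard deviation).
    """
    alpha = prior_alpha + fails       # prior pseudo-failures + observed failures
    beta = prior_beta + successes     # prior pseudo-successes + observed successes
    mean = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, math.sqrt(variance)


# Hypothetical numbers: a prior centred on 5% ("very low" / "low" boundary),
# then 3 failures in 20 past SLAs from this provider in the same category.
pof_broker, spread = beta_posterior(2, 38, fails=3, successes=17)
```

With these hypothetical numbers the posterior mean moves from the prior's 5% to about 8.3%, reflecting a provider that fails more often than its peers in the same category.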

30 Alternative Approach: Bayesian Inference (3)

31 Summary/Conclusions A detailed analysis has been carried out for a method to identify providers who are systematically unreliable. The stationarity problem has been addressed. - Weighted Average - Results indicate good performance relative to the basic measure and a moving average. This can be extended to other measures for non-systematic errors. A Bayesian approach has been considered and is also promising.

