
1 Trust-Sensitive Scheduling on the Open Grid Jon B. Weissman with help from Jason Sonnek and Abhishek Chandra Department of Computer Science University of Minnesota Trends in HPDC Workshop Amsterdam 2006

2 Background Public donation-based infrastructures are attractive –positives: cheap, scalable, fault tolerant (UW-Condor, *@home, ...) –negatives: “hostile”: uncertain resource availability/connectivity, node behavior, end-user demand => best-effort service

3 Background Such infrastructures have been used for throughput-based applications –just make progress, all tasks are equal Service applications are more challenging –not all tasks are equal –explicit boundaries between user requests –may even have SLAs, QoS, etc.

4 Service Model Distributed Service –request -> set of independent tasks –each task mapped to a donated node –makespan –e.g., BLAST service: a user request (input sequence) + a chunk of the DB form a task

5 BOINC + BLAST workunit = input_sequence + chunk of DB generated when a request arrives

6 The Challenge Nodes are unreliable –timeliness: heterogeneity, bottlenecks, … –cheating: hacked, malicious (> 1% of SETI@home nodes), misconfigured –failure –churn For a service, this matters

7 Some data: timeliness. Computation heterogeneity: both across and within nodes. Communication heterogeneity: both across and within nodes. PlanetLab: lower bound

8 The Problem for Today Deal with node misbehavior Result verification –application-specific verifiers: not general –redundancy + voting Most approaches assume ad-hoc replication –under-replicate: task re-execution (increases latency) –over-replicate: wasted resources (reduces throughput) Using information about the past behavior of a node, we can intelligently size the amount of redundancy

9 System Model

10 Problems with ad-hoc replication [Figure: task x sent to group A, task y sent to group B; legend: unreliable node, reliable node]

11 Smart Replication Reputation –ratings based on past interactions with clients –simple sample-based probability (r_i) over a window of past interactions –extend to worker group (assuming no collusion) => likelihood of correctness (LOC) Smarter Redundancy –variable-sized worker groups –intuition: higher-reliability clients => smaller groups
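
A minimal sketch of the sample-based rating described on this slide, assuming that a rating is simply the fraction of correct results among a node's recorded interactions. The class name `NodeRating`, the `prior` value used for a node with no history, and the `window` parameter are illustrative assumptions, not names from the talk.

```python
from collections import deque


class NodeRating:
    """Sample-based reliability estimate r_i for one node (hypothetical helper)."""

    def __init__(self, window=None, prior=0.5):
        # window=None keeps the whole history; an integer keeps only the
        # most recent `window` results (older ones are dropped automatically).
        self.results = deque(maxlen=window)
        # Assumed rating for a node that has not been observed yet.
        self.prior = prior

    def record(self, correct):
        """Store the outcome of one interaction (True if the result was correct)."""
        self.results.append(bool(correct))

    def value(self):
        """Fraction of correct results among the retained interactions."""
        if not self.results:
            return self.prior
        return sum(self.results) / len(self.results)
```

For example, recording the outcomes correct, correct, wrong, correct, correct with `window=4` keeps only the last four results, so the rating is 3/4.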

12 Terms LOC (Likelihood of Correctness) of a group g –the ‘actual’ probability of getting a correct answer from a group of clients (group g) Target LOC –the task success rate that the system tries to ensure while forming client groups –related to the statistics of the underlying distribution
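
One way to make the LOC of a group concrete is sketched below. It assumes that results are combined by strict majority voting and that nodes answer independently (no collusion), with each node correct with probability equal to its rating. The slides do not spell out the exact formula, so this is an illustration under those assumptions; the function name is hypothetical.

```python
def likelihood_of_correctness(ratings):
    """Probability that a strict majority of the group returns a correct result.

    Assumes independent nodes, each correct with probability equal to its rating.
    """
    # dist[k] = probability that exactly k of the nodes processed so far are correct
    dist = [1.0]
    for r in ratings:
        nxt = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            nxt[k] += p * (1.0 - r)   # this node is wrong
            nxt[k + 1] += p * r       # this node is correct
        dist = nxt
    majority = len(ratings) // 2 + 1
    return sum(dist[majority:])
```

With three nodes rated 0.9 each, at least two must be correct: 3 * 0.9^2 * 0.1 + 0.9^3 = 0.972. Three nodes rated 0.5 give exactly 0.5, which is why unreliable nodes need larger groups to reach the same target.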

13 Trust-Sensitive Scheduling Guiding metrics –throughput: the number of successfully completed tasks in an interval –success rate s: ratio of throughput to number of tasks attempted
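
The two guiding metrics can be computed directly from the outcomes of the tasks attempted in one interval. The sketch below assumes one boolean per attempted task; the function name is illustrative.

```python
def interval_metrics(outcomes):
    """Return (throughput, success_rate) for one scheduling interval.

    outcomes: one bool per task attempted, True if the task completed successfully.
    """
    attempted = len(outcomes)
    throughput = sum(outcomes)  # number of successfully completed tasks
    success_rate = throughput / attempted if attempted else 0.0
    return throughput, success_rate
```

For an interval with 120 attempted tasks of which 90 succeed, the throughput is 90 and the success rate is 0.75.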

14 Scheduling Algorithms First-Fit –attempt to form the first group that satisfies the target LOC Best-Fit –attempt to form a group that best satisfies the target LOC Random-Fit –attempt to form a random group that satisfies the target LOC Fixed-size –randomly form fixed-size groups, ignoring client ratings Random and Fixed are our baselines. Min group size = 3
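
The sketch below contrasts First-Fit with the Fixed-size baseline. It assumes that First-Fit walks the available nodes in the order given and stops at the first prefix that has at least the minimum size and meets the target LOC, and that LOC is the majority-vote probability under independence. Both are assumptions for illustration; the slides do not give the exact procedure, and all function names are hypothetical.

```python
import random


def loc(ratings):
    """Majority-vote likelihood of correctness, assuming independent nodes."""
    dist = [1.0]  # dist[k] = P(exactly k correct so far)
    for r in ratings:
        nxt = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            nxt[k] += p * (1.0 - r)
            nxt[k + 1] += p * r
        dist = nxt
    return sum(dist[len(ratings) // 2 + 1:])


def first_fit(available, target, min_size=3):
    """Grow a group node by node; return the first group whose LOC meets the target.

    available: list of (node_id, rating) pairs. Returns None if no prefix qualifies.
    """
    group = []
    for node in available:
        group.append(node)
        if len(group) >= min_size and loc([r for _, r in group]) >= target:
            return group
    return None


def fixed_size(available, size=3, rng=random):
    """Baseline: pick `size` nodes at random, ignoring ratings."""
    return rng.sample(available, size)
```

With nodes rated 0.9, three nodes already reach an LOC of 0.972, so a target of 0.95 yields a group of three. With nodes rated 0.6 and a target of 0.65, three nodes give about 0.648, so the group grows to five before the target is met.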

15 Scheduling Algorithms

16 Scheduling Algorithms (cont’d)

17 Different Groupings (target LOC = 0.5)

18 Evaluation Simulated a wide variety of node reliability distributions Set the target LOC to be the success rate of Fixed –goal: match the success rate of Fixed (which over-replicates) yet achieve higher throughput –if desired, can drive throughput even higher (but success rate would suffer)

19 Comparison Gain: 25-250%. Open question: how much better could we have done?

20 Non-stationarity Nodes may suddenly shift gears –deliberately malicious, virus, detach/rejoin –underlying reliability distribution changes Solution –window-based rating (reduce the rating window from infinite) Experiment: “blackout” at round 300 (30% affected)
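
To see why a finite window helps after a blackout, the toy simulation below follows a single node that is always correct before round 300 and always wrong afterwards, and counts how many post-blackout rounds pass before its rating falls below a threshold. The node model, the threshold of 0.5, and the function name are assumptions made for illustration; this is not the experiment from the talk.

```python
from collections import deque


def rounds_to_detect(window, blackout_round=300, threshold=0.5, max_rounds=10_000):
    """Rounds after the blackout until the node's rating drops below `threshold`.

    window=None means an infinite window (the full history is kept).
    """
    results = deque(maxlen=window)
    for rnd in range(max_rounds):
        # The node is correct before the blackout and wrong from then on.
        results.append(rnd < blackout_round)
        rating = sum(results) / len(results)
        if rnd >= blackout_round and rating < threshold:
            return rnd - blackout_round + 1
    return None
```

Under these assumptions a window of 20 rounds flags the node after 11 bad results, while the infinite window needs 301, because 300 good results from the past keep propping up the rating.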

21 Role of the target LOC Key parameter Too large –groups will be too large (low throughput) Too small –groups will be too small (low success rate) Adaptively learn it (parameterless) –maximizing throughput * s: “goodput” –or could bias toward throughput or s

22 Adaptive algorithm Multi-objective optimization –choose the target LOC to simultaneously maximize throughput and success rate s –use a weighted combination of throughput and s (one weight per objective) to reduce multiple objectives to a single objective –employ hill-climbing and feedback techniques to control dynamic parameter adjustment
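
A rough sketch of how hill-climbing with feedback could drive the target LOC is shown below. It assumes a fixed step size, a linear weighted sum of throughput and success rate as the single objective, and a simple rule that reverses direction whenever the objective got worse since the previous interval. The class name, parameter names, default values, and clamping bounds are all hypothetical, and throughput is assumed to be normalized to a scale comparable with the success rate.

```python
class TargetTuner:
    """Hill-climbing controller for the target LOC (illustrative sketch)."""

    def __init__(self, target=0.9, step=0.02, w_throughput=1.0, w_success=1.0,
                 lo=0.5, hi=0.999):
        self.target = target
        self.step = step
        self.w_throughput = w_throughput  # weight on throughput
        self.w_success = w_success        # weight on success rate
        self.lo, self.hi = lo, hi         # assumed bounds for the target
        self.direction = 1                # +1 raises the target, -1 lowers it
        self.prev_score = None

    def update(self, throughput, success_rate):
        """Feed back the metrics of the last interval and return the new target."""
        score = self.w_throughput * throughput + self.w_success * success_rate
        # If the last move made things worse, turn around.
        if self.prev_score is not None and score < self.prev_score:
            self.direction = -self.direction
        self.prev_score = score
        moved = self.target + self.direction * self.step
        self.target = min(self.hi, max(self.lo, moved))
        return self.target
```

Setting `w_success=0` makes the controller optimize throughput alone, and setting `w_throughput=0` makes it optimize success rate alone. With a fixed step, the target ends up oscillating within one step of a local optimum rather than converging exactly.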

23 Adapting the target LOC: blackout example

24 Throughput (weight 1 = 1, weight 2 = 0)

25 Current/Future Work Implementation of reputation-based scheduling framework (BOINC and PlanetLab) Mechanisms to retain node identities (hence r_i) under node churn –“node signatures” that capture the characteristics of the node

26 Current/Future Work (cont’d) Timeliness –extending reliability to encompass time –a node whose performance is highly variable is less reliable Client collusion –detection: group signatures –prevention: combine quiz-based tasks with reputation systems; form random groupings

27 Thank you.

