Deepak Poola Chandrashekar


1 Deepak Poola Chandrashekar
PhD Completion Seminar: Robust and Fault-Tolerant Scheduling for Scientific Workflows in Cloud Computing Environments
Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, The University of Melbourne, Australia
Deepak Poola Chandrashekar
Supervisor: Prof. Rajkumar Buyya
Co-Supervisor: Prof. Kotagiri Ramamohanarao

2 Clusters Super Computers Grid Cloud Computing
Web service oriented architecture. Provenance, reliability, results derivation, and sharing. Shell scripts and complex job-control language. Streaming workflows, IoT-based workflows. RPC, DFS, databases, distributed object technologies. Specialized workflows for data-intensive, compute-intensive, and instance-intensive applications.

3 Workflows
Scientific workflow systems aim to automate large, complex data analyses to make them easier for scientists.
Workflows are collections of tasks that are data dependent or control dependent.
Workflows can be represented as Directed Acyclic Graphs (DAGs).
Workflow scheduling maps tasks to resources whilst maintaining dependencies.
Terminology: makespan, cost, deadline, budget.
[Figure: Sample Workflow]
With the increase in processing power and computation tools, computation has become a “third branch” of scientific research alongside theory and experiment. Scientific workflows aim to automate these complex analyses and facilitate collaborative research.

4 Cloud Computing
On-demand
Dynamically configured and delivered on demand
Highly scalable
Provides different levels of services
Driven by market principles
Pay as you go
Cloud computing is a model for enabling ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

5 Cloud Variants (image courtesy: Windows Azure)

6 Other Inherent Challenges
Cloud Computing: performance variance among different types of VMs; resource-level failures; virtualization overhead; failures or outages in Clouds.
Other Inherent Challenges: network latencies and failures; task-level failures; failures associated with storage and I/O; failures in workflow management systems.

7 Spot Instances
Started by Amazon around December 2009
Idle or unused datacenter capacity
Spot price is decided in an auction-like mechanism
Varies with time and instance type
Varies between regions and availability zones
Bid should be higher than or equal to the spot price
Offers up to 60% cost reductions

8 Thesis Research Question
How to make workflow scheduling algorithms robust and fault-tolerant for cloud computing environments?

9 Thesis Contributions
Novel scheduling heuristics
Resource provisioning policies
Bidding strategies
A performance evaluation study on time, cost and fault-tolerance
Metrics to measure robustness and fault-tolerance
Multi-cloud resource plug-in

10 Thesis Organization

11 Literature Survey
Derived from publication: Poola D., Salehi M. A., Rao K., and Buyya R., A Taxonomy and Survey of Fault-Tolerant Workflow Management Systems in Cloud and Distributed Computing Environments, submitted to ACM Computing Surveys. Under review.

12 Components of workflow scheduling

13 Taxonomy of Fault-Tolerant Scheduling Techniques

14 Existing Workflow Management Systems

15 Thesis Organization

16 Robust Scheduling of Scientific Workflows
Derived from publication: Poola D., Garg S.K., Buyya R., Yang Y., and Rao K., Robust Scheduling of Scientific Workflows with Deadline and Budget Constraints in Clouds, Proceedings of the 28th IEEE International Conference on Advanced Information Networking and Applications (AINA-2014), Victoria, Canada.

17 Objective
Find a schedule that maps workflow tasks onto heterogeneous Cloud resources such that:
Robustness is maximized
Makespan and cost are minimized
Constraints: deadline and budget
To the best of our knowledge, no study of workflow scheduling algorithms for Clouds maximizes robustness while minimizing makespan and cost at the same time. Very few works schedule workflow tasks on heterogeneous Cloud resources. This study tries to address these shortcomings.

18 Robustness
Absorbs some degree of uncertainty while maintaining a stable solution
Robustness alone is not a metric
It gives an idea of the stability of the solution
Much needed in mission-critical and time-critical applications
Achieved with redundancy in time or space

19 System Model
Cloud environment: a single datacenter with heterogeneous resources is considered.
Uncertainties: two types are considered, task failures and performance variations of VMs.
Workflow: represented as a DAG.
Data transfer times between tasks are considered.

20 Definitions
Makespan: the total elapsed time required to execute the entire workflow.
Total cost: the sum of the prices of all VMs, of all types, that are used.
Robustness metrics:
Robustness probability: the likelihood that the workflow finishes before the given deadline.
Tolerance time: the amount of time a workflow can be delayed without violating the deadline constraint.
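The definitions on this slide can be made concrete with a small sketch. It assumes that VM usage is billed per hour, that tolerance time is the gap between deadline and makespan (clamped at zero), and that robustness probability is estimated empirically as the fraction of simulated runs that finish by the deadline. All names and fields are illustrative and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    price_per_hour: float  # assumed hourly price of this VM type
    hours_billed: int      # assumed number of billed hours


def total_cost(usages):
    """Sum the price of all VMs used, across all VM types."""
    return sum(u.price_per_hour * u.hours_billed for u in usages)


def makespan(task_start_times, task_finish_times):
    """Total elapsed time from the first task start to the last task finish."""
    return max(task_finish_times) - min(task_start_times)


def tolerance_time(deadline, makespan_value):
    """Delay the workflow can absorb without missing the deadline (assumed >= 0)."""
    return max(0.0, deadline - makespan_value)


def robustness_probability(simulated_makespans, deadline):
    """Empirical estimate: share of simulated runs finishing by the deadline."""
    finished = sum(1 for m in simulated_makespans if m <= deadline)
    return finished / len(simulated_makespans)
```

In this sketch a schedule with a makespan of 80 against a deadline of 100 has a tolerance time of 20, and its robustness probability depends on how many perturbed runs still finish by 100.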

21 Scheduling Algorithm
A workflow is divided into a set of partial critical paths (PCPs).
The PCP of a node ti consists of its unassigned critical parent (CP) tp, the critical parent of tp, and so on recursively until there are no further unassigned parents.
The CP of a node is the parent for which the sum of start time, execution time and transfer time is the maximum.
A suitable resource is selected for each PCP.
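The recursive definition of a partial critical path can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the dictionaries `parents`, `start`, `exec_time` and `transfer` are hypothetical inputs, and ties between equally critical parents are broken by list order.

```python
def critical_parent(task, parents, assigned, start, exec_time, transfer):
    """Unassigned parent maximising start + execution + transfer time, or None."""
    candidates = [p for p in parents.get(task, []) if p not in assigned]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda p: start[p] + exec_time[p] + transfer[(p, task)],
    )


def partial_critical_path(task, parents, assigned, start, exec_time, transfer):
    """Follow critical parents from `task` until no unassigned parent remains."""
    path = []
    current = critical_parent(task, parents, assigned, start, exec_time, transfer)
    while current is not None:
        path.append(current)
        current = critical_parent(
            current, parents, assigned, start, exec_time, transfer
        )
    path.reverse()  # earliest task first
    return path
```

For a workflow where tasks `a` and `b` both feed `c`, and `c` feeds `d`, the PCP of `d` contains `c` plus whichever of `a` or `b` delivers its data to `c` last.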


28 Scheduling Algorithm (Contd...)
Robustness types: No Robustness, Slack, One node failure, Two node failure.
[Figure: Solution Set. Each candidate solution is labelled with VM, RT and $.]

29 Scheduling Algorithm (Contd...)
PCP assigning policies: Robustness-Cost-Time (RCT), Robustness-Time-Cost (RTC), Weighted.
[Figure: a Solution is chosen from the Feasible Solution Set; each entry is labelled with VM, RT and $.]
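One way to read the three policy names is as different orderings over the feasible solution set. The sketch below assumes RCT and RTC are lexicographic orderings (robustness first, then cost or time) and that the weighted policy uses a linear score over values already normalised to [0, 1]. These interpretations, the `Candidate` type and the weight parameters are assumptions for illustration; the slides do not give the exact selection rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    vm: str
    robustness: float  # assumed normalised, higher is better
    cost: float        # assumed normalised, lower is better
    time: float        # assumed normalised, lower is better


def pick_rct(cands):
    """Robustness-Cost-Time: most robust, then cheapest, then fastest."""
    return min(cands, key=lambda c: (-c.robustness, c.cost, c.time))


def pick_rtc(cands):
    """Robustness-Time-Cost: most robust, then fastest, then cheapest."""
    return min(cands, key=lambda c: (-c.robustness, c.time, c.cost))


def pick_weighted(cands, w_r, w_c, w_t):
    """Weighted: maximise a linear trade-off between the three objectives."""
    return max(
        cands,
        key=lambda c: w_r * c.robustness - w_c * c.cost - w_t * c.time,
    )
```

With two equally robust candidates, RCT prefers the cheaper one and RTC the faster one, which is the only difference between the two lexicographic policies in this sketch.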

30 Simulation Setup: Application Model
Considered 5 workflows as presented by Bharathi et al. (WORKS’08):
Montage: Astronomy
Epigenomics: Biology
LIGO: Gravitational Physics
SIPHT: Biology
CyberShake: Earthquake Science

31 Simulation Setup: Resource Model
Considered 10 types of VMs.
The fastest VM has 5 times more processing power than the slowest VM; prices vary proportionately.
VMs differ with respect to memory, CPU cores, MIPS and OS.
Virtually unlimited resources are assumed.
Data transfer costs are not considered.
Storage costs are not considered.
The Cloud in our model has only one datacenter.

32 Simulation Setup: Failure Model
Two types of failure models are considered.
Failure traces: failures are simulated from traces, using the Condor Grid failure dataset from the Failure Trace Archive.
Failure probability: a 10% failure probability is considered.
A checkpointing mechanism is assumed.
Performance variation (ICSOC’09):
A random variable from a normal distribution is added to the execution time.
The standard deviation of this distribution is a function of the execution time.
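The performance variation model can be sketched in a few lines. The slide only says that the standard deviation is a function of the execution time; the sketch assumes a simple proportional relationship controlled by a hypothetical `variation` parameter and clamps the result at zero so that a runtime is never negative.

```python
import random


def perturbed_execution_time(exec_time, variation=0.1, rng=random):
    """Add zero-mean Gaussian noise to a task's nominal execution time.

    Assumption: sigma = variation * exec_time (the source does not give
    the exact function).
    """
    sigma = variation * exec_time
    return max(0.0, exec_time + rng.gauss(0.0, sigma))
```

Passing a seeded `random.Random` instance as `rng` makes simulated runs repeatable.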

33 Simulation Setup: Reference Algorithms
Two reference algorithms are used.
ICPCP (FGCS’12)
Robust bi-objective genetic algorithm (GA) (ICoCC’2006)
These algorithms are chosen for their similarity with our approach. ICPCP schedules tasks by grouping them into PCPs, similar to our algorithm, and the GA tries to maximize the slack time to be robust, which is the approach we adopt as well.

34 Results: Effect on Robustness
Tolerance Time

35 Results: Effect on Robustness
Robustness Probability

36 Results: Effect on Makespan

37 Results: Effect on Cost

38 Summary
We presented three resource allocation policies.
The objectives are robustness, makespan and cost.
Slack time is added judiciously to make the schedule robust.
Tested with 2 failure models, 5 workflows and 2 robustness metrics.
Results indicate that our policies provide robust schedules with minimal makespan and a slight increase in cost.

39 Thesis Organization

40 Fault-Tolerant Scheduling Using Spot Instances
Derived from publication: Poola D., Rao K., and Buyya R., Fault-Tolerant Workflow Scheduling Using Spot Instances on Clouds, Proceedings of the 13th International Conference on Computational Science (ICCS-2014), Cairns, Australia.

41 Objective
A just-in-time and adaptive scheduling heuristic
Using spot and on-demand instances
An intelligent bidding strategy
Minimizes the execution cost
Provides a robust schedule
Satisfies the deadline constraint

42 Background
A workflow is represented as a DAG.
Makespan is the total elapsed time.
Pricing models: on-demand and spot.
The critical path is the longest path from the start node to the exit node.

43 Latest Time to On-Demand (LTO)
It is the latest time at which the algorithm has to switch to on-demand instances to satisfy the deadline constraint.
[Timeline figure: Start, LTO, Deadline; spot instances before the LTO, on-demand instances after it.]
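A plausible way to compute the LTO is to subtract the critical path length from the deadline. That formula is an assumption consistent with the definition here, not a quotation from the thesis; the function names and the `runtimes` and `parents` dictionaries are illustrative. The sketch uses the standard-library `graphlib` module (Python 3.9+) to walk the DAG in topological order.

```python
from graphlib import TopologicalSorter


def critical_path_length(runtimes, parents):
    """Length of the longest path through the DAG, by summed task runtimes."""
    graph = {task: set(parents.get(task, ())) for task in runtimes}
    finish = {}
    for task in TopologicalSorter(graph).static_order():
        ready = max((finish[p] for p in graph[task]), default=0.0)
        finish[task] = ready + runtimes[task]
    return max(finish.values())


def latest_time_to_on_demand(deadline, runtimes, parents):
    """Assumed LTO: deadline minus the critical path length."""
    return deadline - critical_path_length(runtimes, parents)
```

Estimating `runtimes` on a slower instance type yields a longer critical path and therefore an earlier LTO, while a faster type yields a later one.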

44 Runtime Estimation
We use Downey’s analytical model.
Downey’s model requires: the task’s average parallelism, A; the coefficient of variance of parallelism, σ; the task length; and the number of cores.
The model of Cirne et al. is used to generate A and σ.
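Downey's speedup model, as it is commonly stated, gives the speedup on n cores as a piecewise function of A and σ, with a low-variance case (σ ≤ 1) and a high-variance case (σ > 1). The sketch below follows that common formulation and assumes that the task length is the runtime on a single core, so that the estimated runtime is the length divided by the speedup. The exact way the thesis combines these quantities is not shown on the slide.

```python
def downey_speedup(n, A, sigma):
    """Speedup on n cores for average parallelism A and variance sigma."""
    if sigma <= 1.0:  # low-variance model
        if n <= A:
            return (A * n) / (A + sigma * (n - 1) / 2.0)
        if n <= 2 * A - 1:
            return (A * n) / (sigma * (A - 0.5) + n * (1.0 - sigma / 2.0))
        return A
    # high-variance model
    if n <= A + A * sigma - sigma:
        return (n * A * (sigma + 1.0)) / (sigma * (n + A - 1.0) + A)
    return A


def estimated_runtime(task_length, n, A, sigma):
    """Assumed: task_length is the single-core runtime of the task."""
    return task_length / downey_speedup(n, A, sigma)
```

In both cases the speedup is 1 on a single core and saturates at A once enough cores are available.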

45 Failure Estimator
Estimates the failure probability of a particular bid price.
Based on the spot price history.
The price history of the one month prior is considered.
The total time of the spot price history, HT, and the total out-of-bid time, OBT_{bid_t}, are measured.
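A natural estimator from these two measurements is the ratio of out-of-bid time to total history time. The sketch below assumes exactly that ratio, and assumes the price history is available as a list of `(start_time, end_time, spot_price)` intervals; both the ratio and the data layout are illustrative assumptions. An interval counts as out of bid when the spot price exceeds the bid.

```python
def failure_probability(price_history, bid):
    """Estimate out-of-bid probability for `bid` as OBT / HT.

    price_history: list of (start_time, end_time, spot_price) intervals.
    """
    total_time = sum(end - start for start, end, _ in price_history)
    if total_time == 0:
        raise ValueError("empty price history")
    out_of_bid_time = sum(
        end - start for start, end, price in price_history if price > bid
    )
    return out_of_bid_time / total_time
```

Raising the bid can only shrink the out-of-bid time, so the estimated failure probability is non-increasing in the bid price.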

46 Scheduling Algorithm

47 Scheduling Algorithm (Contd..)

48 Scheduling Algorithm (Contd..)

49 Two Types of Scheduling Algorithms
Conservative: CP and LTO are estimated on the lowest-cost instance. The CP is the longest, hence there is less slack time. Uses spot instances cautiously under relaxed deadlines.
Aggressive: CP and LTO are estimated on the highest-cost instance. The CP is the shortest, hence there is more slack time. Opts for expensive on-demand instances under failures.

50 Intelligent Bidding Strategy
Current spot price (p_spot); on-demand price (p_OD); failure probability (FP) of the previous bid price; LTO; current time (CT); α; β

51 Intelligent Bidding Strategy
α: dictates how much higher the bid value must be above the current spot price.
β: determines how fast the bid value reaches the on-demand price.
The FP of the previous bid is used as feedback for the current bid price.

52 Intelligent Bidding Strategy

53 Other Bidding Strategies
On-Demand Bidding Strategy: uses the on-demand price as the bid price.
Naive Bidding Strategy: uses the current spot price as the bid price for the instance.
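The two baseline strategies are simple enough to state directly in code. The helper `count_out_of_bid` is an illustrative addition, not part of the thesis: it counts how many later price points in a hypothetical trace exceed a given bid, which shows why a bid at the current spot price is fragile.

```python
def on_demand_bid(spot_price, on_demand_price):
    """On-demand bidding: bid the on-demand price."""
    return on_demand_price


def naive_bid(spot_price, on_demand_price):
    """Naive bidding: bid the current spot price."""
    return spot_price


def count_out_of_bid(bid, future_prices):
    """Illustrative helper: number of price points strictly above the bid."""
    return sum(1 for price in future_prices if price > bid)
```

On a trace that rises above its starting price, the naive bid is out of bid at every higher point, whereas the on-demand bid survives as long as the spot price stays at or below the on-demand price.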

54 Simulation Setup
CloudSim was used for simulation.
A LIGO workflow with 1000 tasks was considered.
For on-demand, 9 different VM types were considered.
For spot, 1 VM type was used.

55 Results : Comparison between algorithms
Mean execution cost of algorithms with varying deadline (with 95% confidence interval)

56 Results : Comparison between bidding strategies
Mean Execution Cost of bidding strategies with varying deadline (with 95% confidence interval)

57 Results : Task Failures
Mean of task failures due to bidding strategies

58 Results : Checkpointing

59 Summary
Two scheduling heuristics that map workflow tasks onto spot and on-demand instances are presented.
They minimize the execution cost.
They are robust and fault-tolerant towards out-of-bid failures and performance variations.
A bidding strategy that bids intelligently to minimize the cost is presented.
The use of checkpointing is demonstrated, which offers cost savings of up to 14%.

60 Thesis Organization

61 Workflow Scheduling Using Task Replication
Derived from Publication Poola D., Rao K., and Buyya R., Enhancing Reliability of Workflow Execution Using Task Replication and Spot Instances, Submitted to the ACM Transactions on Autonomous and Adaptive Systems. Under Review.

62 Objective
A just-in-time and adaptive scheduling heuristic
Using spot and on-demand instances
A task replication strategy
Minimizes the execution cost
Provides a robust schedule
Satisfies the deadline constraint

63 Essentially Critical Task

64 Proposed Solution: Heuristics
Scenario 1: Mapping a task onto already running spot instances
The LTO is ahead of the current time.
Tasks are mapped onto already running spot instances.
Saves costs.
Consolidates resource utilization.
Among the available free slots, a resource that can finish within the task’s latest finish time is chosen.

65 Proposed Solution: Heuristics
Scenario 2: Mapping a task onto a new spot instance
Applies if no free slot that honors the deadline is found and there is sufficient time to start a spot instance.
The bid price is estimated.
A new spot instance is created.

66 Proposed Solution: Heuristics
Scenario 3: Mapping a task onto an on-demand instance
The LTO is before the current time.
Finds a suitable on-demand instance.
The critical path is calculated for all instance types.
First looks for a free slot.
If there is no free slot, finds an instance that can honor the deadline.
If there is no such instance, creates an on-demand instance.
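Scenarios 1 to 3 can be read as a decision procedure over the current time, the LTO and the available slots. The sketch below is one such reading under stated assumptions: the case where the LTO equals the current time is treated as Scenario 3, "sufficient time to start a spot instance" is interpreted as boot time plus runtime fitting before the task's latest finish time, and on-demand is used as the fallback when neither spot option fits. The `Action` enum and all parameter names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    REUSE_SPOT = "reuse running spot instance"   # Scenario 1
    NEW_SPOT = "start new spot instance"         # Scenario 2
    ON_DEMAND = "use on-demand instance"         # Scenario 3


def choose_action(current_time, lto, free_slot_finish_times,
                  latest_finish, spot_boot_time, runtime_on_spot):
    """Pick where to place a ready task (illustrative reading of the slides)."""
    if current_time >= lto:
        return Action.ON_DEMAND
    # A running spot instance has a free slot that meets the latest finish time.
    if any(finish <= latest_finish for finish in free_slot_finish_times):
        return Action.REUSE_SPOT
    # Enough time remains to boot a new spot instance and run the task.
    if current_time + spot_boot_time + runtime_on_spot <= latest_finish:
        return Action.NEW_SPOT
    return Action.ON_DEMAND  # assumed fallback
```

The order of the checks mirrors the cost preference in the scenarios: reuse what is already paid for, then start a cheap spot instance, and only then pay the on-demand price.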

67 Proposed Solution: Heuristics
Scenario 4: Task duplication under a short deadline
Critical tasks are replicated to provide fault-tolerance.
Two variants are proposed:
Essential Critical Path Task Replication (ECPTR): ESCTs are replicated.
Critical Task Replication (CTR): all critical tasks are replicated.

68 Results: Effect on Fault-Tolerance
Failure Probability Tolerance Time

69 Results: Effect on Makespan
Mean Makespan Resource Consolidation

70 Results: Effect on Cost
Mean Cost Resource Consolidation

71 Thesis Organization

72 Multi-Cloud Integration for WFMS
We extend the Cloudbus workflow management system.
The broker is enhanced to provision resources dynamically.
An effective fault-tolerant technique is developed.
A resource provisioning algorithm for multi-cloud is proposed.

73 Cloudbus Workflow Management System
Workflow Portal I/O Models Many-to-many, many-to-one, synchronization Workflow Engine Workflow submission handler Workflow scheduler Event Service Resource Broker

74 Components of workflow scheduling

75 Motivation for Multi-Cloud
Pricing Billing Periods Resource Characteristics Private Cloud Regulatory and Legal

76 Multi-cloud resource provisioning policy
Create a pool of spot instances.
If a free resource exists, allocate it.
Else, if budget is available, create an on-demand instance.
If there is no budget, find a suitable resource that will complete the task the earliest.
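The provisioning policy is a three-way decision, sketched below for a single task. The sketch assumes that "budget is available" means the remaining budget covers one on-demand instance at a hypothetical `on_demand_price`, and that the earliest completion time on each busy resource is already known. The tuple return values and parameter names are illustrative only.

```python
def provision(free_resources, remaining_budget, on_demand_price,
              earliest_finish_by_resource):
    """Decide how to serve one task under the multi-cloud policy.

    Returns a (decision, resource) pair; resource is None when a new
    on-demand instance has to be created.
    """
    if free_resources:
        return ("allocate", free_resources[0])
    if remaining_budget >= on_demand_price:
        return ("create_on_demand", None)
    # No budget left: queue on the resource that finishes the task earliest.
    best = min(earliest_finish_by_resource, key=earliest_finish_by_resource.get)
    return ("queue_on", best)
```

The initial pool of spot instances would populate `free_resources`; how that pool is sized is not specified on the slide.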

77 Results
Effect on makespan under failures
Resource instantiation time

78 Output

79 Summary
Proposed novel fault-tolerant heuristics
Experimented with various fault-tolerant techniques
Extensive literature survey
Resource allocation policies
Bidding strategies
Multi-cloud integration for workflow management systems

80 Future Research Directions
Cloud Failure Characteristics Metrics for fault-tolerance Cloud Pricing Models Multiple tasks on a single instance Workflow specific scheduling Multi-cloud challenges

81 © Copyright The University of Melbourne 2009

