DataSys Laboratory Dr. Ioan Raicu Michael Lang, USRC leader Abhishek Kulkarni, Ph.D student of Indiana University Poster Submission: –Ke Wang, Abhishek.


2 Exploring the Design Tradeoffs for Exascale System Services through Simulation
DataSys Laboratory
Dr. Ioan Raicu
Michael Lang, USRC leader
Abhishek Kulkarni, Ph.D. student at Indiana University
Poster submission:
– Ke Wang, Abhishek Kulkarni, Michael Lang, Ioan Raicu, Andrew Lumsdaine, “Exploring the Design Tradeoffs for Exascale System Services through Simulation”, under review at SC12

3 Introduction & Motivation
Long-Term Aims and Contributions
System Services Taxonomy
Peer-to-Peer System Simulators
Simulating System Services
Related Work
Contributions
Future Work & Conclusion

4 Introduction & Motivation
System Services Taxonomy
Peer-to-Peer System Simulators
Simulating System Services
Related Work
Contributions
Future Work & Conclusion

5 Operating system: a service provider that offers basic services such as program development, access to I/O devices, controlled access to files, system access, and program execution
Generalized distributed system services: many servers coordinate with each other to offer different services to a large number of clients
Typical services: key-value store, job scheduler, file servers, application job launch
Key issues: scalability, dynamicity, resiliency, consistency, fault tolerance

6 Top500 Performance Development, http://top500.org/static/lists/2011/11/TOP500_201111_Poster.pdf
Today (June 18, 2012): 16 petaflops
– O(100K) nodes (100X in the last 10 years)
– O(1M) cores (1000X in the last 10 years)
Near future (~2018): exaflop computing
– ~1M nodes (10X)
– ~1B processor-cores/threads (1000X)

7 Energy and Power
– 7.89 MW (top-ranked supercomputer)
– 20 MW limit
Memory and Storage
– Retain data at high enough capacities
– Access data at high enough rates
– Support the desired computational rate
– Fit within an acceptable power envelope
Concurrency and Locality
– Accelerators, GPUs, MIC
– Programmability
– Minimizing data movement
Resiliency
– MTTF decreases, MPI suffers

8 Lack of detailed decomposition of services
Centralized server with at most a single fail-over, for example Slurm (slurmctld, slurmd)
The scalability of different server topologies (centralized, hierarchical, distributed) is unclear, as are the costs of different resiliency and consistency models

9 Introduction & Motivation
Long-Term Aims and Contributions
System Services Taxonomy
Peer-to-Peer System Simulators
Simulating System Services
Related Work
Contributions
Future Work & Conclusion
SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascales

10 Develop a simulator capable of simulating generic system services at scales of up to 1M nodes
Compare the scalability of different server topologies with and without churn
Measure the costs of different resiliency models (fail-over, replication) for each server topology under different failure rates
Measure the costs of different consistency models (strong/weak consistency) for each server topology

11 – Deconstruct services into their most basic components, and provide a general taxonomy to classify existing system services in terms of server architecture
– Investigate and compare different existing peer-to-peer simulators
– Simulate these service architectures at scale with millions of clients served by thousands of servers
– Estimate basic parameters such as memory consumption analytically, and complex parameters such as client-perceived throughput, server throughput, and overall system efficiency
– Demonstrate how churn affects the performance and efficiency of the system under different distributed service architectures


13 We deconstruct services into their basic blocks to understand the design tradeoffs for exascale system services
The proposed taxonomy is still being defined and modified as we investigate more HPC services
Components
– Service Model: defines overall behavior and constraints
  – Describes high-level functionality, architecture, and roles of entities
  – ACID properties, CAP properties
– Data Model: defines the distribution of persistent data
  – Centralized, or distributed with different levels of replication
  – Replication: partitioned (no replication), mirrored (full replication), overlapped (partial replication)
– Network Model: dictates how the components are connected
  – Structured overlay: rings, binomial, k-ary, radix-trees, complete/binomial graphs
  – Unstructured overlay: random graph
  – Complete membership list (fully connected) vs. partial membership list (binomial graphs)
– Failure Model: how the servers handle failures
  – Complete mirroring, triple modular redundancy
– Consistency Model: depends on the data model and level of replication
  – Strong, weak, or eventual consistency
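The three replication levels in the Data Model can be made concrete with a small sketch. The placement rule used here (a key's home server plus its successors on a ring) is an assumption chosen for illustration; the slides do not say how replicas are placed, and the class, enum, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the taxonomy's replication levels; not taken from the slides.
class ReplicationSketch {
    enum Mode { PARTITIONED, MIRRORED, OVERLAPPED }

    // Returns the server indices that hold a copy of the key.
    // Assumed placement rule: home server = hash mod servers, replicas on ring successors.
    static List<Integer> replicas(int keyHash, int servers, Mode mode, int overlap) {
        int home = Math.floorMod(keyHash, servers);
        int copies;
        switch (mode) {
            case PARTITIONED: copies = 1; break;        // no replication
            case MIRRORED:    copies = servers; break;  // full replication
            default:          copies = Math.min(Math.max(overlap, 1), servers); break; // partial
        }
        List<Integer> holders = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            holders.add((home + i) % servers);
        }
        return holders;
    }
}
```

Under this rule, the memory cost per key grows from one copy (partitioned) to `overlap` copies (overlapped) to one copy per server (mirrored), which is the kind of tradeoff the taxonomy is meant to expose.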


15 Figure 1: Simulation architectures. Left: the centralized architecture, with a single dispatcher connected to all nodes. Right: the homogeneous distributed topology, in which each node has the same number of cores and neighbors. (Diagram labels: Client, Submit tasks, Dispatcher, Arbitrary Node.)

16 Continuous time simulations
– Abandoned the idea of creating a separate thread per simulated node: we found that on our 48-core system with 256GB of memory, we were limited to 32K threads
Discrete event simulations
– The only viable approach (today) to explore scheduling techniques at exascales (millions of nodes and billions of cores)
– Created a unique object per simulated node, and converted any behavior to an event

17 Introduction & Motivation
Long-Term Aims and Contributions
SimMatrix Architecture
Implementation
Evaluation
Related Work
Contributions
Future Work & Conclusion

18 Figure 2: Event State Transition Diagram
All events are inserted into the queue, sorted by occurrence time in ascending order
Handle the first event, advance the simulation time, and update the event queue
Implemented with Java's red-black-tree-based “TreeSet”, which ensures Θ(log n) time for insert and remove
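A minimal sketch of such an event queue follows, assuming a simple event record and a sequence number as tie-breaker. The class and field names are hypothetical and this is not SimMatrix's actual code. The tie-breaker matters because a `TreeSet` treats elements that compare as equal as duplicates, so two events scheduled for the same time would otherwise collapse into one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical names throughout; a sketch of a TreeSet-backed discrete-event loop.
class EventQueueSketch {
    static final class Event {
        final double time;   // simulated occurrence time
        final long seq;      // insertion order, breaks ties between equal times
        final String label;  // stands in for the event's behavior
        Event(double time, long seq, String label) {
            this.time = time;
            this.seq = seq;
            this.label = label;
        }
    }

    // Sorted by occurrence time ascending, then by insertion order.
    private final TreeSet<Event> queue = new TreeSet<>(
            Comparator.<Event>comparingDouble(e -> e.time)
                      .thenComparingLong(e -> e.seq));
    private long nextSeq = 0;
    private double clock = 0.0;

    // Insert an event that occurs `delay` time units after the current simulated time.
    void schedule(double delay, String label) {
        queue.add(new Event(clock + delay, nextSeq++, label));
    }

    // Repeatedly remove the first event and advance the simulation time to it.
    List<String> run() {
        List<String> handled = new ArrayList<>();
        Event e;
        while ((e = queue.pollFirst()) != null) {
            clock = e.time;
            handled.add(e.label); // a real handler would likely schedule follow-up events here
        }
        return handled;
    }

    double currentTime() {
        return clock;
    }
}
```

Scheduling events at delays 5, 1, and 5 and then calling `run()` handles the delay-1 event first and the two delay-5 events in insertion order, leaving the clock at 5.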

19 Node load information
– Nested hash maps provide extremely fast performance at large scales
Dynamic task submission
– Aims to reduce the memory footprint
Dynamic poll interval
– Exponential backoff to reduce the number of messages and increase the speed of simulation
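The dynamic poll interval could look roughly like the sketch below. The doubling factor, the cap, and the reset on success are assumptions made for illustration; the slides only state that exponential backoff is used. All names are hypothetical.

```java
// Hypothetical sketch of an exponentially backed-off poll interval.
class PollBackoffSketch {
    private final double initialSeconds;
    private final double maxSeconds;
    private double currentSeconds;

    PollBackoffSketch(double initialSeconds, double maxSeconds) {
        this.initialSeconds = initialSeconds;
        this.maxSeconds = maxSeconds;
        this.currentSeconds = initialSeconds;
    }

    // Called after a poll that found no work: returns how long to wait
    // before the next poll, then doubles the interval up to the cap.
    double onFailure() {
        double wait = currentSeconds;
        currentSeconds = Math.min(currentSeconds * 2.0, maxSeconds);
        return wait;
    }

    // Called after a poll that found work: resets the interval (assumed policy).
    void onSuccess() {
        currentSeconds = initialSeconds;
    }
}
```

With an initial interval of 1 s and a cap of 8 s, consecutive failed polls wait 1, 2, 4, 8, 8 seconds, so an idle node generates far fewer poll events than it would with a fixed 1 s interval.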

20 SimMatrix is developed in Java
– Sun 64-bit JDK version 1.6.0_22
– 1500 lines of code
– Code accessible at: http://datasys.cs.iit.edu/projects/SimMatrix/index.html
SimMatrix has no other dependencies


22 Fusion system:
– fusion.cs.iit.edu
– 48 AMD Opteron cores at 1.93GHz
– 256GB RAM
– 64-bit Linux kernel 2.6.31.5
– Sun 64-bit JDK version 1.6.0_22

23 Throughput
– Number of tasks finished per second. Calculated as total-number-of-tasks/simulation-time.
Efficiency
– The ratio between the ideal simulation time of completing a given workload and the real simulation time. The ideal simulation time is calculated by taking the average task execution time multiplied by the number of tasks per core.
Load Balancing
– We adopted the coefficient of variation of the number of tasks finished by each node as a measure of load balancing. The smaller the coefficient of variation, the better the load balancing. It is calculated as standard-deviation/average of the number of tasks finished by each node.
Scalability
– Total number of tasks, number of nodes, and number of cores supported.
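The first three metrics translate directly into code. The sketch below follows the definitions on the slide; the method names are hypothetical, and the use of the population standard deviation (dividing by the number of nodes rather than by one less) is an assumption, since the slide does not say which form was used.

```java
// Hypothetical helper illustrating the metric definitions above.
class MetricsSketch {
    // Tasks finished per second: total number of tasks / simulation time.
    static double throughput(long totalTasks, double simulationTimeSec) {
        return totalTasks / simulationTimeSec;
    }

    // Ideal time (average task length x tasks per core) divided by the real simulation time.
    static double efficiency(double avgTaskSec, double tasksPerCore, double realTimeSec) {
        return (avgTaskSec * tasksPerCore) / realTimeSec;
    }

    // Standard deviation / average of the number of tasks finished by each node.
    // Assumption: population standard deviation.
    static double coefficientOfVariation(long[] tasksPerNode) {
        double mean = 0.0;
        for (long t : tasksPerNode) {
            mean += t;
        }
        mean /= tasksPerNode.length;
        double variance = 0.0;
        for (long t : tasksPerNode) {
            variance += (t - mean) * (t - mean);
        }
        variance /= tasksPerNode.length;
        return Math.sqrt(variance) / mean;
    }
}
```

For example, two nodes that finish 8 and 12 tasks have a mean of 10 and a standard deviation of 2, giving a coefficient of variation of 0.2; perfectly balanced nodes give 0.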

24 Synthetic workloads:
– Uniform distributions with different average task lengths, such as 10s (ave_10), 100s (ave_100), 1000s (ave_1000), 5000s (ave_5000), 10000s (ave_10000), and 100000s (ave_100000); also all tasks of 1 sec each (all_1)
Realistic application workloads:
– General MTC workload from a 2008-2009 trace of 173M tasks; average task length 64±486s (mtc_64), using a Gamma distribution
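As a worked example, the Gamma parameters implied by the mtc_64 workload can be derived by moment matching. This assumes that "64±486s" denotes a mean of 64 s and a standard deviation of 486 s, and that the shape-scale parameterization is used; the slides state neither.

```latex
% Moment matching for a Gamma(k, \theta) distribution (shape k, scale \theta):
%   mean \mu = k\theta, \qquad variance \sigma^2 = k\theta^2
\begin{align*}
k      &= \frac{\mu^2}{\sigma^2} = \frac{64^2}{486^2} = \frac{4096}{236196} \approx 0.0173 \\
\theta &= \frac{\sigma^2}{\mu}   = \frac{236196}{64} \approx 3690.6\ \text{s}
\end{align*}
```

A shape parameter this far below 1 describes a heavily skewed workload: most tasks are very short, and a small number of long tasks account for the large standard deviation.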

25 Validate SimMatrix against state-of-the-art MTC systems (e.g. Falkon) to ensure that the simulator can accurately predict the performance of current petascale systems.

26 Fine-grained workloads: 2% → 99.3% efficiency increase
Coarse-grained workloads: 99% → 99.999% efficiency increase

27 Memory consumption
– <13 KB/task
– <200 GB
CPU time
– <90 µs/task
– <260 hours

28 Efficiency 90%+
Coefficient of variation <0.06
Load imbalance of <600 tasks from 10K tasks per node

29 Stealing half of a neighbor’s work is the best strategy!

30 Requires a linear number of neighbors for good performance!

31 An increasing number of neighbors is needed for 90%+ efficiency, with the largest scales requiring a square-root number of neighbors (e.g. 1K neighbors for 1M nodes)!
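The two findings above (steal half, square-root number of neighbors) can be sketched as a single steal step. This is an illustration under stated assumptions, not SimMatrix's algorithm: neighbors are drawn uniformly at random with replacement, the victim is the most loaded of those drawn, and queues are represented only by their lengths. All names are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of one work-stealing step; queue contents are reduced to lengths.
class WorkStealingSketch {
    // Square-root number of neighbors, e.g. 1K neighbors for 1M nodes.
    static int neighborCount(int nodes) {
        return (int) Math.ceil(Math.sqrt(nodes));
    }

    // An idle node (`thief`) samples random neighbors and steals half of the
    // tasks of the most loaded one. Returns the number of tasks stolen.
    static int steal(int[] queueLen, int thief, Random rng) {
        int nodes = queueLen.length;
        int victim = -1;
        for (int i = 0; i < neighborCount(nodes); i++) {
            int candidate = rng.nextInt(nodes); // assumption: sampled with replacement
            if (candidate == thief) {
                continue;
            }
            if (victim < 0 || queueLen[candidate] > queueLen[victim]) {
                victim = candidate;
            }
        }
        if (victim < 0) {
            return 0; // every draw hit the thief itself; try again later
        }
        int stolen = queueLen[victim] / 2; // steal half of the victim's work
        queueLen[victim] -= stolen;
        queueLen[thief] += stolen;
        return stolen;
    }
}
```

Because the neighbors are re-drawn on every call, this corresponds to a dynamic neighbor strategy; a static variant would fix the neighbor set once per node.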

32 The same optimal parameters achieve 90%+ efficiency across many different workloads!

33 Centralized scheduling has a severe bottleneck, especially for fine-grained workloads.
Distributed scheduling scales well; for coarse-grained workloads, there is no obvious upper bound.

34 Figure panels (outcome: neighbor configuration):
– Good load balancing: square root dynamic neighbors
– Starvation: square root static neighbors
– Good load balancing: quarter static neighbors
– Starvation: 2 static neighbors

35 Steady state utilization is ~100% at exascales


37 Real Job Scheduling Systems:
– Condor (University of Wisconsin), Bradley et al, 2012
– PBS (NASA Ames), Corbatto et al, 2012
– LSF Batch (Platform Computing of Toronto), 2011
– Falkon (University of Chicago), Raicu et al, SC07
Job Scheduling System Simulators:
– simJava (University of Edinburgh), Wheeler et al, 2004
– GridSim (University of Melbourne, Australia), Buyya et al, 2010
Load Balancing:
– Neighborhood averaging scheme, Sinha et al, 1993
– Charm++ (UIUC), Zheng et al, 2011
Scalable Work Stealing:
– Dinan et al, SC09
– Blumofe et al, Scheduling multithreaded computations by work stealing, 1994


39 Designed, analyzed, and implemented a discrete-event simulator (SimMatrix) enabling the study of MTC workloads at exascales
Identified work stealing as a viable technique to achieve load balance at exascales
Provided evidence that work stealing is scalable by finding optimal parameters affecting the performance of work stealing
– The number of tasks to steal is half
– A dynamic random neighbors strategy is required
– There must be a square-root number of neighbors


41 Explore work stealing for manycore processors with 1000 cores
Enhance the network topology model to allow complex networks
Insight from SimMatrix will be used to develop MATRIX, a distributed task execution fabric
– MATRIX will employ work stealing for distributed load balancing
– MATRIX will be integrated with other projects, such as Swift (a data-flow parallel programming system) and FusionFS (a distributed file system)

42 Exascale systems bring great opportunities for unraveling significant scientific mysteries
There are significant challenges to achieving exascales, such as concurrency, resilience, I/O and memory, heterogeneity, and energy
MTC requires a highly scalable and distributed task/job management system at large scales
– Distributed scheduling is likely an efficient way to achieve load balancing, leading to high job throughput and system utilization
Work stealing is a scalable method to achieve load balance at exascales given the optimal parameters

43 More information:
– http://datasys.cs.iit.edu/~kewang/
– http://datasys.cs.iit.edu/projects/SimMatrix/
Contact:
– kwang22@hawk.iit.edu
Questions?

