Rassul Ayani 1 Performance of parallel and distributed systems
- What is the purpose of measurement?
  - To evaluate a system (or an architecture)
  - To compare two or more systems
  - To compare different algorithms
- Which metric should be used?
  - Speedup

Rassul Ayani 2 Workload-Driven Evaluation
- Approach
  - Run a workload (trace) and measure the performance of the system
- Traces
  - Real trace
  - Synthetic trace
- Other issues
  - How representative is the workload?

Rassul Ayani 3 Type of systems
- For existing systems:
  - Run the workload and evaluate the performance of the system
  - Problem: Is the workload representative?
- For future systems (an architectural idea):
  - Develop a simulator of the system, run the workload, and evaluate the system
  - Problems: Developing a simulator is difficult and expensive. How do we define system parameters, such as memory access time and communication cost?

Rassul Ayani 4 Measuring Performance
- The performance metric should be the one most important to the end user
- Performance = Work / Time
- Performance improvement due to parallelism:

  Speedup = Time(1) / Time(p)
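
As a minimal sketch (with purely hypothetical timings; nothing below comes from the slides), speedup and the closely related efficiency metric can be computed directly from measured execution times:

```python
# Minimal sketch: speedup and efficiency from measured execution times.
# The timings below are hypothetical, purely for illustration.

def speedup(time_1: float, time_p: float) -> float:
    """Speedup = Time(1) / Time(p)."""
    return time_1 / time_p

def efficiency(time_1: float, time_p: float, p: int) -> float:
    """Fraction of ideal linear speedup achieved on p processors."""
    return speedup(time_1, time_p) / p

t1, tp, p = 120.0, 20.0, 8                          # hypothetical: 120 s serial, 20 s on 8 processors
print(f"speedup = {speedup(t1, tp):.1f}")           # 6.0
print(f"efficiency = {efficiency(t1, tp, p):.2f}")  # 0.75
```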

Rassul Ayani 5 Performance evaluation of a parallel computer
- Speedup(p) = Time(1) / Time(p)
- What is Time(1)?
  1. The parallel program on one processor of the parallel machine?
  2. A sequential algorithm on one processor of the parallel machine?
  3. The "best" sequential program on one processor of the parallel machine?
  4. The "best" sequential program on an agreed-upon standard machine?
- Which one is reasonable?

Rassul Ayani 6 Speedup
- What is Time(p)?
  - The time needed by the parallel machine to run the same workload?
  - Is it fair?
- How does the problem size affect our measurement?

Rassul Ayani 7 Example 1: Our experience
Parallel simulation of a Multistage Interconnection Network (MIN)
- d: number of stages
- n: number of nodes
- n = (d + 1) * 2^d
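
As a quick check of the formula (a sketch; the stage convention is taken from the slide), the node count grows rapidly with the number of stages:

```python
# Node count of a d-stage multistage interconnection network (MIN),
# using the slide's formula n = (d + 1) * 2**d.

def min_nodes(d: int) -> int:
    return (d + 1) * 2**d

for d in range(1, 6):
    print(f"d={d}: n={min_nodes(d)}")   # e.g. d=3 -> 32, d=5 -> 192
```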

Rassul Ayani 8 Speedup of MIN on CM-2
[Figure: speedup vs. problem size]
- Speedup = T(1)/T(p), where
  - T(1) = execution time of the sequential simulator on a Sun SPARC
  - T(p) = execution time of the parallel simulator on a CM-2 with 8K processors

Rassul Ayani 9 Why is problem size important?
- If the problem size is too small:
  - It may be appropriate for a small machine, but not for the parallel machine
  - There is not enough work for the parallel machine
  - Parallelism overheads begin to dominate the benefits on the parallel machine:
    - Load imbalance
    - Communication-to-computation ratio
  - We may even observe slowdowns
  - It doesn't reflect real usage, and is inappropriate for large machines
  - It can exaggerate the benefits of architectural improvements, especially when measured as percentage improvement in performance
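
To make the "overheads dominate" point concrete, here is a toy model (not from the slides): total parallel time is the computation divided by p plus a fixed per-run overhead standing in for synchronization, communication, and load-imbalance costs. Work and overhead values are arbitrary illustrative units.

```python
# Toy model: time(p) = work/p + overhead, where the fixed overhead models
# synchronization, communication, and load-imbalance costs.

def time_p(work: float, p: int, overhead: float = 5.0) -> float:
    return work / p + overhead

for work in (4.0, 1000.0):                     # small vs. large problem
    s = time_p(work, 1, overhead=0.0) / time_p(work, 8)
    print(f"work={work}: speedup on 8 processors = {s:.2f}")
# work=4.0    -> 0.73 (a slowdown: overhead dominates)
# work=1000.0 -> 7.69 (near-linear: overhead is amortized)
```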

Rassul Ayani 10 Size is too large
- The problem may not "fit" in a small machine:
  - It can't run
  - Thrashing to disk
  - The working set doesn't fit in the cache
- It may lead to superlinear speedup
- What is the right size?
- How do we find the right size?

Rassul Ayani 11 Scaling: Example 2
[Figure: small and big equation solvers on an SGI Origin2000 (from Parallel Computer Architecture, Culler & Singh)]

Rassul Ayani 12 Scaling issues
- Important issues
  - Reasonable problem size
  - Scaling problem size
  - Scaling machine size
- Example (developed on the following slides): consider a dispatcher-based cluster and compare three load-balancing algorithms:
  - Round Robin (RR)
  - Least Connections (LC) first
  - Least Loaded (LL) first
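
For concreteness, here is a minimal sketch of the three dispatcher policies (the per-server state is hypothetical; a real dispatcher would track connections itself and receive load reports from the servers):

```python
# Minimal sketch of the three load-balancing policies at the dispatcher.
# connections[] and load[] are hypothetical per-server state.
from itertools import count

class Dispatcher:
    def __init__(self, n_servers: int):
        self.connections = [0] * n_servers   # open connections per server
        self.load = [0.0] * n_servers        # reported load per server
        self._rr = count()                   # round-robin cursor

    def round_robin(self) -> int:
        # RR: cycle through the servers, ignoring their state
        return next(self._rr) % len(self.connections)

    def least_connections(self) -> int:
        # LC: pick the server with the fewest open connections
        return min(range(len(self.connections)), key=self.connections.__getitem__)

    def least_loaded(self) -> int:
        # LL: pick the server reporting the lowest load
        return min(range(len(self.load)), key=self.load.__getitem__)
```

Note the increasing state requirements: RR needs no server state, LC needs connection counts kept at the dispatcher, and LL additionally needs load reports from the servers.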

Rassul Ayani 13 Scaling: example
[Figure: dispatcher-based web server]

Rassul Ayani 14 Determine the problem size

Rassul Ayani 15 Scale problem size, but keep machine size fixed
[Table 3: Performance of a 4-Server Cluster. For each arrival rate (requests/sec), the table lists the average waiting time (ms), average response time (ms), and average utilization under the Baseline, RR, and LC algorithms; the numeric values are not recoverable from the transcript.]

Rassul Ayani 16 Scaling problem size (cont'd)
- Conclusion: for low arrival rates LC is much better than RR, but for high arrival rates both converge to the baseline (BL) algorithm
- Is it a fair conclusion?

Rassul Ayani 17 Scaling problem and machine size
[Table: for each number of servers and arrival rate, the average response time (ms), average waiting time (ms), and average utilization under the Baseline, RR, and LC algorithms; the numeric values are not recoverable from the transcript.]

Rassul Ayani 18 Scaling problem and machine size (cont'd)
- Conclusion: LC is much better than RR
- Is it a fair conclusion?

Rassul Ayani 19 Questions in Scaling
- How should the application be scaled? Look at the web server:
- Scaling machine size, e.g., by adding identical nodes, each bringing memory:
  - Memory size is increased
  - Locality may change
  - Extra work (e.g., overhead for task scheduling) increases
- Scaling problem size may change:
  - Locality
  - Working set size
  - Communication cost

Rassul Ayani 20 Why Scaling?
- Two main reasons for scaling:
  - To increase performance, e.g., the number of transactions per second
    - Of interest to users
  - To utilize resources (processor and memory) more efficiently
    - More interesting for managers
    - More difficult
- Scaling models:
  - Problem constrained (PC)
  - Memory constrained (MC)
  - Time constrained (TC)

Rassul Ayani 21 Problem Constrained Scaling
- Problem size is kept fixed, but the machine is scaled
- Motivation: the user wants to solve the same problem, only faster
- Some examples:
  - Video compression
  - Computer graphics
  - Message routing in a router (or switch)
- Speedup(p) = Time(1) / Time(p)

Rassul Ayani 22 Machine Constrained Scaling
- Problem size is scaled, but the machine (memory) remains fixed
- Motivation: it is good to find the limits of a given machine, e.g., what is the maximum problem size that avoids memory thrashing?
- Performance measurement:
  - The previous definition of speedup, Time(1) / Time(p), is NOT valid
  - New definition: performance improvement = increase in work / increase in time
- How do we measure work?
  - Work can be defined as the number of instructions, operations, or transactions
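
Reading "increase" as the ratio of scaled to baseline values (an assumption; the slide does not spell this out), the metric can be computed as follows. All numbers are hypothetical.

```python
# Sketch of the slide's metric: improvement = increase in work / increase in time,
# with "increase" read as a ratio. Work is any agreed-upon count (instructions,
# operations, transactions); all numbers below are hypothetical.

def mc_improvement(work_1: float, time_1: float,
                   work_p: float, time_p: float) -> float:
    return (work_p / work_1) / (time_p / time_1)

# e.g. 4x the transactions completed in 1.5x the time:
print(mc_improvement(work_1=1e6, time_1=100.0,
                     work_p=4e6, time_p=150.0))   # ~2.67
```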

Rassul Ayani 23 Time Constrained Scaling
- Time is kept fixed as the machine is scaled
- Motivation: the user has a fixed time to use the machine (or to wait for the result, as in real-time systems), but wishes to do more work during this time
- Performance = Work / Time as usual, and time is fixed, so:

  Speedup_TC(p) = Work(p) / Work(1)

- How does Work(1) affect the result?
  - Work(1) must be reasonable to avoid thrashing
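
A worked example (with hypothetical transaction counts): if within the same time budget one processor completes 2,000 transactions and 16 processors complete 26,000, the time-constrained speedup is 13.

```python
# Time-constrained speedup: the time budget is fixed, so we compare work done.

def speedup_tc(work_1: float, work_p: float) -> float:
    return work_p / work_1

print(speedup_tc(2_000, 26_000))   # 13.0
```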

Rassul Ayani 24 Evaluation using Workload
- Must consider three major factors:
  - Workload characteristics
  - Problem size
  - Machine size

Rassul Ayani 25 Impact of Workload
- The workload should adequately represent the domains of interest
- It is easy to mislead with workloads:
  - Choose those with features for which the machine is good; avoid others
- Some features of interest:
  - Working set size and spatial locality
  - Fine-grained or coarse-grained tasks
  - Synchronization patterns
  - Contention and communication patterns
- The workload should have enough work to utilize the processors
  - If load imbalance dominates, there may not be much the machine can do

Rassul Ayani 26 Problem size
- Many critical characteristics depend on problem size:
  - Communication pattern (IPC)
  - Synchronization pattern
  - Load imbalance
- Need to choose problem sizes appropriately
- It is insufficient to use a single problem size

Rassul Ayani 27 Steps in Choosing Problem Sizes
1. Use expert views
   - Experts may know that users care only about a few problem sizes
2. Determine the range of useful sizes
   - Below it, performance is bad or the time distribution across phases is unrealistic
   - Above it, execution time or memory usage is too large
3. Use understanding of inherent characteristics
   - Communication-to-computation ratio, load balance, ...

Rassul Ayani 28 Summary
- Performance improvement due to parallelism is often measured by speedup
- Problem size is important
- Scaling is often needed
- Scaling models are fundamental to proper evaluation
- Time constrained scaling is a realistic method for many applications
- Scaling only the data (problem) size can yield misleading results
- Proper scaling requires understanding the workload