
Benchmarking Parallel Code

What are the performance characteristics of a parallel code? What should be measured?

Experimental Studies
- Write a program implementing the algorithm
- Run the program with inputs of varying size and composition
- Use the "system clock" to get an accurate measure of the actual running time
- Plot the results
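As a purely illustrative sketch of that workflow in C++ (not code from the slides), the fragment below times a placeholder run_algorithm function with std::chrono over a few input sizes and prints (n, time) pairs for plotting; the function, the input sizes, and the input composition are all assumptions.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the algorithm under test.
long long run_algorithm(const std::vector<int>& input) {
    return std::accumulate(input.begin(), input.end(), 0LL);
}

int main() {
    // Inputs of varying size; composition would be varied the same way.
    for (std::size_t n : {1000, 10000, 100000, 1000000}) {
        std::vector<int> input(n, 1);

        auto start = std::chrono::steady_clock::now();
        volatile long long result = run_algorithm(input);  // volatile keeps the call from being optimized away
        auto stop = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(stop - start).count();
        std::printf("n=%zu time=%f s (checksum=%lld)\n", n, seconds, (long long)result);
    }
    return 0;
}
```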

Experimental Dimensions
- Time
- Space
- Number of processors
- InputData(n,_,_,_)
- Results(_,_)

Features of a good experiment?
What are the features of every good experiment?

Features of a good experiment
- Reproducibility
- Quantification of performance
- Exploration of anomalies
- Exploration of design choices
- Capacity to explain deviation from theory

Types of experiments
- Time and speedup
- Scaleup
- Exploration of data variation

Speedup

Time and Speedup
[Plots: running time vs. number of processors, and speedup vs. number of processors, each for fixed data(n,_,_,…); the speedup plot includes a linear-speedup reference line.]
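For reference, the standard definitions behind these plots (they are not spelled out on the slide itself) are:

```latex
\[
  S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}
\]
% T_1: running time of the best sequential program
% T_p: running time of the parallel program on p processors
% S_p: speedup, E_p: efficiency
% Linear (ideal) speedup is the line S_p = p, i.e. E_p = 1.
```

"Relative speedup", discussed on the next slide, replaces T_1 with the parallel code's own running time on one processor.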

Superlinear Speedup
Is this possible?
- In theory: no. (But what is T_1?)
- In practice: yes, e.g. because of cache effects.
"Relative speedup": T_1 = the parallel code on 1 process, without communications.
[Plot: speedup vs. number of processors, fixed data(n,_,_,…), with a linear-speedup reference line.]

How to lie about Speedup?
Cripple the sequential program! This is a *very* common practice: people compare the performance of their parallel program on p processors to its performance on 1 processor, as if this told you something you care about, when in reality their parallel program on one processor runs *much* slower than the best known sequential program does.
Moral: any time anybody shows you a speedup curve, demand to know what algorithm they are using in the numerator.

Sources of Speedup Anomalies
1. Reduced overhead: some operations get cheaper because you've got fewer processes per processor.
2. Increasing cache size: similar to the above; memory latency appears to go down because the total aggregate cache size went up.
3. Latency hiding: if you have multiple processes per processor, you can do something else while waiting for a slow remote operation to complete.
4. Randomization: simultaneous speculative pursuit of several possible paths to a solution.
It should be noted that any time "superlinear" speedup occurs for reasons 3 or 4, the sequential algorithm could (given free context switches) be made to run faster by mimicking the parallel algorithm.

Sizeup
[Plot: running time vs. n, for fixed p and data(_,_,…).]
What happens when n grows?

Scaleup
[Plot: running time vs. p, for a fixed ratio n/p and data(_,_,…).]
What happens when p grows, given a fixed ratio n/p?
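As a brief formal gloss (standard usage, not stated on the slide), scaleup is often reported as a weak-scaling efficiency with the per-processor problem size held fixed:

```latex
\[
  E_{\text{scaled}}(p) = \frac{T(n_0,\, 1)}{T(p\, n_0,\, p)}
\]
% T(n, p): running time on input size n with p processors
% n_0 = n/p: fixed per-processor problem size
% Ideal scaleup keeps E_scaled(p) near 1, i.e. the time vs. p curve stays flat.
```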

Exploration of data variation
Situation dependent. Let's look at an example…

Example of Benchmarking
See /paper.pdf. The excerpts on the following slides are taken from this paper.

Describe the implementation:
"We have implemented our optimized data partitioning method for shared-nothing data cube generation using C++ and the MPI communication library. This implementation evolved from (Chen et al. 2004), the code base for a fast sequential Pipesort (Dehne et al. 2002), and the sequential partial cube method described in (Dehne et al. 2003). Most of the required sequential graph algorithms, as well as data structures like hash tables and graph representations, were drawn from the LEDA library (LEDA, 2001)."

Describe the Machine:
"Our experimental platform consists of a 32-node Beowulf-style cluster with 16 nodes based on 2.0 GHz Intel Xeon processors and 16 more nodes based on 1.7 GHz Intel Xeon processors. Each node was equipped with 1 GB RAM, two 40 GB 7200 RPM IDE disk drives, and an onboard Intel Pro 1000 XT NIC. Each node was running Linux Redhat 7.2 with gcc and MPI/LAM as part of a ROCKS cluster distribution. All nodes were interconnected via a Cisco 6509 GigE switch."

Describe how any tuning parameters were defined:
"Our implementation of Algorithm 1 initially runs a performance test to calculate the key machine-specific cost parameters, t_compute, t_read, t_write and t_network, that drive our optimized dynamic data partitioning method. On our experimental platform these parameters were as follows: t_compute = … microseconds, t_read = … microseconds, t_write = … microseconds. The network parameter, t_network, captures the performance characteristics of the MPI operation MPI_ALL_TO_ALL_v on a fixed amount of data relative to the number of processors involved in the communication. On our experimental platform, t_network = …, …, …, …, …, and … microseconds for p = 2, 4, 8, 16, 24, and 32, respectively."
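The performance-test code itself is not shown; purely as an illustration, a t_network-style parameter for MPI_Alltoallv could be measured along the lines of the sketch below. The payload size, repetition count, and the choice to report the slowest process are assumptions, not the authors' actual procedure.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int total_ints = 1 << 20;          // fixed total payload per process (arbitrary choice)
    const int per_dest   = total_ints / p;   // spread evenly over all destinations
    const int reps       = 10;

    std::vector<int> sendbuf(total_ints, rank), recvbuf(total_ints);
    std::vector<int> counts(p, per_dest), displs(p);
    for (int i = 0; i < p; ++i) displs[i] = i * per_dest;

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int r = 0; r < reps; ++r) {
        MPI_Alltoallv(sendbuf.data(), counts.data(), displs.data(), MPI_INT,
                      recvbuf.data(), counts.data(), displs.data(), MPI_INT,
                      MPI_COMM_WORLD);
    }
    double local = (MPI_Wtime() - start) / reps;

    // Report the slowest process's average time as the exchange cost for this p.
    double t_network;
    MPI_Reduce(&local, &t_network, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("p=%d  t_network ~ %.3f ms per exchange\n", p, t_network * 1e3);

    MPI_Finalize();
    return 0;
}
```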

Describe the timing methodology:
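As an illustration of one common MPI timing methodology (synchronize with a barrier, time the region with MPI_Wtime, report the maximum across processes), a minimal sketch might look like this; parallel_phase is a hypothetical placeholder for the measured code.

```cpp
#include <mpi.h>
#include <cstdio>

// Hypothetical placeholder for the parallel phase being measured.
void parallel_phase() { /* ... work ... */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);       // start all processes together
    double start = MPI_Wtime();
    parallel_phase();
    double local = MPI_Wtime() - start;

    // Wall-clock time of the run = time of the slowest process.
    double elapsed;
    MPI_Reduce(&local, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("elapsed: %f s\n", elapsed);

    MPI_Finalize();
    return 0;
}
```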

Describe Each Experiment

Analyze the results of each experiment

Look at Speedup

Look at Sizeup

Consider Scaleup

Consider Application-Specific Parameters

Typical Outline of a Parallel Computing Paper
- Intro & Motivation
- Description of the Problem
- Description of the Proposed Algorithm
- Analysis of the Proposed Algorithm
- Description of the Implementation
- Performance Evaluation
- Conclusion & Future Work