The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: a compute-intensive application
Results and discussion
Proprietary & Confidential
The commodity computing environment has undergone a paradigm shift over the last seven years: power-hungry frequency scaling has been replaced by more efficient scaling in the number of computing cores. There is "no free lunch" for software applications.
[Chart: CPU clock frequency over time, from ~5 MHz through ~60 MHz to ~3.4 GHz]
"Embarrassingly parallel" class of problems: independent test cases (TestCase 1, TestCase 2, TestCase 3, TestCase 4, ...) are each dispatched to a separate instance of Tool.exe, each instance running on its own core (ALU + cache) over shared memory.
- Suitable for applications with low memory requirements
- What about memory-intensive applications, e.g. large SPICE runs and extraction?
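A minimal sketch of this test-case-level parallelism, assuming a hypothetical run_testcase worker; threads stand in here for the separate Tool.exe processes so the sketch runs anywhere, but in practice each case would be a separate process on its own core:

```python
from concurrent.futures import ThreadPoolExecutor

# run_testcase is a hypothetical stand-in for launching one Tool.exe
# instance on one test case; the cases are fully independent, so they
# can be dispatched concurrently with no coordination between them.
def run_testcase(case_id):
    return case_id, sum(i * i for i in range(1000))  # dummy per-case work

cases = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_testcase, cases))
```

This pattern gives near-linear scaling precisely because no memory or data is shared between cases, which is why it breaks down for the memory-intensive applications mentioned above.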
Approaches to parallelization:
- Multithreading
- MPI distribution
- CPU-GPU (FPGA)
Algorithm in which 100% can be efficiently parallelized:

Compute Platform     Time        Utility Computing Cost
1 Compute Unit       100 hours   $x
100 Compute Units    1 hour      $x

Algorithm in which only 50% can be efficiently parallelized:

Compute Platform     Time        Utility Computing Cost
1 Compute Unit       100 hours   $x
100 Compute Units    50 hours    $50x
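The 50%-parallelizable row can be checked with a small calculation (assumed model: the serial half runs at single-unit speed, the parallel half divides across the units, and utility cost is units billed times elapsed hours; the slide's 50 hours and $50x are rounded):

```python
def elapsed_hours(total_hours, parallel_fraction, units):
    # Serial part runs on one unit; parallel part divides across all units.
    serial = total_hours * (1.0 - parallel_fraction)
    parallel = total_hours * parallel_fraction / units
    return serial + parallel

t1 = elapsed_hours(100, 0.5, 1)      # baseline: 100 hours on one unit
t100 = elapsed_hours(100, 0.5, 100)  # 50.5 hours on 100 units (~50 hours)
cost_ratio = (100 * t100) / (1 * t1) # ~50x the single-unit bill
```

So with half the algorithm serial, 100 units buy only a ~2x time reduction while the utility bill grows ~50x.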
The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: a compute-intensive application
Results and discussion
P = fraction of the algorithm that can be parallelized; (1 - P) = serial content of the algorithm.
Parallel performance of an algorithm is severely affected by any serial content (Amdahl's law):

Scaling = 1 / ((1 - P) + P/N), where N is the number of processors.
Speed-up = 1 / ((1 - P) + P/N)
Cost Multiplier = N x ((1 - P) + P/N) = N(1 - P) + P

Power of perfect scaling: commodity computing cost does not increase even when solving 100x faster.
[Plots: speed-up and cost multiplier vs. number of processors N, for P = 0.9, P = 0.99, P = 0.999, and the ideal P = 1]
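A sketch evaluating both curves for the P values on the slide, assuming the Amdahl's-law formulas above:

```python
def speedup(p, n):
    """Amdahl's-law speed-up on n processors for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def cost_multiplier(p, n):
    """Relative utility-computing cost: n units billed for the scaled runtime."""
    return n * ((1.0 - p) + p / n)

# Even a 1% serial fraction caps the 100-processor speed-up near 50x,
# while perfect scaling (p = 1) keeps the cost multiplier at exactly 1.
for p in (0.9, 0.99, 0.999, 1.0):
    print(p, round(speedup(p, 100), 1), round(cost_multiplier(p, 100), 1))
```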
The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: a compute-intensive application
Results and discussion
Large-scale package-board analysis:
- Signal integrity
- Power integrity
- EMI
- The surface is discretized into patches, with a pulse basis function on each patch
- The Method of Moments then produces a dense interaction matrix
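An illustrative sketch of such an assembly (not the tool's actual formulation): pulse basis functions on flat patches, a simplified 1/r static kernel between patch centroids, and a hypothetical finite self-term on the diagonal:

```python
import numpy as np

def assemble_dense_mom(centroids, areas):
    # Every basis function interacts with every other one, so the
    # resulting matrix is dense: O(N^2) storage and setup.
    n = len(centroids)
    z = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                z[i, j] = 2.0 * np.sqrt(np.pi / areas[i])  # crude self-term
            else:
                z[i, j] = 1.0 / np.linalg.norm(centroids[i] - centroids[j])
    return z

# A 10x10 grid of unit patches: 100 basis functions -> 100x100 dense matrix.
pts = np.array([[i, j, 0.0] for i in range(10) for j in range(10)], float)
Z = assemble_dense_mom(pts, np.ones(len(pts)))
```

The dense N x N structure is exactly what makes conventional BEM blow up in time and memory for large N, motivating the fast schemes below.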
Scheme             Setup     Solve     Time         Memory
Conventional BEM   O(N^2)    O(N^3)    2 days       360 GB
Fast BEM           O(N)                20 minutes   4 GB

N = number of basis functions (here 150,000). Large problems: N ~ 1 million.
Analogy with the gravitational N-body problem in astronomy: distant sources are grouped into equivalent interaction shells, giving a multi-level algorithm. [Image courtesy ParticleMan]
The multi-level algorithm traverses the tree in three phases: up the tree (Q2M at the finest level, then M2M aggregation level by level), across the tree (M2L translations), and down the tree (L2L disaggregation from level to level, then L2P at the finest level). This inherently sequential traversal impedes effective parallelization.
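The ordering constraint can be sketched structurally (operator details are placeholders; only the level-by-level dependencies that serialize the traversal are shown):

```python
def traverse(levels):
    """levels[0] is the finest level; each level is a list of box ids."""
    log = []
    # Up the tree: charges-to-multipoles at the finest level, then aggregate.
    log += [("Q2M", b) for b in levels[0]]
    for lvl in range(1, len(levels)):
        log += [("M2M", b) for b in levels[lvl]]
    # Across the tree: multipole-to-local translations at every level.
    for lvl in range(len(levels)):
        log += [("M2L", b) for b in levels[lvl]]
    # Down the tree: disaggregate, then locals-to-potentials at finest level.
    for lvl in range(len(levels) - 2, -1, -1):
        log += [("L2L", b) for b in levels[lvl]]
    log += [("L2P", b) for b in levels[0]]
    return log

# Two-level toy tree: four finest-level boxes under one coarse box.
ops = traverse([[0, 1, 2, 3], [0]])
```

Boxes within one phase and level are independent, but each phase must wait for the previous one, which is the serial content the framework has to work around.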
A dense MoM sub-matrix Z_sub (m x n) is compressed by SVD/QR decomposition into factors Q (m x r) and R (r x n), where the rank r << m, n. Saving: storage drops from m*n entries to r*(m + n).
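A sketch of this compression via truncated SVD (one way to obtain the Q and R factors; a rank-revealing QR would serve equally). The 1/r kernel between two well-separated point clusters is numerically low-rank:

```python
import numpy as np

rng = np.random.default_rng(0)
src = rng.random((200, 3))            # source cluster near the origin
obs = rng.random((150, 3)) + 10.0     # well-separated observer cluster
Z = 1.0 / np.linalg.norm(obs[:, None, :] - src[None, :, :], axis=2)

u, s, vt = np.linalg.svd(Z, full_matrices=False)
r = int(np.sum(s > 1e-8 * s[0]))      # numerical rank at 1e-8 tolerance
Q, R = u[:, :r] * s[:r], vt[:r]       # m x r and r x n factors

err = np.linalg.norm(Z - Q @ R) / np.linalg.norm(Z)
saving = (Q.size + R.size) / Z.size   # stored entries vs. dense block
```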
- Pre-determined matrix structure
- Sub-matrix ranks pre-estimated to reasonable accuracy
- Predetermined matrix structure given the meshed geometry
- Static load balancing, with correction by dynamic scheduling
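A sketch of the static side, assuming a hypothetical cost model in which a block's work scales as r*(m + n) from its pre-estimated rank; a runtime dynamic scheduler would then correct any residual imbalance:

```python
import heapq

def static_schedule(block_costs, n_workers):
    # Greedy longest-processing-time assignment: largest blocks first,
    # each to the currently least-loaded worker.
    heap = [(0.0, w) for w in range(n_workers)]  # (load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for b, cost in sorted(enumerate(block_costs), key=lambda x: -x[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(b)
        heapq.heappush(heap, (load + cost, w))
    return assignment

# Hypothetical blocks given as (rank, m, n) triples.
costs = [r * (m + n) for r, m, n in
         [(8, 100, 90), (4, 50, 60), (12, 200, 180), (6, 80, 70)]]
plan = static_schedule(costs, 2)
```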
Hybrid MPI + OpenMP flow (Start -> Read Geometry, Mesh -> Setup -> Solve):
- MPI processes exchange data; each hosts an OpenMP master thread and its worker threads
- All nodes read the geometry and mesh
- The combined DRAM of the nodes stores the whole matrix, increasing capacity
- Solve: vectors are broadcast for matrix-vector operations
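The broadcast-then-gather matrix-vector product can be sketched as follows; MPI calls are modeled with plain Python so the sketch runs without an MPI launcher:

```python
import numpy as np

def distributed_matvec(row_blocks, x):
    # "Broadcast" x to every rank, let each rank multiply its block of
    # rows, then "gather" the row slices back into the full result.
    local = [block @ x for block in row_blocks]   # per-rank work
    return np.concatenate(local)                  # gather of row slices

rng = np.random.default_rng(1)
A = rng.random((8, 8))
x = rng.random(8)
blocks = np.array_split(A, 4)                     # 4 simulated ranks
y = distributed_matvec(blocks, x)
```

Only the vector (length N) crosses the network per product, while the matrix blocks stay resident in each node's DRAM, which is what lets the combined memory hold problems no single node could.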
- Multiple cores share node memory to store and compute the matrix for a single frequency
- Matrices for different frequencies reside in the memory of separate nodes
- This exposes an embarrassingly parallel level without wasting shared memory
Several layers of parallelism, including embarrassingly parallel layers, are employed in a hybrid framework to work around Amdahl's law and achieve high scalability. A global controller decides the number and type of compute instances to be used at each level. File read/write for the SGE-based parallel layers is performed in binary.
The need for parallelization
Challenges towards effective parallelization
A multilevel parallelization framework for BEM: a compute-intensive application
Results and discussion
Predetermined matrix structure and static-dynamic load balancing achieve close-to-ideal speed-ups. The lack of dynamic scheduling across nodes limits the MPI speed-up.
For a large number of right-hand sides, SGE-based RHS distribution yields better scalability. A map-reduce step after every matrix-vector product reduces parallel efficiency.
Compute Platform                Time Taken    Speed-up   Effective Compute Cost
2-core machine (m1.large)       100 minutes   1          $1
8-core machines (c1.xlarge)     2 minutes     50         $0.40

- System LSI package and an early design board, merged
- Meshed with 20,000 basis functions
- Broadband frequency simulation for EMI prediction
- The future of compute- and memory-intensive algorithms is "parallel"
- Revisit algorithms from the perspective of multi-processor efficiency, as opposed to single-processor complexity
- Even a small serial content can adversely affect the speed-up of an algorithm
- Employ multiple levels of parallelism to work around Amdahl's law
- The presented algorithm achieves around 6.5x speed-up on a single 8-core machine and around 50x speed-up in a cloud framework using 72 (18x4) cores