ECE 669 Parallel Computer Architecture, Lecture 23: Parallel Compilation (April 29, 2004)

Parallel Compilation
°Two approaches to compilation
 -Parallelize a program manually: sequential code converted to parallel code by the programmer
°Develop a parallel compiler
 -Intermediate form
 -Partitioning: block based or loop based
 -Placement
 -Routing

Compilation technologies for parallel machines
Assumptions:
 -Input: parallel program
 -Output: "coarse" parallel program and directives for:
  which threads run in one task, where tasks are placed, where data is placed, and which data elements go in each data chunk
Limitation: no special optimizations for synchronization; synchronization memory references are treated like any other communication.

Toy example
°Loop parallelization
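As a minimal sketch of the toy loop-parallelization idea (written in Python for illustration; the function names are hypothetical, not from the lecture): a loop whose iterations have no cross-iteration dependences can be block-distributed over processors.

```python
# Toy loop parallelization: the sequential loop
#   for i in range(N): c[i] = a[i] + b[i]
# has no cross-iteration dependences, so its iterations can be
# divided into contiguous blocks, one per processor.

def split_iterations(n, n_procs):
    """Block-distribute iterations 0..n-1 over n_procs processors."""
    block = (n + n_procs - 1) // n_procs  # ceiling division
    return [range(p * block, min((p + 1) * block, n)) for p in range(n_procs)]

def parallel_add(a, b, n_procs):
    """Each 'processor' executes its own block of the loop body."""
    c = [0] * len(a)
    for my_iters in split_iterations(len(a), n_procs):
        for i in my_iters:            # the blocks run concurrently on a real machine
            c[i] = a[i] + b[i]
    return c
```

Because each block touches a disjoint slice of `c`, no synchronization is needed inside the loop body.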

Example
°Matrix multiply: typically c[i][j] = Σ_k a[i][k] * b[k][j]
°Looking to find parallelism...
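The parallelism the slide is looking for can be sketched as follows (a Python illustration with hypothetical names, not the lecture's code): every output element of the product depends only on one row of A and one column of B, so rows of C can be assigned to processors independently.

```python
# Matrix multiply C = A * B, with c[i][j] = sum_k a[i][k] * b[k][j].
# Each output element is independent of the others, so the (i, j)
# iteration space is fully parallel; here rows of C are assigned
# to processors cyclically.

def matmul_rows_for(proc, n_procs, n):
    """Rows of C that a given processor computes (cyclic assignment)."""
    return [i for i in range(n) if i % n_procs == proc]

def matmul(a, b, n_procs=4):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for proc in range(n_procs):                 # conceptually concurrent
        for i in matmul_rows_for(proc, n_procs, n):
            for j in range(p):
                c[i][j] = sum(a[i][k] * b[k][j] for k in range(m))
    return c
```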

Choosing a program representation...
°Dataflow graph
 -No notion of storage: a problem
 -Data values flow along arcs
 -Nodes represent operations
[Figure: dataflow graph of "+" nodes, each combining inputs a_i and b_i to produce c_i]

Compiler representation
°For certain kinds of structured programs
°Unstructured programs
[Figure: unstructured programs shown as tasks (Task A, Task B) sharing data X; structured programs as loop nests over data arrays, annotated with index expressions and communication weights]

Process reference graph
 -Nodes represent threads (processes), i.e., computation
 -Edges represent communication (memory references)
 -Weights can be attached to edges to represent the volume of communication
 -Extension: precedence-relation edges can be added too
 -Multiple loop-produced threads can also be represented as one node
[Figure: processes P0, P1, P2, ..., P5 all referencing a monolithic memory; not very useful]

Process communication graph
 -Allocate data items to nodes as well
 -Nodes: threads, data objects
 -Edges: communication
 -Key: works for shared-memory, object-oriented, and dataflow (message-passing) systems!
[Figure: processes P0, P1, P2, ..., P5 with their data objects (A0, B0, C0, ...) and communication edges d0, d1, ..., d5]
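A minimal data-structure sketch of such a process communication graph (Python, with hypothetical class and method names; the lecture does not prescribe an implementation):

```python
# A process communication graph (PCG): nodes are threads or data
# objects; a weighted, undirected edge records the volume of
# communication (memory references) between its endpoints.

class PCG:
    def __init__(self):
        self.nodes = {}           # name -> kind ("thread" | "data")
        self.weight = {}          # frozenset({u, v}) -> communication volume

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_comm(self, u, v, volume):
        key = frozenset((u, v))   # undirected: {u, v} == {v, u}
        self.weight[key] = self.weight.get(key, 0) + volume

    def comm_volume(self, u, v):
        return self.weight.get(frozenset((u, v)), 0)

pcg = PCG()
pcg.add_node("P0", "thread")
pcg.add_node("A0", "data")
pcg.add_comm("P0", "A0", 100)     # P0 makes 100 references to A0
```

Because the edge key is a set, the representation is the same whether the reference is a shared-memory load or a message, which is the portability the slide highlights.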

PCG for Jacobi relaxation
[Figure: fine PCG and coarse PCG for Jacobi; the legend distinguishes computation nodes from data nodes]
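For reference, the computation behind this PCG can be sketched in one dimension (a Python illustration; the lecture's Jacobi example is not spelled out in code): every point is replaced by the average of its neighbors, and because a sweep reads only the old array, all updates are independent and partition cleanly into fine PCG nodes.

```python
# One sweep of 1-D Jacobi relaxation: each interior point becomes
# the average of its two neighbors.  All updates read the old array
# and write a new one, so a sweep has no cross-iteration
# dependences and maps naturally onto a fine PCG.

def jacobi_sweep(u):
    new = u[:]                       # boundary values are kept
    for i in range(1, len(u) - 1):   # each i can go to a different processor
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new
```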

Compilation with PCGs
Fine process communication graph -> Partitioning -> Coarse process communication graph

Compilation with PCGs
Fine process communication graph -> Partitioning -> Coarse process communication graph -> Placement -> ... other phases, scheduling. Dynamic?
MP: Partitioning -> Coarse process communication graph
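One simple way the partitioning phase can coarsen a fine PCG is greedy edge merging, sketched here in Python (this is an illustrative heuristic under stated assumptions, not the lecture's algorithm; all names are hypothetical):

```python
# Greedy coarsening of a fine PCG into a coarse PCG: repeatedly
# merge the pair of partitions joined by the heaviest remaining
# communication edge until the target task count is reached.
# Internalized edges become local references instead of network traffic.

def coarsen(edges, nodes, n_tasks):
    """edges: {(u, v): weight}.  Returns a node -> task-representative map."""
    parent = {v: v for v in nodes}          # each node starts as its own task

    def find(v):                            # follow merge links to the root
        while parent[v] != v:
            v = parent[v]
        return v

    while len({find(v) for v in nodes}) > n_tasks:
        cand = [(w, u, v) for (u, v), w in edges.items() if find(u) != find(v)]
        if not cand:                        # nothing left to merge
            break
        _, u, v = max(cand)                 # heaviest cross-partition edge
        parent[find(u)] = find(v)           # merge the two partitions
    return {v: find(v) for v in nodes}

# P0-P1 and P2-P3 communicate heavily; the light P1-P2 edge is cut.
parts = coarsen({("P0", "P1"): 10, ("P1", "P2"): 1, ("P2", "P3"): 10},
                ["P0", "P1", "P2", "P3"], 2)
```

Production partitioners balance task sizes as well as cut weight; this sketch shows only the communication-minimizing side.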

Parallel Compilation
°Consider loop partitioning
°Create small local compilation
°Consider static routing between tiles
°Short neighbor-to-neighbor communication
°Compiler orchestrated

Flow Compilation
°Modulo unrolling
°Partitioning
°Scheduling

Modulo Unrolling – Smart Memory
°Loop unrolling relies on dependencies
°Allow maximum parallelism
°Minimize communication
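A sketch of the modulo-unrolling idea (Python illustration under the assumption of low-order interleaving; the constants and function names are hypothetical): when an array is interleaved across N memory banks, unrolling a unit-stride loop by N makes the bank of every reference a compile-time constant, so each copy of the body can be placed on the tile that owns its bank.

```python
# Modulo unrolling, sketched from first principles: with array a[]
# low-order interleaved across N_BANKS memories, element a[i] lives
# in bank i % N_BANKS.  Unrolling the loop by N_BANKS fixes the
# bank of each reference position across all unrolled iterations.

N_BANKS = 4   # assumed number of memory banks

def bank_of(i):
    return i % N_BANKS

def banks_touched_by_unrolled_body(start):
    """Banks referenced by the N_BANKS copies of a[i] in one unrolled iteration."""
    return [bank_of(start + u) for u in range(N_BANKS)]
```

The invariant the transformation exploits: the list of banks is the same for every unrolled iteration (start = 0, N_BANKS, 2*N_BANKS, ...), so no run-time bank disambiguation is needed.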

Array Partitioning – Smart Memory
°Assign each line to a separate memory
°Consider exchange of data
°Approach is scalable
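The line-to-memory assignment and the resulting data exchange can be sketched as follows (Python illustration with hypothetical names; block partitioning is assumed): a stencil that reads rows i-1 and i+1 only exchanges boundary rows with adjacent memories, which is why the approach scales.

```python
# Row-wise array partitioning: contiguous groups of rows share a
# memory.  A Jacobi-style stencil touching rows i-1 and i+1 then
# needs only the boundary rows of the two neighboring memories.

def home_memory(row, n_rows, n_memories):
    """Memory that owns a given row under block partitioning."""
    rows_per_mem = (n_rows + n_memories - 1) // n_memories
    return row // rows_per_mem

def remote_rows(my_mem, n_rows, n_memories):
    """Rows a stencil on my_mem's rows must fetch from other memories."""
    mine = [r for r in range(n_rows)
            if home_memory(r, n_rows, n_memories) == my_mem]
    needed = {r - 1 for r in mine} | {r + 1 for r in mine}
    return sorted(r for r in needed
                  if 0 <= r < n_rows
                  and home_memory(r, n_rows, n_memories) != my_mem)
```

Note that the remote set stays at two rows per memory no matter how large the array grows.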

Communication Scheduling – Smart Memory
°Determine where data should be sent
°Determine when data should be sent

Speedup for Jacobi – Smart Memory
°Virtual wires indicate scheduled paths
°Hard wires are dedicated paths
°Hard wires require more wiring resources
°RAW is a parallel processor from MIT

Partitioning
°Use heuristics for unstructured programs
°For structured programs... start from:
 -List of arrays (A, B, C)
 -List of loop nests (L0, L1, L2)

Notion of iteration space, data space
°E.g., a matrix A
[Figure: data space of A and the (i, j) iteration space; each point of the iteration space represents a "thread" with a given value of (i, j)]

Notion of iteration space, data space
°Partitioning: how to "tile" the iteration and data spaces for MIMD machines?
[Figure: as above, with an arrow showing which "+" computation a given (i, j) thread affects]
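The tiling question above can be made concrete with a small sketch (Python illustration; tile shapes and function names are hypothetical): the (i, j) iteration points are grouped into rectangles, and each rectangle becomes one task on a MIMD machine.

```python
# Tiling a 2-D iteration space: iterations are grouped into
# tile_h x tile_w rectangles, and each tile becomes one task.
# With matching data-space tiles, most references stay local to
# the processor that owns the tile.

def tile_of(i, j, tile_h, tile_w):
    """Which tile (task) owns iteration (i, j)."""
    return (i // tile_h, j // tile_w)

def iterations_in_tile(ti, tj, tile_h, tile_w):
    """All (i, j) points belonging to tile (ti, tj)."""
    return [(i, j)
            for i in range(ti * tile_h, (ti + 1) * tile_h)
            for j in range(tj * tile_w, (tj + 1) * tile_w)]
```

Choosing tile_h and tile_w trades off the number of tasks against the length of each tile's boundary, which is where the communication happens.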

Loop partitioning for caches
°Machine model
 -Assume all data is in memory
 -Minimize first-time cache fetches
 -Ignore secondary effects such as invalidations due to writes
[Figure: processors P, each with a cache, connected through a network to memory holding array A]
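Under this machine model, the cost of a candidate partition is just the number of distinct cache lines it touches for the first time. A minimal Python sketch of that cost function (the line size is an assumption, and the helper name is hypothetical):

```python
# Cost model from the machine model above: all data starts in
# memory, so a partition's cost is the number of distinct cache
# lines it fetches for the first time; capacity misses and write
# invalidations are ignored.

LINE_WORDS = 4   # assumed words per cache line

def first_time_fetches(addresses):
    """Distinct cache lines fetched by a sequence of word addresses."""
    return len({addr // LINE_WORDS for addr in addresses})
```

For example, a block of iterations reading eight consecutive words costs two line fetches, while a partition reading the same number of words at stride 4 costs one fetch per word, which is why contiguous loop partitions are preferred under this model.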

Summary
°Parallel compilation often targets block-based and loop-based parallelism
°Compilation steps address identification of parallelism and representations
 -Graphs are often useful to represent program dependencies
°For static scheduling, both computation and communication can be represented
°Data positioning is important for computation