Pipelined and Parallel Computing: Partition, Part 2. Hongtao Du, AICIP Research, Dec 1, 2005.

Presentation transcript:

Slide 1: Pipelined and Parallel Computing: Partition, Part 2. Hongtao Du, AICIP Research, Dec 1, 2005.

Slide 2: Partition Scheme

Slide 3: Driving Force
Data-driven
–How to divide data sets into different sizes for multiple computing resources.
–How to coordinate data flows along different directions so that the appropriate data reach the suitable resources at the right time.
Function-driven
–How to perform different functions of one task on different computing resources at the same time.

Slide 4: Data - Flynn's Taxonomy
Single Instruction Flow, Single Data Stream (SISD)
Multiple Instruction Flow, Single Data Stream (MISD)
Single Instruction Flow, Multiple Data Stream (SIMD)
–MPI, PVM
Multiple Instruction Flow, Multiple Data Stream (MIMD)
–Shared memory
–Distributed memory

Slide 5: Data Partitioning Schemes
[Figure: four schemes illustrated - block, scatter, contiguous point, contiguous row.]
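A minimal sketch of two of these schemes, showing how rows of an n x n grid are assigned to p processors; the function names and the 8-row, 4-processor example are illustrative assumptions, not taken from the slides.

```python
# Illustrative sketch of block vs. scatter partitioning of n rows over
# p processors. Names and the example sizes are assumptions.

def block_partition(n, p):
    """Assign contiguous chunks of whole rows to each processor."""
    rows_per_proc = n // p
    return {proc: list(range(proc * rows_per_proc, (proc + 1) * rows_per_proc))
            for proc in range(p)}

def scatter_partition(n, p):
    """Assign rows cyclically: row i goes to processor i mod p."""
    return {proc: [i for i in range(n) if i % p == proc] for proc in range(p)}

if __name__ == "__main__":
    n, p = 8, 4
    print("block:  ", block_partition(n, p))   # {0: [0, 1], 1: [2, 3], ...}
    print("scatter:", scatter_partition(n, p)) # {0: [0, 4], 1: [1, 5], ...}
```

Contiguous-point and contiguous-row partitioning follow the same pattern, differing only in whether individual elements or whole rows are the unit of assignment.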

Slide 6: Communication Patterns and Costs
Communication expense is the first concern in data-driven partition.
Successor/Predecessor (S-P) pattern
North/South/East/West (NSEW) pattern
Cost parameters: t_s, the message preparation latency; B, the transmission speed (bytes/s); p, the number of processors; n, the number of data items; l, the length of each data item to be transmitted. (The slide's cost formulas themselves were not preserved; a sketch of a standard model follows.)
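Using the parameters above, a hedged sketch of per-processor communication cost under the two patterns. The linear model t_s + n*l/B per message and the neighbor counts (two for an S-P pipeline, four for a NSEW mesh) are assumptions consistent with standard treatments, not formulas recovered from the slide.

```python
# Linear communication-cost model: each message costs t_s (preparation
# latency) plus n*l/B (payload bytes over transmission speed B).
# Neighbor counts are assumptions: S-P exchanges with 2 neighbors,
# NSEW with 4.

def message_cost(t_s, B, n, l):
    """Cost in seconds of sending n data items of l bytes each."""
    return t_s + (n * l) / B

def pattern_cost(neighbors, t_s, B, n, l):
    """Total cost for one processor exchanging with all its neighbors."""
    return neighbors * message_cost(t_s, B, n, l)

if __name__ == "__main__":
    t_s, B = 1e-4, 100e6    # 100 us startup, 100 MB/s link (made-up values)
    n, l = 1024, 4          # 1024 items of 4 bytes each
    print("S-P  :", pattern_cost(2, t_s, B, n, l))
    print("NSEW :", pattern_cost(4, t_s, B, n, l))
```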

Slide 7: Understanding Data-driven
The arrivals of data initiate and synchronize operations in the system.
The whole system in execution is modeled as a network linked by data streams.
Granularity of the algorithm: the size of the data blocks transmitted between processors. The flows of data blocks form data streams.
Granularity selection is a trade-off between computation and communication (a toy model follows below):
–Large: reduces the degree of parallelism; increases computation time; little overlap between processors.
–Small: increases the degree of overlap; increases communication and overhead time.
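The toy model below makes the trade-off concrete: coarse blocks amortize message startup but cap the number of busy processors, while fine blocks do the reverse. All constants and the cost function are illustrative assumptions, not figures from the slides.

```python
# Toy granularity model: N items split into blocks of size g. Each block
# incurs one message (startup t_s plus payload), while large g limits how
# many processors can work concurrently. All constants are illustrative.

def total_time(N, g, t_s=1e-4, B=100e6, item_bytes=4, flops_per_item=100,
               flops_per_sec=1e9, max_procs=64):
    blocks = N // g
    procs = min(blocks, max_procs)   # coarse grain limits parallelism
    comp = (N * flops_per_item / flops_per_sec) / procs
    comm = blocks * (t_s + g * item_bytes / B)
    return comp + comm

if __name__ == "__main__":
    N = 1_000_000
    for g in (100, 1_000, 10_000, 100_000):
        print(f"block size {g:>7}: {total_time(N, g):.4f} s")
```

Running it shows communication dominating at small g and lost parallelism at large g, with a minimum in between, which is exactly the selection problem the slide describes.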

Slide 8: Data Dependency
Decreases or even eliminates the speedup.
Caused by edge pixels falling on different blocks (a halo-exchange sketch follows).
[Figure: block and reverse-diagonal partitions.]
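A common remedy for this edge-pixel dependency is to exchange a border of "halo" (ghost) pixels between neighboring blocks before computing, so a stencil can be applied at block edges without cross-block requests mid-computation. The slides do not name this technique, so the sketch below is an assumption about standard practice.

```python
import numpy as np

# Sketch of a 1-pixel halo exchange between two vertically adjacent
# blocks, enabling a 3x3 stencil at block edges. Standard practice,
# not taken from the slides.

def add_halo(block, above, below):
    """Pad a block with its neighbors' edge rows (zeros at image boundaries)."""
    top = above[-1:] if above is not None else np.zeros((1, block.shape[1]))
    bottom = below[:1] if below is not None else np.zeros((1, block.shape[1]))
    return np.vstack([top, block, bottom])

if __name__ == "__main__":
    image = np.arange(64, dtype=float).reshape(8, 8)
    upper, lower = image[:4], image[4:]           # block partition
    padded_upper = add_halo(upper, None, lower)   # borrows lower's top row
    print(padded_upper.shape)                     # (6, 8)
```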

Slide 9: Function Partitioning
Procedure:
–Evaluate the complexity of each individual process in the function and the communication between processes.
–Cluster processes according to objectives.
–Optimize the partition.

Slide 10: Space-Time-Domain Expansion
Definition: sacrificing processing time to meet the performance requirements.
Time complexity: [formula not preserved in the transcript]

Slide 11: One-Dimensional Partitioning
Keeps the processing size to one column at a time.
Repeatedly feeds in data until the process finishes.
Increases the time complexity by a factor of n (the number of columns).

Slide 12: Two-Dimensional Partitioning
Fixes the processing size to a two-dimensional subset of the original processing.
Increases the time complexity by [factor not preserved]; see the sketch below.
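A sketch of the space-time expansion behind both schemes: when the hardware only fits one column (1-D) or one b x b tile (2-D), the same unit is reused over many time steps. The slide's 2-D factor did not survive transcription; the (n/b)^2 count below is my assumption, consistent with covering an n x n problem tile by tile.

```python
# Space-time-domain expansion sketch: an n x n computation mapped onto
# hardware holding only one column (1-D) or one b x b tile (2-D) at a
# time, so the unit is reused over many passes. The (n/b)^2 count for
# the 2-D case is an assumption, not recovered from the slide.

def one_dim_passes(n):
    """Column-at-a-time: n passes through the same hardware."""
    return n

def two_dim_passes(n, b):
    """b x b tile at a time: (n/b)^2 passes (assumes b divides n)."""
    return (n // b) ** 2

if __name__ == "__main__":
    n, b = 1024, 64
    print("1-D column passes:", one_dim_passes(n))      # 1024
    print("2-D tile passes  :", two_dim_passes(n, b))   # 256
```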

Slide 13: Resource Constraints
Multi-processor
–Software implementation
–Homogeneous system
–Heterogeneous system
Hardware/software (HW/SW) co-processing
–Software and hardware components are co-designed
–Process scheduling
VLSI
–Hardware implementation
–Communication time is negligible

Slide 14: Multi-processor
Heterogeneous system
–Contains computers with different types of parallelism.
–Communication overheads add extra delays.
–Communication tasks such as allocating buffers and setting up DMA channels must be performed by the CPU and cannot be overlapped with the computation.
Host/Master: a powerful processor.
Bottleneck processor: the processor taking the longest time to perform its assigned task (see the sketch below).
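In a heterogeneous system the bottleneck processor is simply the one that finishes its assigned work last. A minimal sketch, with made-up work sizes and speeds:

```python
# Minimal sketch: finding the bottleneck processor in a heterogeneous
# system. Work sizes and processor speeds are made up.

def bottleneck(work, speed):
    """Return the index and finish time of the processor that finishes last."""
    times = [w / s for w, s in zip(work, speed)]
    worst = max(range(len(times)), key=times.__getitem__)
    return worst, times[worst]

if __name__ == "__main__":
    work = [100, 100, 100, 100]   # units of work assigned to each processor
    speed = [10, 8, 4, 9]         # units per second (heterogeneous speeds)
    idx, t = bottleneck(work, speed)
    print(f"bottleneck: processor {idx}, {t:.1f} s")
```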

Slide 15: HW/SW Co-processing
System structure
–SW: a single general-purpose processor, e.g., a Pentium or PowerPC.
–HW: a single hardware coprocessor, e.g., an FPGA or ASIC.
–A block of shared memory.
Design view
–Hardware components: RTL components (adders, multipliers, ALUs, registers).
–Software component: general-purpose processor.
–Communication: between the software component and the local memory.
Partitioning
–The most frequently executed loops generally account for about 90 percent of execution time yet consist of only simple designs, making them natural candidates for hardware (a partitioning sketch follows).
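A hedged sketch of a loop-oriented HW/SW partitioning heuristic consistent with that observation: move the hottest loops into the coprocessor until its area budget is exhausted. The profile data, area budget, and greedy rule are all illustrative assumptions, not the tool described in the slides.

```python
# Greedy HW/SW partitioning sketch: assign the most frequently executed
# loops to the hardware coprocessor until its area budget is spent.
# Profiles, areas, and the greedy rule are illustrative assumptions.

def partition(loops, area_budget):
    """loops: list of (name, exec_time_fraction, hw_area) tuples."""
    hw, sw, used = [], [], 0
    # Consider the hottest loops first (the ~90% of execution time).
    for name, frac, area in sorted(loops, key=lambda l: -l[1]):
        if used + area <= area_budget:
            hw.append(name)
            used += area
        else:
            sw.append(name)
    return hw, sw

if __name__ == "__main__":
    loops = [("fir_filter", 0.55, 40),
             ("dct", 0.30, 35),
             ("parser", 0.05, 50)]
    hw, sw = partition(loops, area_budget=80)
    print("hardware:", hw, " software:", sw)
```

Real co-design tools also weigh speedup and communication cost, and often use exact or metaheuristic search rather than this greedy rule; the sketch only illustrates the 90/10 intuition.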

Slide 16: VLSI
Constraints
–Execution time (DSP ASIC)
–Power consumption
–Design area
–Throughput
Examples
–Globally asynchronous, locally synchronous on-chip bus (time)
–4-way pipelined memory partitioning (throughput)

Slide 17: Questions? Thank you!