Lecture 5: Lecturer: Simon Winberg. Review of paper: Temporal Partitioning Algorithm for a Coarse-grained Reconfigurable Computing Architecture by Chongyong Yin, Shouyi Yin, Leibo Liu and Shaojun Wei.

Presentation transcript:

Lecture 5: Lecturer: Simon Winberg Review of paper: Temporal Partitioning Algorithm for a Coarse-grained Reconfigurable Computing Architecture by Chongyong Yin, Shouyi Yin, Leibo Liu, Shaojun Wei Blocking and Non-blocking comms Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

• The problem
• Terms
• Approach
Licensing details on the last slide

• “Reconfigurable computing architecture (RCA) [1] is intended to fill the gap between application specific integrated circuits (ASICs) and instruction set processors”
• RCAs achieve potentially much higher performance than processors, while maintaining a higher level of flexibility than ASICs.
• A typical RCA consists of reconfigurable hardware along with microprocessors.

• A temporal partitioning algorithm for a coarse-grained reconfigurable computing architecture is presented to improve system performance while satisfying the constraints of the application parts executed on the reconfigurable hardware.

µProcessor / FPGA – Task / computation
• We want to partition the processing so that it executes optimally on the available RCA.
• The CPU has lots of program space, but its programs are composed of a small set of instructions that run sequentially.
• The FPGA has (comparatively) very limited program space, but can implement very complex instructions that run in parallel.
• Can’t fit? Try an alternate partitioning.

• The level-based algorithm:
  • tries to achieve the maximum possible parallelism in each module.
• The clustering-based algorithm:
  • tries to minimize the communication overhead between modules.
… BUT these algorithms each focus on a single partitioning objective, so they aren’t suitable for multi-objective optimization. So, the authors looked at other known approaches such as…

• Network flow-based algorithm to find multiple temporal partition stages for dynamically reconfigurable computing (DRC)
  • But that approach isn’t optimal when the goal is to minimize communication costs.
• Modified network flow-based partitioning algorithm with an added scheduling technique to minimize communication cost
  • Doesn’t consider the prospective parallelism or the number of modules.

PROBLEM FORMULATION
• Given a directed acyclic graph G=(V,E), to ensure proper precedence constraints, each node must be scheduled in a partition (module) no later than its successors.
• For multi-objective optimization it is necessary to build a ready-partition list to avoid deadlocked partitions (see the sketch below).
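To make the ready-list idea concrete, here is a minimal sketch in C (illustrative only, not the authors' algorithm): a node is placed in the current partition only once all of its predecessors sit in strictly earlier partitions, and a new partition is opened when the current one is full or nothing more is ready. The example graph, the capacity, and the "strictly earlier" rule are assumptions made for this sketch.

```c
#include <stdio.h>

#define N 6          /* number of DAG nodes (example size)            */
#define CAPACITY 3   /* assumed maximum nodes per temporal partition  */

/* edge[u][v] = 1 means u must be scheduled before v (u -> v). */
static const int edge[N][N] = {
    /* 0->2, 0->3, 1->3, 2->4, 3->4, 3->5 */
    {0,0,1,1,0,0},
    {0,0,0,1,0,0},
    {0,0,0,0,1,0},
    {0,0,0,0,1,1},
    {0,0,0,0,0,0},
    {0,0,0,0,0,0},
};

int main(void)
{
    int partition_of[N];
    for (int v = 0; v < N; v++) partition_of[v] = -1;  /* -1 = unscheduled */

    int scheduled = 0;
    int current = 0;                     /* index of the current partition */
    while (scheduled < N) {
        int placed = 0;
        /* Ready list: unscheduled nodes whose predecessors are all in
         * strictly earlier partitions (a simplification of the paper's
         * "no later than its successors" constraint).                   */
        for (int v = 0; v < N && placed < CAPACITY; v++) {
            if (partition_of[v] != -1) continue;
            int ready = 1;
            for (int u = 0; u < N; u++)
                if (edge[u][v] &&
                    (partition_of[u] == -1 || partition_of[u] == current))
                    ready = 0;
            if (ready) {
                partition_of[v] = current;
                scheduled++;
                placed++;
            }
        }
        current++;   /* open the next temporal partition (configuration) */
    }

    for (int v = 0; v < N; v++)
        printf("node %d -> partition %d\n", v, partition_of[v]);
    return 0;
}
```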

A selection of benchmark applications is used to compare the prototyped approach against the performance of the standard partitioning algorithms reviewed in the introduction…

Communications & Costs of Communication EEE4084F

• The communication needs between tasks depend on your solution:
• Communication is not needed when there is…
  • Minimal or no shared data or results.
  • E.g., an image processing routine where each pixel is dimmed (e.g., 50% dim). Here, the image can easily be divided between many tasks that act entirely independently of one another (see the sketch below).
  • This is usually the case for embarrassingly parallel solutions.
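A minimal MPI sketch of the pixel-dimming case (the image dimensions and the strip decomposition are assumed for illustration): each task dims only its own strip, so the computation itself requires no inter-task communication; any distribution or collection of the image is left as comments.

```c
#include <mpi.h>
#include <stdlib.h>

#define WIDTH  640   /* assumed image dimensions, for illustration only */
#define HEIGHT 480   /* (HEIGHT is assumed divisible by the task count) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each task owns a horizontal strip of the image and dims its own
     * pixels; no results or boundary data are shared between tasks.   */
    int rows = HEIGHT / size;
    unsigned char *strip = malloc((size_t)rows * WIDTH);

    for (int i = 0; i < rows * WIDTH; i++)
        strip[i] = (unsigned char)(i % 256);  /* stand-in pixel data:
                                                 a real program would
                                                 load the image here   */

    for (int i = 0; i < rows * WIDTH; i++)
        strip[i] /= 2;                        /* 50% dim, fully independent */

    /* ... the strips would be written out or gathered here ...        */
    free(strip);
    MPI_Finalize();
    return 0;
}
```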

• The communication needs between tasks depend on your solution:
• Communication is needed for…
  • Parallel applications that need to share results or boundary information.
  • E.g., modeling 2D heat diffusion over time – this could be divided into multiple parts, but boundary results need to be shared (see the sketch below). Changes to an element in the middle of a partition only affect the boundary after some time.
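As a contrast to the previous case, here is a minimal sketch of the boundary sharing needed for 2D heat diffusion, using MPI and a 1-D row decomposition (the grid sizes and layout are assumptions for the example): on each time step the tasks exchange their edge rows (halos) with their neighbours.

```c
#include <mpi.h>

#define NX 100   /* assumed interior rows per task */
#define NY 100   /* assumed columns                */

/* Exchange the boundary rows of a local block with the neighbouring
 * ranks above and below (1-D row decomposition, illustrative only).  */
void exchange_halos(double grid[NX + 2][NY], int rank, int size)
{
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my top interior row up, receive my bottom halo from below. */
    MPI_Sendrecv(grid[1],      NY, MPI_DOUBLE, up,   0,
                 grid[NX + 1], NY, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send my bottom interior row down, receive my top halo from above. */
    MPI_Sendrecv(grid[NX],     NY, MPI_DOUBLE, down, 1,
                 grid[0],      NY, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```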

• Cost of communications
• Latency vs. bandwidth
• Visibility of communications
• Synchronous vs. asynchronous communications
• Scope of communications
• Efficiency of communications
• Overhead and complexity
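The first two items are often combined into a simple first-order cost model: transfer time ≈ latency + message size / bandwidth. A minimal sketch (the latency and bandwidth values below are placeholders, not measurements of any real system); it also shows why many small messages tend to cost more than one large one.

```c
#include <stdio.h>

/* First-order communication cost model: t = latency + bytes / bandwidth.
 * The constants are illustrative placeholders only.                      */
static double transfer_time(double bytes)
{
    const double latency_s     = 2e-6;  /* assumed 2 microseconds  */
    const double bandwidth_Bps = 1e9;   /* assumed 1 GB/s          */
    return latency_s + bytes / bandwidth_Bps;
}

int main(void)
{
    /* Many small messages pay the latency many times over;
     * one large message pays it only once.                               */
    printf("1000 x 1 kB : %g s\n", 1000 * transfer_time(1e3));
    printf("1 x 1 MB    : %g s\n", transfer_time(1e6));
    return 0;
}
```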

• Communication between tasks incurs overheads, such as:
  • CPU cycles, memory and other resources that could be used for computation are instead used to package and transmit data.
  • Communication also needs synchronization between tasks, which may result in tasks spending time waiting instead of working.
  • Competing communication traffic can also saturate the network bandwidth, causing performance loss.

• Communication is usually both explicit and highly visible when using the message passing programming model.
• Communication may be far less visible when using the data parallel programming model.
• For a data parallel design on a distributed system, communication may be entirely invisible, in that the programmer may have no understanding of (and no easily obtainable means to accurately determine) what inter-task communication is happening.

• Synchronous communications
  • Require some kind of handshaking between tasks that share data / results.
  • May be explicitly structured in the code, under control of the programmer – or may happen at a lower level, not under control of the programmer.
  • Synchronous communications are also referred to as blocking communications because other work must wait until the communication has finished (see the sketch below).
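In MPI, for example, the plain MPI_Send / MPI_Recv pair behaves this way: each call returns only when its buffer can safely be reused, so the task blocks at that point. A minimal sketch (run with at least two ranks; the payload value is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 3.14;                 /* placeholder payload          */
    if (rank == 0) {
        /* Blocking send: returns only once 'value' is safe to reuse.   */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: task 1 waits here until the data arrives.  */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```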

• Asynchronous communications
  • Allow tasks to transfer data between one another independently. E.g., task A sends a message to task B, and task A immediately continues with other work. The point at which task B actually receives, and starts working on, the sent data doesn’t matter.
  • Asynchronous communications are often referred to as non-blocking communications.
  • Allow interleaving of computation and communication, potentially giving less overhead than the synchronous case (see the sketch below).
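The non-blocking MPI counterparts (MPI_Isend / MPI_Irecv) return immediately with a request handle, so other work can be overlapped with the transfer until MPI_Wait is called. A minimal sketch (run with at least two ranks; the "other work" and payload are placeholders):

```c
#include <mpi.h>

/* Placeholder for work that does not depend on the message in flight. */
static void do_other_work(void) { /* ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf = 42.0;                  /* placeholder payload          */
    MPI_Request req;

    if (rank == 0)       /* start the transfer, then carry on working   */
        MPI_Isend(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    do_other_work();     /* computation overlapped with communication   */

    if (rank <= 1)       /* only now do we require the transfer to finish */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```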

• Scope of communications:
  • Knowing which tasks must communicate with each other can be crucial to an effective design of a parallel program.
• Two general types of scope:
  • Point-to-point (P2P)
  • Collective / broadcasting

• Point-to-point (P2P)
  • Involves only two tasks: one task acting as the sender/producer of data, and the other as the receiver/consumer.
• Collective
  • Data sharing between more than two tasks (sometimes specified as a common group or collective).
• Both P2P and collective communications can be synchronous or asynchronous.

Collective communications – typical techniques used for collective communications:
• BROADCAST – the initiator sends the same message to all tasks
• SCATTER – the initiator sends a different message to each task
• GATHER – messages from the tasks are combined together at one task
• REDUCE – only parts, or a reduced form, of the messages are worked on
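These patterns map directly onto the standard MPI collectives (MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce); a minimal sketch with placeholder data and an assumed chunk size:

```c
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 4                        /* assumed elements per task      */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* BROADCAST: the initiator sends the same message to all tasks.    */
    int param = (rank == 0) ? 7 : 0;   /* placeholder value on the root  */
    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* SCATTER: the initiator sends a different chunk to each task.     */
    int *whole = NULL;
    if (rank == 0) {
        whole = malloc(sizeof(int) * CHUNK * size);
        for (int i = 0; i < CHUNK * size; i++) whole[i] = i;  /* dummy   */
    }
    int part[CHUNK];
    MPI_Scatter(whole, CHUNK, MPI_INT, part, CHUNK, MPI_INT, 0,
                MPI_COMM_WORLD);

    /* ... local work on 'part' would go here ...                       */

    /* GATHER: chunks from all tasks are combined back on the root.     */
    MPI_Gather(part, CHUNK, MPI_INT, whole, CHUNK, MPI_INT, 0,
               MPI_COMM_WORLD);

    /* REDUCE: only a reduced form (here a sum) reaches the root.       */
    int local_sum = 0, total = 0;
    for (int i = 0; i < CHUNK; i++) local_sum += part[i];
    MPI_Reduce(&local_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    free(whole);
    MPI_Finalize();
    return 0;
}
```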

• There may be a choice of different communication techniques
  • In terms of hardware (e.g., fiber optics, wireless, bus system), and
  • In terms of the software / protocol used.
• The programmer may need to use a combination of techniques and technologies to establish the most efficient choice (in terms of speed, power, size, etc.).

Image sources: Gold bar: Wikipedia (open commons) IBM Blade (CC by 2.0) ref: Takeaway, Clock, Factory and smoke – public domain CC0 ( Forest of trees: Wikipedia (open commons) Moore’s Law graph, processor families per supercomputer over years – all these creative commons, commons.wikimedia.org
Disclaimers and copyright/licensing details: I have tried to follow the correct practices concerning copyright and licensing of material, particularly image sources that have been used in this presentation. I have put much effort into trying to make this material open access so that it can be of benefit to others in their teaching and learning practice. Any mistakes or omissions with regard to these issues I will correct when notified. To the best of my understanding the material in these slides can be shared according to the Creative Commons “Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)” license, and that is why I selected that license to apply to this presentation (it’s not because I particularly want my slides referenced, but more to acknowledge the sources and generosity of others who have provided free material such as the images I have used).