
High Performance Architectures Dataflow Part 3

2 Dataflow Processors
Recall from basic processor pipelining: hazards limit performance.
- Structural hazards
- Data hazards due to true dependences or name (false) dependences: anti and output dependences
- Control hazards
Name dependences can be removed by:
- compiler (register) renaming
- renaming hardware → advanced superscalars
- single-assignment rule → dataflow computers
Data hazards due to true dependences and control hazards can be avoided if succeeding instructions in the pipeline stem from different contexts → dataflow computers, multithreaded processors.

3 Dataflow Model of Computation
Enabling rule:
- An instruction is enabled (i.e., executable) if all of its operands are available.
- (Von Neumann model: an instruction is enabled if the PC points to it.)
The computational rule, or firing rule, specifies when an enabled instruction is actually executed. Basic instruction firing rule:
- An instruction is fired (i.e., executed) when it becomes enabled.
The effect of firing an instruction is the consumption of its input data (operands) and the generation of output data (results).
Where are the structural hazards?
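The enabling and firing rules above can be sketched as a tiny interpreter. This is an illustrative model only, not any particular machine: the `Node` class, port numbering, and worklist scheduling are all invented for the example.

```python
# A minimal sketch of the dataflow enabling/firing rules (illustrative).
# Node names, ports, and the worklist discipline are invented here.

class Node:
    def __init__(self, name, arity, op, dests):
        self.name = name          # instruction name
        self.arity = arity        # number of input operands
        self.op = op              # function applied on firing
        self.dests = dests        # (node, port) pairs receiving the result
        self.operands = {}        # port -> value received so far

    def enabled(self):
        # Enabling rule: all operands are available.
        return len(self.operands) == self.arity

def run(nodes, initial_tokens):
    """initial_tokens: list of (node, port, value). Returns the last
    result produced by each node."""
    results = {}
    worklist = list(initial_tokens)
    while worklist:
        node, port, value = worklist.pop()
        node.operands[port] = value
        if node.enabled():
            # Firing rule: consume the operands, produce the result.
            args = [node.operands[p] for p in sorted(node.operands)]
            node.operands = {}
            out = node.op(*args)
            results[node.name] = out
            for dest, dport in node.dests:
                worklist.append((dest, dport, out))
    return results

# Example graph for (a + b) * (a - b), with a = 5, b = 3:
mul = Node("mul", 2, lambda x, y: x * y, [])
add = Node("add", 2, lambda x, y: x + y, [(mul, 0)])
sub = Node("sub", 2, lambda x, y: x - y, [(mul, 1)])
tokens = [(add, 0, 5), (add, 1, 3), (sub, 0, 5), (sub, 1, 3)]
print(run([add, sub, mul], tokens)["mul"])  # 16
```

Note that no program counter appears anywhere: execution order is driven purely by token arrival, which is exactly the contrast with the von Neumann model drawn above.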

4 Dataflow languages
Main characteristic: the single-assignment rule
- A variable may appear on the left-hand side of an assignment only once within the area of the program in which it is active.
Examples: VAL, Id, LUCID
A dataflow program is compiled into a dataflow graph: a directed graph whose named nodes represent instructions and whose arcs represent data dependences among instructions.
- The dataflow graph is similar to the dependence graphs used in the intermediate representations of compilers.
During execution, data propagate along the arcs in data packets called tokens. This flow of tokens enables some of the nodes (instructions) and fires them.
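The single-assignment rule can be illustrated in any language by simply never rebinding a name; here is a hedged sketch in Python rather than VAL, Id, or LUCID. Because every name is bound exactly once, each name corresponds to exactly one arc of the dataflow graph.

```python
# Illustration of the single-assignment style (not real VAL/Id/LUCID
# syntax). Every name is bound exactly once, so each name maps to one
# arc in the dataflow graph.

def mean_of_squares(a, b):
    # Imperative style might reuse a temporary:  t = a*a; t = t + b*b
    # Single-assignment style introduces a fresh name per value:
    a2 = a * a        # one assignment to a2
    b2 = b * b        # one assignment to b2
    s = a2 + b2       # one assignment to s
    return s / 2

print(mean_of_squares(3, 4))  # 12.5
```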

5 Dataflow Architectures - Overview
Pure dataflow computers:
- static,
- dynamic,
- and the explicit token store architecture.
Hybrid dataflow computers:
- augment the dataflow computation model with control-flow mechanisms, such as the RISC approach, complex machine operations, multithreading, large-grain computation, etc.

6 Pure Dataflow
A dataflow computer executes a program by receiving, processing, and sending out tokens, each containing some data and a tag.
Dependences between instructions are translated into tag matching and tag transformation.
Processing starts when a set of matched tokens arrives at the execution unit. The instruction, which is fetched from the instruction store according to the tag information, specifies
- what to do with the data
- and how to transform the tags.
The matching unit and the execution unit are connected by an asynchronous pipeline, with queues added between the stages.
Some form of associative memory is required to support token matching:
- a real memory with associative access,
- a simulated memory based on hashing,
- or a direct matched memory.

7 Static Dataflow
A dataflow graph is represented as a collection of activity templates, each containing:
- the opcode of the represented instruction,
- operand slots for holding operand values,
- and destination address fields, referring to the operand slots in subsequent activity templates that need to receive the result value.
Each token consists only of a value and a destination address.
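A minimal sketch of these activity templates, following the description above; field names and the two-slot layout are invented for illustration.

```python
# Sketch of static-dataflow activity templates: opcode, operand slots,
# destination address fields. Field names are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Destination:
    template: "ActivityTemplate"
    slot: int                      # which operand slot receives the result

@dataclass
class ActivityTemplate:
    opcode: str
    slots: list = field(default_factory=lambda: [None, None])
    dests: list = field(default_factory=list)

    def deliver(self, slot, value):
        # A token is just (value, destination address); delivering it
        # fills one operand slot.
        self.slots[slot] = value

    def enabled(self):
        return all(v is not None for v in self.slots)

# Templates for (a + b) * c, with a = 2, b = 3, c = 4:
mul = ActivityTemplate("MUL")
add = ActivityTemplate("ADD", dests=[Destination(mul, 0)])
add.deliver(0, 2); add.deliver(1, 3)
mul.deliver(1, 4)
if add.enabled():
    result = add.slots[0] + add.slots[1]
    for d in add.dests:                 # forward result tokens
        d.template.deliver(d.slot, result)
print(mul.slots)  # [5, 4]
```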

8 Dataflow graph and Activity template

9 Acknowledgement signals
Notice that different tokens destined for the same destination cannot be distinguished.
The static dataflow approach therefore allows at most one token on any arc, extending the basic firing rule as follows:
- An enabled node is fired if there is no token on any of its output arcs.
The restriction is implemented by acknowledgement signals (additional tokens) traveling along additional arcs from consuming to producing nodes.
With acknowledgement signals, the firing rule can revert to its original form:
- A node is fired at the moment it becomes enabled.
Again: structural hazards are ignored, assuming unlimited resources!

10 MIT Static Dataflow Machine

11 Deficiencies of Static Dataflow
Consecutive iterations of a loop can only be pipelined, not executed in parallel.
Due to acknowledgement tokens, the token traffic is doubled.
Lack of support for programming constructs that are essential to modern programming languages:
- no procedure calls,
- no recursion.
Advantage: a simple model.

12 Dynamic Dataflow
Each loop iteration or subprogram invocation should be able to execute in parallel, as a separate instance of a reentrant subgraph. The replication is only conceptual.
Each token has a tag:
- the address of the instruction for which the particular data value is destined,
- and context information.
Each arc can be viewed as a bag that may contain an arbitrary number of tokens with different tags.
The enabling and firing rule is now:
- A node is enabled and fired as soon as tokens with identical tags are present on all of its input arcs.
Structural hazards are again ignored!
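The tagged-token rule above can be sketched as follows. The matching store here is a plain dict keyed by (tag, instruction); real machines implemented this step with hashing hardware, and the tag layout below is invented for illustration.

```python
# Sketch of dynamic-dataflow matching: a node fires as soon as tokens
# with identical tags are present on all of its input arcs. Tag format
# and token fields are invented for illustration.

from collections import defaultdict

def match(tokens, arity):
    """tokens: iterable of (tag, instruction, port, value).
    Yields (tag, instruction, operands) for each enabled activity."""
    waiting = defaultdict(dict)           # (tag, instr) -> {port: value}
    for tag, instr, port, value in tokens:
        slots = waiting[(tag, instr)]
        slots[port] = value
        if len(slots) == arity[instr]:    # all operands present: enabled
            yield tag, instr, [slots[p] for p in sorted(slots)]
            del waiting[(tag, instr)]     # firing consumes the tokens

# Two iterations (contexts) of the same ADD node are in flight at once;
# the tag keeps their operands apart even though arrivals interleave.
stream = [
    (("loop", 1), "ADD", 0, 10),
    (("loop", 2), "ADD", 0, 30),
    (("loop", 2), "ADD", 1, 4),
    (("loop", 1), "ADD", 1, 2),
]
fired = list(match(stream, {"ADD": 2}))
print(fired)
# [(('loop', 2), 'ADD', [30, 4]), (('loop', 1), 'ADD', [10, 2])]
```

Note how iteration 2 fires before iteration 1, which is exactly the out-of-order unfolding of parallelism that static dataflow cannot express.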

13 MERGE and SWITCH nodes

14 Branch Implementations
(figure: branch; speculative branch evaluation)

15 Basic Loop Implementation
- L: initiation, new loop context
- D: increments the loop iteration number
- D⁻¹: resets the loop iteration number to 1
- L⁻¹: restores the original context

16 Function application
- A: create a new context
- BEGIN: replicate tokens for each fork
- END: return results, unstack the return address
- A⁻¹: replicate output for successors

17 MIT Tagged-Token Dataflow Architecture

18 Manchester Dataflow Machine

19 Advantages and Deficiencies of Dynamic Dataflow
Major advantage: better performance than static dataflow, because multiple tokens are allowed on each arc, thereby unfolding more parallelism.
Problems:
- Efficient implementation of the matching unit that collects tokens with matching tags: an associative memory would be ideal, but it is not cost-effective, since the amount of memory needed to store tokens waiting for a match tends to be very large. All existing machines use some form of hashing.
- Poor single-thread performance (when not enough workload is present).
- Dyadic instructions lead to pipeline bubbles when the first operand tokens arrive.
- No instruction locality.
- No use of registers.

20 Explicit Token Store (ETS) Approach
Target: efficient implementation of token matching.
Basic idea: allocate a separate frame in a frame memory for each active loop iteration or subprogram invocation. A frame consists of slots; each slot holds an operand that is used in the corresponding activity.
Access to slots is direct (i.e., through offsets relative to the frame pointer) → no associative search is needed.
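The ETS idea can be sketched as direct addressing plus a presence bit. The frame layout, the adjacent-slot pairing for dyadic instructions, and the allocator below are all invented for illustration; the point is only that matching is a direct memory lookup, with no associative search.

```python
# Sketch of explicit-token-store matching: a token carries a frame
# pointer and an offset; the operand slot is found by direct addressing
# plus a presence bit. Frame layout and slot pairing are invented.

EMPTY = object()               # presence bit: slot not yet written

class FrameMemory:
    def __init__(self, size):
        self.mem = [EMPTY] * size
        self.next_free = 0

    def alloc(self, nslots):
        # Allocate a frame for one loop iteration / call activation.
        fp, self.next_free = self.next_free, self.next_free + nslots
        return fp

    def arrive(self, fp, offset, value):
        """Token arrival for a dyadic instruction whose two operands
        live in adjacent slots. Returns both operands once matched."""
        slot = fp + offset
        partner = slot + 1 if offset % 2 == 0 else slot - 1
        if self.mem[partner] is EMPTY:
            self.mem[slot] = value        # first operand: store and wait
            return None
        other = self.mem[partner]
        self.mem[partner] = EMPTY         # match: consume both operands
        return (other, value) if offset % 2 else (value, other)

fm = FrameMemory(64)
fp = fm.alloc(2)                          # frame for one activation
print(fm.arrive(fp, 0, 7))                # None: first token waits
print(fm.arrive(fp, 1, 8))                # (7, 8): matched by direct lookup
```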

21 Explicit Token Store

22 Monsoon, an Explicit Token Store Machine
(figure: processing elements and I-structure storage units connected by a multistage packet-switching network; each PE pipeline comprises instruction fetch, effective address generation, presence bit operation, frame operation, ALU, and form token stages, fed by user and system token queues, with frame memory and instruction memory alongside)

23 WaveCache: A Dataflow Processor
WaveScalar is the ISA of a dataflow processor named WaveCache.
The WaveCache is a grid of approximately 2K processing elements (PEs), arranged into clusters of 16.

24 WaveCache: A Dataflow Processor
A WaveScalar executable contains an encoding of the program's dataflow graph.
Instructions explicitly send data values to the instructions that need them, instead of broadcasting them via a register file.
The potential consumers are known at compile time but, depending on control flow, only a subset of them should receive the values at run-time.

25 WaveCache: A Dataflow Processor
Traditional imperative languages provide the programmer with a model of memory known as total load-store ordering.
WaveScalar brings load-store ordering to dataflow computing using wave-ordered memory.
Wave-ordered memory annotates each memory operation with its location in its wave and its ordering relationships (defined by the control flow graph) with other memory operations in the same wave.
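A greatly simplified sketch of the idea: each memory operation carries a sequence number within its wave, and the memory interface buffers out-of-order arrivals so operations are applied in program order. (Real WaveScalar annotations also carry predecessor/successor links so that ordering works across control flow; that is omitted here, and the request format is invented for illustration.)

```python
# Simplified sketch of wave-ordered memory: requests may arrive in any
# order, but are applied to memory ordered by their sequence number
# within the wave. Request format is invented for illustration.

def apply_in_wave_order(requests, memory):
    """requests: list of (seq, op, addr, value) in arrival order.
    Applies them to `memory` (a dict) in seq order; returns load results."""
    pending = {}
    next_seq = 0
    loads = []
    for seq, op, addr, value in requests:
        pending[seq] = (op, addr, value)
        while next_seq in pending:         # drain every ready operation
            op, addr, value = pending.pop(next_seq)
            if op == "store":
                memory[addr] = value
            else:                          # load
                loads.append(memory.get(addr))
            next_seq += 1
    return loads

mem = {}
# The load (seq 1) arrives before the store (seq 0) it depends on, yet
# still observes the stored value:
out = apply_in_wave_order([(1, "load", 0x10, None),
                           (0, "store", 0x10, 42)], mem)
print(out)  # [42]
```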