Transport-triggered processors. Jani Boutellier, Computer Science and Engineering Laboratory, 01.01.2011

Presentation transcript:

Transport-triggered processors
Jani Boutellier, Computer Science and Engineering Laboratory
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

What you will learn today
- What components a TTA processor consists of
- What TTA programs look like in machine code
- Basic optimization of TTA programs

Transport-triggered architecture
Transport-triggered architecture (TTA) processors:
- An evolution of the VLIW
- Only 1 instruction: move data, so the compiler needs to do a lot of work
- Can be very efficient
- Easy to design, scalable

Transport-triggered architecture
[Figure: a TTA processor. The function units (adder +, multiplier *), the register file (RF), the I/O unit (IO) and the instruction unit are attached to a transport bus.]

Transport-triggered architecture
- TTAs do not have an instruction set; instead, the programmer (compiler) directly defines data transports between functional units
- RISC, CISC and VLIW processors move data between FUs through registers. A TTA can send data directly from one FU to another, which makes it possible to save power

Transport-triggered architecture
- The general architecture of a TTA processor is very scalable: adding a new functional unit increases the complexity linearly
- The VLIW problem that TTA does not directly solve is that of code density

TTA structure

TTA processors
[Figure: the same TTA processor (adder +, multiplier *, RF, IO, instruction unit). Each function unit is attached to the transport bus through a socket.]

TTA processors
[Figure: a multiplier (*) function unit and its ports]
- Function units connect to sockets through ports
- Ports have either input or output direction
- This multiplier has two inputs for operands and one output for the result
- One of the inputs always triggers the FU

Computation example

Computation example
[Figure: TTA processor with adder (+), multiplier (*), RF, IO and instruction unit]

C code:
    a = READ_IO();
    b = READ_IO();
    c = a + b * b;
    WRITE_IO(c);

TTA moves:
    mov IO(0) -> RF(a0);     IO(0) is used to read data from outside
    mov IO(0) -> RF(a1);     RF is a register file, to store data
    mov RF(a1) -> mul(0);    mul(0) stores operand 1 of the multiplier
    mov RF(a1) -> mul(1);    mul(1) stores operand 2 and triggers
    mov mul(2) -> RF(a2);    mul(2) provides the multiplication result
    mov RF(a0) -> add(0);    add(0) stores operand 1 of the adder
    mov RF(a2) -> add(1);    b*b was stored to RF(a2) two lines before
    mov add(2) -> IO(1);     IO(1) writes data to the outside

Computation example
The program above is not optimal. What could be done better?
Circulating the data through the RF is not necessary!
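The point about bypassing the register file can be checked with a small simulation. The sketch below is a toy interpreter for the move syntax used on the slides, written under assumptions taken from the slide comments: port 0 of a function unit latches operand 1, port 1 receives operand 2 and triggers the operation, and port 2 holds the result. The names `run`, `ORIGINAL` and `BYPASSED` are hypothetical, and the `BYPASSED` listing is one possible reading of the hint, not a program from the original slides.

```python
import re

# Matches the "SRC(port) -> DST(port)" part of a move such as "mov IO(0) -> RF(a0);"
MOVE = re.compile(r"(\w+)\((\w+)\)\s*->\s*(\w+)\((\w+)\)")

def run(program, inputs):
    """Execute the moves one by one; return (values written out, number of moves)."""
    pending = list(inputs)   # values waiting at the IO unit
    outputs = []             # values written to the outside
    rf = {}                  # register file contents
    operand = {}             # operand latch (port 0) of each function unit
    result = {}              # result register (port 2) of each function unit
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}

    for move in program:
        src, sport, dst, dport = MOVE.search(move).groups()
        # Read the source port.
        if src == "IO":
            value = pending.pop(0)
        elif src == "RF":
            value = rf[sport]
        else:
            value = result[src]
        # Write the destination port.
        if dst == "IO":
            outputs.append(value)
        elif dst == "RF":
            rf[dport] = value
        elif dport == "0":
            operand[dst] = value          # store operand 1, no trigger
        else:
            result[dst] = ops[dst](operand[dst], value)  # port 1 triggers the FU
    return outputs, len(program)

# The eight moves from the slide.
ORIGINAL = [
    "mov IO(0) -> RF(a0);", "mov IO(0) -> RF(a1);",
    "mov RF(a1) -> mul(0);", "mov RF(a1) -> mul(1);",
    "mov mul(2) -> RF(a2);", "mov RF(a0) -> add(0);",
    "mov RF(a2) -> add(1);", "mov add(2) -> IO(1);",
]

# Assumed optimization: a goes straight to the adder, b*b straight from mul to add.
# b is still stored once because it is needed twice.
BYPASSED = [
    "mov IO(0) -> add(0);", "mov IO(0) -> RF(a1);",
    "mov RF(a1) -> mul(0);", "mov RF(a1) -> mul(1);",
    "mov mul(2) -> add(1);", "mov add(2) -> IO(1);",
]

if __name__ == "__main__":
    print(run(ORIGINAL, [3, 4]))
    print(run(BYPASSED, [3, 4]))
```

With inputs a = 3 and b = 4, both versions write 3 + 4 * 4 = 19 to the output, but the bypassed version needs six moves instead of eight.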

Multiple buses
This TTA processor has one bus. How would the functionality of the processor change if there were a second bus?
Every additional bus adds the possibility of another parallel transfer.

Multi-bus example

      | Cycle 0                | Cycle 1                | Cycle 2                | Cycle 3
Bus 0 | mov IO(o1) -> add(i1)  |                        | mov add(o1) -> IO(i1)  | mov mul(o1) -> IO(i1)
Bus 1 | mov RF(o1) -> add(i2)  |                        | mov IO(o1) -> add(i1)  | mov add(o1) -> RF(i1)
Bus 2 | ...                    | mov add(o1) -> mul(i1) | mov RF(o1) -> add(i2)  | ...
Bus 3 | ...                    | mov mul(o1) -> mul(i2) | ...                    | mov IO(o1) -> add(i1)
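How much a second bus helps depends on the data dependencies between moves. The following sketch applies simple greedy list scheduling to the eight-move program of the computation example. It is not the scheduler of any real TTA compiler, and the timing model is an assumption: a dependency with distance 1 means "at least one cycle later" (for example, a result is readable the cycle after the trigger), while distance 0 means the two moves may share a cycle (an operand move and its trigger move). The names `schedule`, `MOVES` and `DEPS` are hypothetical.

```python
def schedule(n_moves, deps, n_buses):
    """Greedy list scheduling of moves onto buses.

    deps maps a move index to a list of (earlier move index, minimum distance).
    Moves are visited in program order and placed in the earliest cycle that
    satisfies all dependencies and still has a free bus.
    Returns a dict: move index -> cycle.
    """
    cycle_of = {}
    used = {}  # cycle -> number of buses already occupied
    for i in range(n_moves):
        earliest = max((cycle_of[j] + dist for j, dist in deps.get(i, [])), default=0)
        cycle = earliest
        while used.get(cycle, 0) >= n_buses:
            cycle += 1
        cycle_of[i] = cycle
        used[cycle] = used.get(cycle, 0) + 1
    return cycle_of

def length(cycle_of):
    """Number of cycles the schedule occupies."""
    return max(cycle_of.values()) + 1

MOVES = [
    "IO(0) -> RF(a0)",   # 0
    "IO(0) -> RF(a1)",   # 1
    "RF(a1) -> mul(0)",  # 2
    "RF(a1) -> mul(1)",  # 3
    "mul(2) -> RF(a2)",  # 4
    "RF(a0) -> add(0)",  # 5
    "RF(a2) -> add(1)",  # 6
    "add(2) -> IO(1)",   # 7
]

# Assumed dependencies: (earlier move, minimum distance in cycles).
DEPS = {
    1: [(0, 1)],          # both read the same IO port, so one after the other
    2: [(1, 1)],
    3: [(1, 1), (2, 0)],  # trigger may share a cycle with the operand move
    4: [(3, 1)],
    5: [(0, 1)],
    6: [(4, 1), (5, 0)],
    7: [(6, 1)],
}

if __name__ == "__main__":
    for buses in (1, 2, 4):
        print(buses, "bus(es):", length(schedule(len(MOVES), DEPS, buses)), "cycles")
```

Under these assumptions the program takes 8 cycles on one bus and 6 cycles on two buses. Going from two to four buses brings no further gain, because the remaining moves form a dependency chain.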

Multiple buses
Going into detail, not every socket is actually connected to every bus. Fewer connections mean lower power consumption.

TTA instructions

TTA instructions
But what do TTA instructions look like in binary format?
One instruction contains 42 bits for each bus.

TTA instructions
Each bus needs a 42-bit instruction every clock cycle. Where do the 42 bits come from?
- source port
- destination port
- opcode
- guard bits
- immediate values
How wide is an 8-bus TTA instruction? 336 bits.
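The arithmetic behind the 336-bit answer is simply the number of buses times the width of one bus slot. A minimal sketch, using the 42-bit slot width from the slides; the function names are hypothetical, and the split of the 42 bits into individual fields is deliberately not modeled because the slides do not give it:

```python
BITS_PER_BUS = 42  # width of one move slot, as stated on the slide

def instruction_width(n_buses, bits_per_bus=BITS_PER_BUS):
    """Width in bits of one instruction word for a processor with n_buses buses."""
    return n_buses * bits_per_bus

def program_memory_bits(n_instructions, n_buses):
    """Uncompressed program memory needed for n_instructions instruction words."""
    return n_instructions * instruction_width(n_buses)

if __name__ == "__main__":
    print(instruction_width(4))   # 4-bus processor
    print(instruction_width(8))   # 8-bus processor
    print(program_memory_bits(1000, 8) // 8, "bytes for 1000 instructions")
```

A 4-bus processor therefore has 168-bit instruction words and an 8-bus processor 336-bit ones; a 1000-instruction program on the 8-bus machine would occupy 42000 bytes uncompressed.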

TTA instructions
[Figure: an instruction word divided into slots for Bus 1, Bus 2, Bus 3 and Bus 4 plus an immediate field; a bus slot contains guard, source and destination fields.]

TTA instructions
- Very long instruction words (like 168 or 336 bits) require a lot of program memory space if the program is long
- To make the problem less severe, instruction compression techniques exist
- Instruction compression is based on a dictionary: compressed instructions are just index numbers that point to the full instruction in the dictionary
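The dictionary idea can be sketched in a few lines. This is a generic illustration, not the compression scheme of a particular TTA toolchain: every distinct instruction word is stored once, and the program becomes a list of indices into that dictionary. The function names and the example program are hypothetical; instruction words are represented as plain integers.

```python
def compress(program):
    """Return (dictionary, indices): each distinct word is stored once."""
    dictionary = []
    index_of = {}
    indices = []
    for word in program:
        if word not in index_of:
            index_of[word] = len(dictionary)
            dictionary.append(word)
        indices.append(index_of[word])
    return dictionary, indices

def decompress(dictionary, indices):
    """Look every index up in the dictionary to recover the full program."""
    return [dictionary[i] for i in indices]

def compressed_bits(dictionary, indices, word_bits):
    """Storage for the dictionary plus the index stream, in bits."""
    index_bits = max(1, (len(dictionary) - 1).bit_length())
    return len(dictionary) * word_bits + len(indices) * index_bits

if __name__ == "__main__":
    # Eight 168-bit instruction words, only three of them distinct.
    program = [0xA, 0xB, 0xA, 0xA, 0xC, 0xB, 0xA, 0xA]
    dictionary, indices = compress(program)
    print("uncompressed:", len(program) * 168, "bits")
    print("compressed:  ", compressed_bits(dictionary, indices, 168), "bits")
```

In this example the eight words shrink from 1344 bits to 520 bits (three dictionary entries of 168 bits plus eight 2-bit indices). The saving depends entirely on how often instruction words repeat.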

Performance optimization

Performance optimization
The SW/HW designer of TTA processors must know the central issues of performance optimization:
- How the algorithm works
- What resources the algorithm needs
- How the C compiler works

Performance optimization
- The strength of TTA processors is that they can route data directly from one place to another, without obligatory register/memory stores
- Memory accesses are slow, so the program should only access data memory when really necessary

Performance optimization
The TTA processor for this code should have enough register space that memory accesses are not needed for this loop.

Performance optimization
By examining the assembly code (the output of the C compiler), one can see whether the loop accesses the load-store unit (LSU). If it does, memory is accessed.
[Figure: assembly listing with one column per bus: Bus 1, Bus 2, Bus 3, Bus 4]

Performance optimization
The functionality of a signal processor must be balanced for high efficiency (low gate count, high throughput).
FIR example: you start with a processor that has 1 multiplier and 1 adder, and you want to make the processor 3 times faster. If you give the processor 3 multipliers, you probably also need 3 adders.
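The FIR argument can be made concrete with a rough throughput model. The sketch assumes that each filter tap costs one multiplication and one addition, that every function unit can start one operation per cycle, and that buses and registers are never the limiting factor. These are simplifying assumptions for illustration only, and `cycles_per_output` is a hypothetical name.

```python
import math

def cycles_per_output(taps, multipliers, adders):
    """Lower bound on cycles per FIR output sample.

    The slower of the two resource classes sets the pace: taps
    multiplications shared among the multipliers, taps additions
    shared among the adders.
    """
    return max(math.ceil(taps / multipliers), math.ceil(taps / adders))

if __name__ == "__main__":
    taps = 12
    for mults, adds in [(1, 1), (3, 1), (3, 3)]:
        print(f"{mults} multiplier(s), {adds} adder(s):",
              cycles_per_output(taps, mults, adds), "cycles per output")
```

For a 12-tap filter the model gives 12 cycles with one multiplier and one adder, still 12 cycles with three multipliers and one adder, and 4 cycles only when both are tripled. The two extra multipliers alone would add gates without adding throughput.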

Performance optimization
Profiling tools are used to see if the processor is balanced. Things to look for:
- if there is a FU that is used much more often than others, it is probably a bottleneck
- if there is a FU that has (almost) no accesses, it can be removed to save on gate count
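The two rules of thumb above can be applied mechanically to profiler output. The sketch below assumes the profiler reports, per function unit, how many times it was triggered during a run of known length; the input format, the function name `analyse` and the two thresholds are assumptions chosen for illustration and do not come from any specific profiling tool.

```python
def analyse(trigger_counts, total_cycles, busy=0.8, idle=0.02):
    """Classify each function unit by its utilisation.

    trigger_counts: dict mapping FU name -> number of cycles it was triggered.
    busy / idle:    assumed thresholds for "probably a bottleneck" and
                    "(almost) no accesses".
    Returns a dict: FU name -> (utilisation, verdict).
    """
    report = {}
    for fu, count in trigger_counts.items():
        utilisation = count / total_cycles
        if utilisation >= busy:
            verdict = "possible bottleneck"
        elif utilisation <= idle:
            verdict = "candidate for removal"
        else:
            verdict = "ok"
        report[fu] = (utilisation, verdict)
    return report

if __name__ == "__main__":
    # Hypothetical profile of a 1000-cycle run.
    counts = {"mul": 950, "add": 400, "shift": 3}
    for fu, (utilisation, verdict) in analyse(counts, 1000).items():
        print(f"{fu:6s} {utilisation:6.1%}  {verdict}")
```

In this made-up profile the multiplier is busy in almost every cycle and is flagged as a possible bottleneck, while the shifter is nearly unused and could be removed to save gates.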