ARMOR Asynchronous RISC Microprocessor. Technion - Israel Institute of Technology, High Speed Digital Systems Laboratory, Faculty of Electrical Engineering. Submitted by: Tziki Oz-Sinay, Ori Lempel.

Presentation transcript:

ARMOR Asynchronous RISC Microprocessor
Technion - Israel Institute of Technology
High Speed Digital Systems Laboratory, Faculty of Electrical Engineering
Submitted by: Tziki Oz-Sinay, Ori Lempel
Supervised by: Rony Mitleman
Mid-Semester Presentation

Milestones Reached
Development platform selected
–Balsa over Petrify+VHDL
Micro-Architecture Specification (MAS) completed
–Functional block partition, datapath interface defined
–Asynchronous handshaking protocol defined
Detailed asynchronous pseudo-code implementation written
Balsa code writing, dynamic simulation and synthesis started

Development Platform Selection
Two development environments were examined:
Balsa: a language for synthesising large asynchronous circuits and systems. Compiles to a small, parametric set of handshake components
–Balsa flow
–Balsa initial flow
Petrify: a synthesis tool for Petri nets and asynchronous controllers. Reads a Petri net and generates another Petri net that is simpler than the original description but behaviorally similar

Development Platform Selection (cont.)
Balsa’s Advantages
–One development environment → easier debugging and integration
–Synthesis implements a delay-insensitive circuit → implementation is transparent to the developer (no need for timing analysis)
–Control channels are automatically created at compilation
–High-level language → easier to learn

Development Platform Selection (cont.)
Petrify’s Advantages
–A more mature environment than Balsa
–When using Petrify, the core of the system is written in VHDL → all of the tools/flows are well known and supported in the lab
–Petrify’s output is translated to Verilog, while Balsa only supports EDIF synthesis → higher-level output, compatible with Altera

The Balsa Environment Was Chosen
This imposes new hardware requirements:
–A simplified design, comprising an in-order pipeline and no external memory, will be synthesized on a Xilinx Spartan FPGA
–The complete design will later be implemented on a Xilinx Virtex-II Pro

Handshake Protocol
4-phase protocol
[Figure: push channel and pull channel, each with REQ, ACK and n-bit DATA signals]
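The slide only names the signals, so the following is a minimal Python sketch of a four-phase (return-to-zero) handshake on a push and a pull channel. The `Channel` class and both function names are hypothetical illustrations, not Balsa code, and the sketch assumes data stays valid while the initiating or answering signal is high.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Channel:
    """REQ/ACK/DATA wires of one channel (hypothetical model, not Balsa code)."""
    req: int = 0
    ack: int = 0
    data: Optional[int] = None
    trace: List[Tuple[int, int]] = field(default_factory=list)

    def step(self, req: int, ack: int) -> None:
        """Drive the control wires and record the (REQ, ACK) pair."""
        self.req, self.ack = req, ack
        self.trace.append((req, ack))


def push_transfer(ch: Channel, value: int) -> int:
    """Push channel: the sender initiates, data travels with REQ."""
    ch.data = value
    ch.step(1, 0)        # phase 1: sender places data and raises REQ
    received = ch.data   # receiver latches the data
    ch.step(1, 1)        # phase 2: receiver raises ACK
    ch.step(0, 1)        # phase 3: sender lowers REQ
    ch.step(0, 0)        # phase 4: receiver lowers ACK, channel is idle
    return received


def pull_transfer(ch: Channel, produce: Callable[[], int]) -> int:
    """Pull channel: the receiver initiates, data travels with ACK."""
    ch.step(1, 0)        # phase 1: receiver raises REQ to ask for data
    ch.data = produce()  # sender places the data
    ch.step(1, 1)        # phase 2: sender raises ACK
    received = ch.data   # receiver latches while ACK is high
    ch.step(0, 1)        # phase 3: receiver lowers REQ
    ch.step(0, 0)        # phase 4: sender lowers ACK, channel is idle
    return received
```

Both directions go through the same four control transitions; only the side that drives DATA differs.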

ARMOR Pipestages
[Figure: pipeline block diagram. Blocks: Instruction Cache, Fetch, Decode, Rename, Out Of Order Engine, Execute, Write Back, Retire, Data Cache. Signals: PC[15:0], Inst[15:0], VInst[15:0], Op[3:0], LDst[3:0], LSrc[3:0], PDst[3:0], SrcVal1[15:0], SrcVal2[15:0], Imm[11:0], DataIn[15:0], DataOut[15:0], Addr[15:0], ReadWrite#, ALU0PDst[3:0], ALU0Res[15:0], ALU1PDst[3:0], ALU1Res[15:0], MemPDst[3:0], Val[15:0], BranchDecision]

Instruction Fetch Unit (IFU)
Function:
–Fetch the instruction pointed to by the PC register from the instruction cache.
–Execute the jump instruction.
–Calculate branch addresses, speculatively fetch branch target instructions and stall the pipeline pending the branch decision.
[Figure: IFU datapath. Labels: adder (+), PC+2, branch offset, branch instruction, next instruction, to instruction cache, to ID, branch decision]

Instruction Decoder (ID)
Function:
–Tag instructions by type (REGREG, REGIMM, MEM, BRANCH).
–Queue up to 4 issue-pending instructions, thus allowing continuous instruction fetching in case instruction issue stalls.
[Figure: instruction queue with V and Inst fields, head and tail pointers]
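The queue described above can be pictured as a small circular buffer. The sketch below is an illustrative Python model only: the class and method names are assumptions, and it covers the 4-entry queue with valid bits and head/tail pointers, not the type tagging.

```python
class InstQueue:
    """4-entry circular instruction queue (illustrative model, names assumed)."""
    DEPTH = 4

    def __init__(self):
        self.valid = [False] * self.DEPTH  # V bit per entry
        self.inst = [0] * self.DEPTH       # queued instruction words
        self.head = 0                      # oldest entry, next to issue
        self.tail = 0                      # next free slot

    def push(self, inst):
        """Accept a fetched instruction; False means fetch must stall."""
        if self.valid[self.tail]:
            return False
        self.inst[self.tail] = inst
        self.valid[self.tail] = True
        self.tail = (self.tail + 1) % self.DEPTH
        return True

    def pop(self):
        """Issue the oldest instruction, or None if the queue is empty."""
        if not self.valid[self.head]:
            return None
        inst = self.inst[self.head]
        self.valid[self.head] = False
        self.head = (self.head + 1) % self.DEPTH
        return inst
```

Fetch can keep calling `push` while issue is stalled until all four entries are valid; a single `pop` frees a slot again.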

Register Alias Table (RAT)
[Figure: out-of-order engine block diagram. Blocks: ROB, RRF, RAT, RS0, RS1, ALU0, ALU1, Data Cache. Labels: Inst from ID, BranchDecision to IFU, branches, non-mem inst, mem inst, non-branch inst]

Function:
–Register renaming: map logical sources/destinations to physical registers (ROB/RRF entries):
  Allocate physical destination (PDst) pointers during instruction issue
  Reset pointers during retirement (CAM-match logic)
–Monitor data-readiness of physical sources/destinations:
  Reset ready-bit during instruction issue
  Set ready-bit during writeback (CAM-match logic)
[Figure: RAT table with rows R0 to R7 and columns PDst and Ready]
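As a rough illustration of the four RAT actions listed above, here is a small Python model. It is a sketch under assumptions: the class and method names are invented, `None` stands for "value lives in the RRF", and the loops stand in for the CAM-match logic.

```python
class RAT:
    """Register alias table model (hypothetical names, 8 logical registers)."""

    def __init__(self, n_logical=8):
        self.pdst = [None] * n_logical   # None: value is in the RRF
        self.ready = [True] * n_logical  # ready-bit per logical register

    def lookup(self, lsrc):
        """Rename a logical source: return (physical pointer, ready-bit)."""
        return self.pdst[lsrc], self.ready[lsrc]

    def issue(self, ldst, pdst):
        """Allocate a PDst for a logical destination and reset its ready-bit."""
        self.pdst[ldst] = pdst
        self.ready[ldst] = False

    def writeback(self, pdst):
        """Set the ready-bit of every entry whose pointer matches pdst."""
        for i, p in enumerate(self.pdst):
            if p == pdst:
                self.ready[i] = True

    def retire(self, pdst):
        """Reset every pointer that still matches the retiring pdst."""
        for i, p in enumerate(self.pdst):
            if p == pdst:
                self.pdst[i] = None
```

If a register has been renamed again before the older instruction retires, the match fails and the newer mapping is left untouched.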

ReOrder Buffer (ROB)

Structure: circular buffer of 24 entries (PDsts), each one holding all relevant data for a single instruction:
–Op Code and Op Type
–LDst
–PSrc1: pointer, value and status
–PSrc2: pointer, value and status (if needed)
–Immediate (if needed)
–Writeback Result
–Dispatched, Valid bits
→ Large register file: 24 entries * 71 bits/entry = 1704 bits

Function:
–Hold all instructions currently in the execution window (issue → retirement).
–Determine data-readiness of each instruction by CAM-matching WB buses vs. the entry’s PSrc pointers.
–Dispatch data-ready instructions out-of-order to the appropriate RS (to be explained…☺).
–Retire PDsts of executed instructions in-order to the Real Register File (RRF).
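The wakeup and retirement behaviour described above can be sketched in a few lines of Python. This is a simplified model under assumptions: the `Entry` and `ROB` names are invented, sources already available at issue are simply left out of `psrcs`, and a loop over the entries replaces the CAM.

```python
from collections import deque


class Entry:
    """One ROB entry (simplified; field names are assumptions)."""

    def __init__(self, pdst, ldst, psrcs=()):
        self.pdst = pdst
        self.ldst = ldst
        self.src = {p: None for p in psrcs}  # PSrc pointer -> captured value
        self.result = None
        self.done = False

    def data_ready(self):
        return all(v is not None for v in self.src.values())


class ROB:
    SIZE = 24

    def __init__(self):
        self.entries = deque()  # left end is the oldest instruction
        self.rrf = {}           # logical register -> committed value

    def issue(self, entry):
        """Add an instruction in program order; False if the ROB is full."""
        if len(self.entries) >= self.SIZE:
            return False
        self.entries.append(entry)
        return True

    def writeback(self, pdst, value):
        """Match a WB bus against every entry's PSrc pointers and PDst."""
        for e in self.entries:
            if pdst in e.src:
                e.src[pdst] = value               # wake up a waiting source
            if e.pdst == pdst:
                e.result, e.done = value, True    # record the result

    def retire(self):
        """Retire completed instructions strictly from the oldest end."""
        retired = []
        while self.entries and self.entries[0].done:
            e = self.entries.popleft()
            self.rrf[e.ldst] = e.result
            retired.append(e.pdst)
        return retired
```

A younger instruction that finishes first stays in the buffer until every older entry has written back, which is what keeps the RRF update order equal to program order.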

Dispatch Algorithm:
–3 independent iterators scan the ROB from tail to head:
  BranchRS Iterator: searches for the oldest branch instruction yet to be dispatched.
  MemRS Iterator: searches for the oldest memory instruction yet to be dispatched.
  RegOpRS Iterator: searches for the oldest data-ready non-branch/memory instruction yet to be dispatched.
–The iterators are independent and do not conflict → no need for arbitration!
–Problem: unbalanced dispatching can clog one ALU and starve the other, leading to diminished performance.

Dispatch Algorithm (cont.):
–Solution: the ROB maintains a load-balance counter, ranging from -4 to 3:
  incremented upon branch issue and memory dispatch
  decremented upon memory issue and branch dispatch
–The RegOpRS Iterator dispatches data-ready instructions according to the following rules:
  RS0, RS1 available: LoadBalancer < 0 → dispatch to RS0; continue. LoadBalancer > -1 → dispatch to RS1; continue.
  RS0 available, RS1 busy: LoadBalancer < 0 → dispatch to RS0; return to Tail. LoadBalancer > -1 → if no branch ops are ready, dispatch to RS0; return to Tail.
  RS0 busy, RS1 available: if no memory ops are ready, dispatch to RS1; return to Tail.
  RS0, RS1 busy: return to Tail.
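The rules above can be written as a small decision function. The Python below is a sketch, not the project's implementation. It makes three labeled assumptions: the "RS0 busy, RS1 available" rule applies for any counter value, a blocked conditional dispatch still returns to Tail, and the counter saturates at its -4 and 3 limits. All function and event names are hypothetical.

```python
def update_balance(lb, event):
    """Update the load-balance counter; saturation at -4..3 is an assumption."""
    delta = {
        "branch_issue": +1,
        "mem_dispatch": +1,
        "mem_issue": -1,
        "branch_dispatch": -1,
    }[event]
    return max(-4, min(3, lb + delta))


def regop_dispatch(lb, rs0_free, rs1_free, branch_ready=False, mem_ready=False):
    """Return (target, next_step) for one data-ready non-branch/memory op.

    target is "RS0", "RS1" or None; next_step is "continue" or "tail".
    """
    if rs0_free and rs1_free:
        return ("RS0" if lb < 0 else "RS1", "continue")
    if rs0_free:                      # RS1 busy
        if lb < 0 or not branch_ready:
            return ("RS0", "tail")
        return (None, "tail")         # assumption: leave RS0 to the branch op
    if rs1_free:                      # RS0 busy
        if not mem_ready:             # assumption: same rule for any lb
            return ("RS1", "tail")
        return (None, "tail")
    return (None, "tail")             # both reservation stations busy
```

For example, `regop_dispatch(0, True, False, branch_ready=True)` dispatches nothing, because the only free station is RS0 and a ready branch is expected to claim it.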

Reservation Stations (RS)

[Figure: reservation station layout. RS0 holds a Branch Op entry and a Non Branch/Mem Op entry; RS1 holds a Mem Op entry and a Non Branch/Mem Op entry. Entry fields: V, Op, PDst, Src1, Src2, Imm]
Function:
–Buffer data-ready instructions for both ALUs, so as to minimize (or even eliminate!) execution idle time
–Sort instructions according to type/priority for each ALU:
  ALU0: branch ops vs. non-branch/memory ops
  ALU1: memory ops vs. non-branch/memory ops

ALUs

Function:
–Continuously execute instructions from the respective RS and drive their associated PDsts and results on the WB buses.
–Prioritize instructions:
  branch ops have precedence over other ops on ALU0 → result (branch decision) is driven to IFU
  memory ops have precedence over other ops on ALU1 → result (address) is driven to DCache

Data Cache

Function:
–Read/write memory operands in-order (according to the address and RdWr# signal from ALU1) and drive their PDsts (and results, for LW ops) on the WB buses.
–Queue up to 4 pending memory access instructions, thus allowing ALU1 to execute successive LW/SW ops without stalling.

Timeline
ASAP (bureaucracy…)
–Install Balsa 3.3, including netlist technology, on Lion server
–Increase Linux user quotas
–Install Exceed terminal server in lab so that we can remotely connect to Lion server
4/3/04 (final report, first semester):
–Asynchronous simulation of a complete data-path flow through the pipeline:
  mov R0, 1
  add R0, 1

Balsa Initial Flow

Balsa Flow