ARMOR Asynchronous RISC Microprocessor. Technion - Israel Institute of Technology, High Speed Digital Systems Laboratory, Faculty of Electrical Engineering. Submitted by: Tziki Oz-Sinay,

Presentation transcript:

ARMOR Asynchronous RISC Microprocessor
Final Presentation
Technion - Israel Institute of Technology, Faculty of Electrical Engineering, High Speed Digital Systems Laboratory
Submitted by: Tziki Oz-Sinay, Ori Lempel
Supervised by: Rony Mitleman

Milestones Reached
First Semester
–Thorough ramp-up made on asynchronous circuit design: algorithms and methodologies.
–Development platform selected (Balsa over Petrify).
–ARMOR architecture defined.
–Detailed Micro-Architecture Specification written: functional block partition and datapath interface defined. Asynchronous handshaking protocol defined.
–Detailed asynchronous pseudo-code implementation written.
Second Semester
–Balsa code written and compiled for each functional ARMOR unit.
–Light validation started on all units, test harness piloted on IFU.
–Attempts made at full-chip integration.

Main Problems Encountered
Straightforward implementation of the asynchronous pseudo-code in Balsa does not always work:
–Deadlocks originating in forced deviations from the pseudo-code.
–False read-after-write hazards.
Arbitration in Balsa is limited to two inputs, which necessitates large arbitration trees.
The Balsa validation environment does not allow the synchronization of events, so customized test harnesses had to be built.
The Balsa simulator does not filter main channels, so its output is almost impossible to follow. This was a huge obstacle in full-chip integration.

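Deadlock Illustration
Function2 uses the channels in the same order as Function1:
Function1: A <- Reg1; B <- Reg2
Function2: A -> Reg1; B -> Reg2

Deadlock Illustration (continued)
Function2 uses the channels in the opposite order to Function1:
Function1: A <- Reg1; B <- Reg2
Function2: B -> Reg2; A -> Reg1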
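The illustration can be reproduced with a small model. This is only a sketch, assuming that `<-` sends on a channel, `->` receives from it, `;` is sequential composition, and every transfer is a blocking handshake that completes only when both sides have reached the same channel. The function name `runs_to_completion` is hypothetical and not part of the project.

```python
def runs_to_completion(sender_order, receiver_order):
    """Return True if both sequential processes finish, False on deadlock.

    Assumption: each process performs its channel operations strictly in
    order, and a transfer completes only when sender and receiver are both
    waiting on the same channel (rendezvous).
    """
    i = j = 0
    while i < len(sender_order) and j < len(receiver_order):
        if sender_order[i] != receiver_order[j]:
            # Each side is blocked on a channel the other side has not
            # reached yet, and nobody else can unblock them.
            return False
        i += 1
        j += 1
    # A leftover operation on either side would also block forever.
    return i == len(sender_order) and j == len(receiver_order)


same_order = runs_to_completion(["A", "B"], ["A", "B"])
swapped_order = runs_to_completion(["A", "B"], ["B", "A"])
print(same_order, swapped_order)
```

With the channels used in the same order both processes complete. With the order swapped, Function1 waits on A while Function2 waits on B, and neither can proceed.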

UArch Deadlock Example
PSrc Data Readiness Determination

PSrc Data Readiness Determination UARCH REQUIREMENT
Out-of-order execution implies that not all instructions will be data-ready when entering the execution window (i.e. having been renamed and registered in the ROB).
PSrc data readiness is determined by reading its corresponding RAT entry ready-bit during the renaming stage:
–If the ready bit is set, the PSrc data is guaranteed to be ready for execution and its valid value can be safely copied to the relevant ROB entry field.
–If the ready bit is clear, the PSrc data is NOT yet ready and its value can NOT be copied until a WB CAM-match occurs vs. that same PSrc.

PSrc Data Readiness Determination UARCH REQUIREMENT
Consider the following sequence of events:
–A certain PSrc, whose data is NOT ready, is read from the RAT.
–After that PSrc's (clear!) ready-bit is read, its PDst is written back to the ROB, having completed execution, thus validating its data.
–Only now is the PSrc read from the RAT registered (together with its corresponding instruction) in the ROB, and because its ready bit is clear, its data is deemed not ready.
–The instruction registered in that ROB entry will forever wait for a WB CAM-match vs. the said PSrc, which has already occurred.
This results in a machine deadlock!
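The sequence above is a lost wake-up, and it can be replayed step by step. The sketch below is an illustrative model only: the `Rob` class, its dictionary fields and the tag value 5 are assumptions made for the example, not the ARMOR data structures.

```python
class Rob:
    """Toy ROB: entries wait for a write-back (WB) CAM-match on their PSrc."""

    def __init__(self):
        self.entries = []

    def register(self, psrc, ready):
        entry = {"psrc": psrc, "ready": ready}
        self.entries.append(entry)
        return entry

    def writeback(self, pdst):
        # CAM-match: only entries already registered can be woken up.
        for entry in self.entries:
            if entry["psrc"] == pdst:
                entry["ready"] = True


rat_ready = {5: False}   # hypothetical physical register 5, not ready yet
rob = Rob()

stale_bit = rat_ready[5]             # 1. renaming reads the clear ready-bit
rat_ready[5] = True                  # 2. the producer writes back...
rob.writeback(5)                     #    ...but no entry is registered yet
entry = rob.register(5, stale_bit)   # 3. registered with the stale bit

# No further write-back for tag 5 will ever arrive, so the entry stays
# not-ready indefinitely.
stuck = not entry["ready"]
print(stuck)
```

The write-back in step 2 finds no matching entry, so the only wake-up event for the instruction is consumed before the instruction is there to see it.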

PSrc Data Readiness Determination SYNCHRONOUS LOGIC IMPLEMENTATION
For simplicity we assume all register files are read during the high clock phase and written during the low clock phase.
WB data always arrives within the same timing window, so an override datapath can be designed that meets the ROB array-write timing requirement.
[Figure: pipeline stages Fetch, Decode, Rename, Schedule, Dispatch, Execute, Memory, WB, Retire, with a path labeled "WB data forwarding"]

PSrc Data Readiness Determination SYNCHRONOUS LOGIC IMPLEMENTATION
[Figure: RAT array and ROB array read/write logic. Signal labels: clk, clk#, LSrc[2:0], rden[7:0], rddec[7:0], wren[31:0], wrdec[31:0], ready, PDst[4:0], WBPDst[4:0], wrdata]
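One way to picture the synchronous override is as a comparator that bypasses the RAT ready-bit when the tag being written back in the same cycle equals the PSrc being renamed. The sketch below assumes exactly that. The function `effective_ready` and its parameters are hypothetical, and the slides do not specify the actual override logic.

```python
def effective_ready(rat_ready_bit, psrc, wb_valid, wb_pdst):
    """Ready-bit written into the ROB entry in a clocked design.

    Assumption: because WB arrives in a known timing window, a same-cycle
    write-back of the matching tag can override a clear RAT ready-bit.
    """
    override = wb_valid and wb_pdst == psrc
    return rat_ready_bit or override


# Ready-bit clear in the RAT, but tag 5 writes back in the same cycle.
print(effective_ready(False, psrc=5, wb_valid=True, wb_pdst=5))
# Ready-bit clear and a different tag writes back: still not ready.
print(effective_ready(False, psrc=5, wb_valid=True, wb_pdst=9))
```

The override is only sound because the clock bounds when the write-back can arrive.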

PSrc Data Readiness Determination ASYNCHRONOUS LOGIC IMPLEMENTATION
In an asynchronous pipeline WB can occur at any given time and cannot be contained within a predetermined timing window, so the previous override approach will not work!
Instead, after completing an instruction's registration in the ROB, another check is performed on the data readiness of each PSrc whose RAT data-ready bit was clear during the renaming process.

PSrc Data Readiness Determination ASYNCHRONOUS LOGIC IMPLEMENTATION
[Figure: issue handshake between the RAT and the ROB head pointer. Labels: Issue Req; OpCode, LDst, Imm, PSrc1, PSrc1Ready, PSrc2, PSrc2Ready; Issue Ack]

PSrc Data Readiness Determination ASYNCHRONOUS LOGIC IMPLEMENTATION
[Figure: ROB entry with PSrc1 and PSrc2 fields and their Ready, Valid and Result indications]
PSrc1 ready-bit was set and PSrc2 ready-bit was clear during renaming.

PSrc Data Readiness Determination ASYNCHRONOUS LOGIC IMPLEMENTATION
[Figure: ROB entry with PSrc1, PSrc2 and PSrc1Value fields and their Ready, Valid and Result indications]
PSrc1 result is valid and thus copied to the PSrc1Value field. At the same time, PSrc2 writes back.

PSrc Data Readiness Determination ASYNCHRONOUS LOGIC IMPLEMENTATION
[Figure: ROB entry with PSrc1, PSrc2, PSrc2Value and PSrc1Value fields and their Ready, Valid and Result indications]
A final check made on the status of PSrc2 deems it valid, thus its result is copied into the PSrc2Value field and its ready-bit is set.
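The final check can be sketched as a second look at the producer's state after the entry has been registered. The model below is a simplified assumption: `issue`, `results` and the dictionary layout are hypothetical names, and the real design would also have to make the re-check atomic with respect to the CAM-match processes.

```python
def issue(rob_entries, results, psrc, ready_at_rename):
    """Register an entry, then re-check any PSrc that was not ready.

    `results` maps a physical tag to its value once the producer has
    written back (an assumption standing in for the ROB result fields).
    """
    entry = {"psrc": psrc, "ready": ready_at_rename, "value": None}
    if ready_at_rename:
        entry["value"] = results[psrc]
    rob_entries.append(entry)
    # Final check: a write-back may have slipped in between the RAT read
    # and the registration above.
    if not entry["ready"] and psrc in results:
        entry["value"] = results[psrc]
        entry["ready"] = True
    return entry


rob_entries = []
results = {}
ready_at_rename = 7 in results      # ready-bit read as clear during renaming
results[7] = 42                     # producer of tag 7 writes back meanwhile
entry = issue(rob_entries, results, 7, ready_at_rename)
print(entry["ready"], entry["value"])
```

A write-back that arrives after registration is still caught by the normal CAM-match, so between the two mechanisms no wake-up is lost in this model.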

False Read After Write Hazard and Resulting Arbitration Overhead
ROB Interface Arbitration

ROB Interface Arbitration UARCH REQUIREMENT
[Figure: block diagram of the execution engine. Blocks: RAT, ROB, RRF, RS0, RS1, ALU0, ALU1, DATA CACHE. Path labels: Inst from ID, BranchDecision to IFU, branches, non-branch inst, mem inst, non-mem inst]

ROB Interface Arbitration UARCH REQUIREMENT
The ROB constitutes the heart of the out-of-order execution engine. Its functions include:
–Holding all instructions currently in the execution window (from issue to retirement).
–Determining data-readiness of each instruction by CAM-matching the 3 WB busses vs. each entry's PSrc pointers.
–Dispatching data-ready instructions out-of-order to the appropriate RS.
–Retiring PDsts of executed instructions in-order to the Real Register File (RRF).
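The CAM-match step in the list above can be sketched in software as a comparison of every entry's source tags against the tags currently on the write-back busses. The field names (`psrc1`, `psrc1_rdy`, and so on) and the function `cam_match` are illustrative assumptions, not the ARMOR signal names.

```python
def cam_match(entries, wb_tags):
    """Mark sources ready when their tag appears on any WB bus.

    Returns the indices of entries whose two sources are both ready.
    In hardware all comparisons happen in parallel; the loops here are
    only a sequential stand-in.
    """
    for entry in entries:
        for src in ("psrc1", "psrc2"):
            if entry[src] in wb_tags:
                entry[src + "_rdy"] = True
    return [i for i, e in enumerate(entries)
            if e["psrc1_rdy"] and e["psrc2_rdy"]]


entries = [
    {"psrc1": 3, "psrc1_rdy": True,  "psrc2": 4, "psrc2_rdy": False},
    {"psrc1": 4, "psrc1_rdy": False, "psrc2": 6, "psrc2_rdy": False},
]
# Up to three tags can write back at once (two ALUs and the data cache).
data_ready = cam_match(entries, wb_tags={4, 9})
print(data_ready)
```

Here tag 4 wakes one source in each entry, but only the first entry becomes fully data-ready.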

ROB Interface Arbitration ASYNCHRONOUS LOGIC DESIGN
Our initial logic design comprised 8 independent concurrent asynchronous logical processes:
–Issue process: requests a new instruction from the RAT and registers it in the ROB head entry.
–Three RS dispatch processes scanning the ROB from tail to head, scheduling and dispatching instructions for execution:
 Branch Iterator: searches for the oldest branch instruction yet to be dispatched.
 Memory Iterator: searches for the oldest memory instruction yet to be dispatched.
 RegOp Iterator: searches for the oldest data-ready, non-branch/memory instruction yet to be dispatched.

ROB Interface Arbitration ASYNCHRONOUS LOGIC DESIGN (continued)
–Three CAM-match processes per entry, comparing the entry's PDst and PSrc pointers vs. the 3 PDsts writing back from the ALUs and DCache.
–Retirement process: serially copies the result of completed instructions from the ROB's tail to the RRF.
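The three dispatch iterators share one pattern: walk the circular ROB from tail (oldest) towards head and stop at the first undispatched entry of the wanted class. The sketch below assumes that `head` points one slot past the youngest entry and that entries carry `op`, `disp` and `ready` fields. These names, and the function `oldest_match`, are hypothetical.

```python
def oldest_match(rob, tail, head, wanted):
    """Return the index of the oldest undispatched entry accepted by `wanted`."""
    i = tail
    while i != head:
        entry = rob[i]
        if entry is not None and not entry["disp"] and wanted(entry):
            return i
        i = (i + 1) % len(rob)   # the ROB is a circular buffer
    return None


def is_branch(e):
    return e["op"] == "BEZ"

def is_memory(e):
    return e["op"] in ("LW", "SW")

def is_ready_regop(e):
    return e["op"] not in ("BEZ", "LW", "SW") and e["ready"]


rob = [None] * 8
rob[6] = {"op": "ADD", "disp": False, "ready": False}   # oldest
rob[7] = {"op": "LW",  "disp": False, "ready": True}
rob[0] = {"op": "BEZ", "disp": False, "ready": True}
rob[1] = {"op": "ADD", "disp": False, "ready": True}    # youngest
tail, head = 6, 2

picks = (oldest_match(rob, tail, head, is_branch),
         oldest_match(rob, tail, head, is_memory),
         oldest_match(rob, tail, head, is_ready_regop))
print(picks)
```

The RegOp iterator skips the older ADD at index 6 because it is not data-ready, which is the out-of-order dispatch the slide describes.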

ROB Interface Arbitration ASYNCHRONOUS LOGIC DESIGN
These processes were conceived and designed so as to eliminate any potential read-write or write-write logical hazards:
–Issue process: writes to entries within the ROB head-to-tail region, reads from valid entries only.
–Branch RS iterator: reads/writes from/to invalid entries within the ROB tail-to-head region whose op-code is BEZ and which have yet to be dispatched.
–Memory RS iterator: reads/writes from/to invalid entries within the ROB tail-to-head region whose op-code is LW/SW and which have yet to be dispatched.
–RegOp RS iterator: reads/writes from/to invalid entries within the ROB tail-to-head region whose op-code is NOT BEZ/LW/SW and which have yet to be dispatched.

ROB Interface Arbitration ASYNCHRONOUS LOGIC DESIGN (continued)
–CAM-match processes: write to DIFFERENT, invalid entries within the ROB tail-to-head region which have already been dispatched.
–Retirement process: reads from valid entries within the ROB tail-to-head region.

ROB Interface Arbitration ASYNCHRONOUS LOGIC DESIGN
[Figure: ROB array layout. Entry fields: Valid, OpCode, LDst, Imm/Res, Disp, PSrc1, PSrc1Rdy, PSrc1Val, PSrc2, PSrc2Rdy, PSrc2Val, PDst. Bit ranges shown: [4:0], [3:0], [15:0], [4:0], [2:0]. Entries PDst0 to PDst5 and PDst20 to PDst23, with head and tail pointers and example op-codes BEZ, LW/SW, MOV, ADD. Processes acting on the array: Issue Process, Branch RS Iterator, Memory RS Iterator, RegOp Iterator, CAM-Match Processes, Retirement Process]

Arbitrate Tree
Problem: a function may have more than one input, and each input can arrive at any time. Two inputs might arrive at the same time, which Balsa does not allow!
Solution: we arbitrate the inputs.
Another problem: Balsa cannot arbitrate between more than two inputs.
Solution: we build an arbitrate tree.
How does it work?

Arbitrate Tree
The base element:
[Figure: the base element of the tree, with labels 1 and 0]

Arbitrate tree
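The cost of building the tree from two-input elements can be estimated with a short sketch. It pairs inputs level by level, under the assumption that an unpaired input is simply passed up to the next level. The function `build_tree` is a hypothetical helper and does not model Balsa's actual arbiter component.

```python
def build_tree(inputs):
    """Combine a non-empty list of inputs using two-input arbiters only.

    Returns (root, arbiter_count, depth), where each arbiter is shown as
    a nested pair of the two things it arbitrates between.
    """
    level = list(inputs)
    arbiters = 0
    depth = 0
    while len(level) > 1:
        next_level = []
        for k in range(0, len(level) - 1, 2):
            next_level.append((level[k], level[k + 1]))  # one 2-input arbiter
            arbiters += 1
        if len(level) % 2:
            next_level.append(level[-1])                 # odd one passes up
        level = next_level
        depth += 1
    return level[0], arbiters, depth


root, arbiters, depth = build_tree(["in0", "in1", "in2", "in3",
                                    "in4", "in5", "in6", "in7"])
print(arbiters, depth)
```

For n inputs this pairing uses n - 1 two-input arbiters, and a request passes through at most ceil(log2(n)) of them, which is the overhead referred to as "large arbitration trees".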

Balsa Arch
[Figure: Balsa architecture, annotated with the following steps]
Check the tail position.
Ask to read one of the entries.
Arbitrate.
Every entry checks whether the call is addressed to it.
Identify who the sender is and what the request is.
We arbitrate the results and send them to all the iterators.
We check where to send the command in case it is ready to go, and send it!
Update the entry and move on!

Test Harnesses and Dynamic Simulations

Test Harness
Problem:
–The Balsa-mgr supplies a test environment which does not enable synchronization of the different inputs, so we cannot control the sequence of events. For example, input A logically cannot come after input B. We know this, but the Balsa environment does not, and we cannot enforce it.
–As a result the test won't work properly.
Solution:
–We had to build a test harness, so that we can control the sequence of events.

IFU Test Harness
What are we going to see?

IFU Test Harness
[Figure: simulation trace, annotated with the following steps]
IFU sends the command cache the first address: 0.
IFU sends REQUEST to the command cache.
Command cache responds with "ack" and receives the data.
IFU sends a request, signaling that it is ready to receive data.
Command cache sends data with an "ack" signal.
IFU checks the data and (in this case) sends the data to the ID.
Validation signal, external to the harness.
The same scenario repeats from the next address.

ALU Test Example
[Figure: simulation trace, annotated with the following steps]
RS sends data to be executed; ALU receives it.
ALU "calls" the relevant PDst.
ALU calculates the result and sends it to the ROB: 5+7=12!
Input: {ADD,3,7,5,0}
Command: add, PDst: 3, Src1: 7, Src2: 5, Imm: 0
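The test vector can be checked with a few lines of code. The sketch assumes the tuple layout given on the slide (Command, PDst, Src1, Src2, Imm) and models only the ADD command. The function `alu_execute` is hypothetical, and what the ALU does with Imm is not stated, so it is ignored here.

```python
def alu_execute(command):
    """Decode {Command, PDst, Src1, Src2, Imm} and return (PDst, result)."""
    op, pdst, src1, src2, imm = command
    if op != "ADD":
        raise ValueError("only ADD is modeled in this sketch")
    # Assumption: Imm is unused for a register-register ADD.
    return pdst, src1 + src2


pdst, result = alu_execute(("ADD", 3, 7, 5, 0))
print(pdst, result)
```

The pair returned corresponds to what the ALU sends to the ROB: the destination tag 3 and the value 12.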

Conclusion and Future Plans

Conclusions
Asynchronous Circuit Design
–Theoretical advantages of asynchronous circuits (reduced power, clock skew elimination, average-case performance) are negated by the huge overhead caused by handshake logic.
–Large scale systems are almost impossible to implement on standard (synchronous) development platforms (FPGA, etc.).
Project Concept
–Project scope and definition were overly ambitious: OOO chip design, implementation, validation and synthesis.
–The Balsa platform was very disappointing in terms of design, simulation and synthesis capabilities, as well as support options (ManU junior faculty only…).
–The main line of the project should have been performed on platforms that are well known in the lab (VHDL!).

Future Plans
–Balsa code documentation and packaging.
–Completion of light validation on all ARMOR logical units via test harnesses for checking/simulating basic functionality.
–Project documentation completion.