ARMOR Asynchronous RISC Microprocessor. Technion - Israel Institute of Technology, High-Speed Digital Systems Laboratory, Faculty of Electrical Engineering. Submitted by: Tziki Oz-Sinay,


1

2 ARMOR Asynchronous RISC Microprocessor
Technion - Israel Institute of Technology, High-Speed Digital Systems Laboratory, Faculty of Electrical Engineering
Submitted by: Tziki Oz-Sinay, Ori Lempel
Supervised by: Rony Mitleman
Final Presentation

3 Milestones Reached
First Semester
– Thorough ramp-up on asynchronous circuit design: algorithms and methodologies.
– Development platform selected (Balsa over Petrify).
– ARMOR architecture defined.
– Detailed micro-architecture specification written: functional block partition, datapath interface and asynchronous handshaking protocol defined.
– Detailed asynchronous pseudo-code implementation written.
Second Semester
– Balsa code written and compiled for each functional ARMOR unit.
– Light validation started on all units; test harness piloted on the IFU.
– Attempts made at full-chip integration.

4 Main Problems Encountered
Straightforward implementation of the asynchronous pseudo-code in Balsa does not always work:
– Deadlocks originating in forced deviations from the pseudo-code.
– False read-after-write hazards.
Arbitration in Balsa is limited to two inputs, which necessitates large arbitration trees.
The Balsa validation environment does not allow events to be synchronized, so customized test harnesses had to be built.
The Balsa simulator does not filter main channels, so its output is almost impossible to follow: a huge obstacle to full-chip integration.

5 Deadlock Illustration
Function1: A <- Reg1; B <- Reg2
Function2: A -> Reg1; B -> Reg2

6 Deadlock Illustration
Function1: A <- Reg1; B <- Reg2
Function2: B -> Reg2; A -> Reg1
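The ordering problem on the two slides above can be sketched in a few lines of Python. This is only an illustrative model, not Balsa code: it assumes that Reg1 and Reg2 behave as blocking rendezvous channels and that each function performs its two handshakes strictly in sequence. The helper name `run_rendezvous` is hypothetical.

```python
def run_rendezvous(seq_a, seq_b):
    """Step two sequential processes that talk over blocking channels.

    Each sequence lists the channels a process handshakes on, in program
    order. A handshake completes only when both processes are waiting on
    the same channel. Returns (completed_channels, deadlocked).
    """
    i = j = 0
    completed = []
    while i < len(seq_a) and j < len(seq_b):
        if seq_a[i] != seq_b[j]:
            # Each side blocks on a channel the other is not offering,
            # so neither can ever make progress.
            return completed, True
        completed.append(seq_a[i])
        i += 1
        j += 1
    # If one side finished early, the other waits forever for a partner.
    deadlocked = i < len(seq_a) or j < len(seq_b)
    return completed, deadlocked


function1 = ["Reg1", "Reg2"]
same_order = run_rendezvous(function1, ["Reg1", "Reg2"])
swapped = run_rendezvous(function1, ["Reg2", "Reg1"])
print(same_order)  # (['Reg1', 'Reg2'], False)
print(swapped)     # ([], True)
```

With matching order both transfers complete; swapping the order in Function2 leaves each side waiting on a different channel, which is the deadlock the slide illustrates.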

7 UArch Deadlock Example PSrc Data Readiness Determination

8 PSrc Data Readiness Determination: UARCH REQUIREMENT
Out-of-order execution implies that not all instructions are data-ready when they enter the execution window (i.e. once they have been renamed and registered in the ROB).
PSrc data readiness is determined by reading the ready bit of the corresponding RAT entry during the renaming stage:
– If the ready bit is set, the PSrc data is guaranteed to be ready for execution and its valid value can safely be copied to the relevant ROB entry field.
– If the ready bit is clear, the PSrc data is NOT yet ready and its value can NOT be copied until a WB CAM-match occurs against that same PSrc.

9 PSrc Data Readiness Determination: UARCH REQUIREMENT
Consider the following sequence of events:
– A certain PSrc, whose data is NOT ready, is read from the RAT.
– After that PSrc's (clear!) ready bit is read, the producing instruction completes execution and its PDst is written back to the ROB, thus validating the data.
– Only now is the PSrc that was read from the RAT registered in the ROB (together with its corresponding instruction), and because its ready bit was read as clear, its data is deemed not ready.
– The instruction registered in that ROB entry will wait forever for a WB CAM-match against that PSrc, which has already occurred.
This results in a machine deadlock!

10 PSrc Data Readiness Determination: SYNCHRONOUS LOGIC IMPLEMENTATION
For simplicity we assume all register files are read during the high clock phase and written during the low clock phase.
WB data always arrives within the same timing window, so an override datapath can be designed that meets the ROB array-write timing requirement.
[Figure: pipeline stages Fetch, Decode, Rename, Schedule, Dispatch, Execute, Memory, WB, Retire, with WB data forwarding.]

11 PSrc Data Readiness Determination: SYNCHRONOUS LOGIC IMPLEMENTATION
[Figure: RAT array and ROB array with read and write decoders. Signals: clk, clk#, LSrc[2:0], rden[7:0], rddec[7:0], wren[31:0], wrdec[31:0], ready, PDst[4:0], WBPDst[4:0], wrdata.]
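A minimal Python sketch of the override idea in the synchronous case follows. It assumes a single write-back port and models only the ready bit; the function name `psrc_ready_with_override` and its parameters are hypothetical and are not taken from the project's RTL.

```python
def psrc_ready_with_override(rat_ready, psrc, wb_valid, wb_pdst):
    """Ready bit written into the ROB entry in a clocked design.

    The RAT ready bit read during the high phase is overridden when a
    write-back in the same cycle targets the same physical register.
    Assumes one write-back port; a real design would compare every port.
    """
    same_cycle_hit = wb_valid and wb_pdst == psrc
    return bool(rat_ready or same_cycle_hit)


# The RAT still shows "not ready", but the producer writes back this cycle.
print(psrc_ready_with_override(False, psrc=5, wb_valid=True, wb_pdst=5))  # True
# A write-back to a different register does not help.
print(psrc_ready_with_override(False, psrc=5, wb_valid=True, wb_pdst=6))  # False
```

The override works here only because the clock bounds when the write-back can arrive relative to the ROB array write.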

12 PSrc Data Readiness Determination: ASYNCHRONOUS LOGIC IMPLEMENTATION
In an asynchronous pipeline WB can occur at any time and cannot be contained within a predetermined timing window, so the previous override approach will not work!
Instead, once an instruction's registration in the ROB is complete, a second data-readiness check is performed on each PSrc whose RAT ready bit was clear during renaming.

13 PSrc Data Readiness Determination: ASYNCHRONOUS LOGIC IMPLEMENTATION
[Figure: issue handshake between the RAT and the ROB head pointer. Labels: Issue Req; OpCode, LDst, Imm, PSrc1, PSrc1Ready, PSrc2, PSrc2Ready; Issue Ack.]

14 PSrc Data Readiness Determination: ASYNCHRONOUS LOGIC IMPLEMENTATION
PSrc1 ready-bit was set and PSrc2 ready-bit was clear during renaming.
[Figure: ROB entry fields PSrc1 and PSrc2 (ready bits 1 and 0), Result, Ready, Valid.]

15 PSrc Data Readiness Determination: ASYNCHRONOUS LOGIC IMPLEMENTATION
PSrc1 result is valid and is thus copied to the PSrc1Value field. At the same time, PSrc2 writes back.
[Figure: ROB entry fields PSrc1, PSrc2, PSrc1Value (ready bits 1 and 0), Result, Ready, Valid.]

16 PSrc Data Readiness Determination: ASYNCHRONOUS LOGIC IMPLEMENTATION
A final check on the status of PSrc2 deems it valid, so its result is copied into the PSrc2Value field and its ready-bit is set.
[Figure: ROB entry fields PSrc1, PSrc2, PSrc2Value, PSrc1Value (ready bits 1 and 1), Result, Ready, Valid.]
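The lost write-back and the post-registration re-check can be modelled with a small Python sketch. It is an assumption-laden illustration rather than the ARMOR implementation: the `Rob` class, the `completed` set standing in for the producers' Valid bits, and the `scenario` helper are all hypothetical.

```python
class Rob:
    """Toy ROB that tracks a single source operand per entry."""

    def __init__(self):
        self.completed = set()  # PDsts that have already written back
        self.entries = []       # instructions registered so far

    def write_back(self, pdst):
        self.completed.add(pdst)
        # The CAM-match can only wake entries that are already registered.
        for entry in self.entries:
            if entry["psrc"] == pdst:
                entry["ready"] = True

    def register(self, psrc, ready_at_rename, recheck):
        entry = {"psrc": psrc, "ready": ready_at_rename}
        self.entries.append(entry)
        if recheck and not entry["ready"]:
            # Second look after registration catches a write-back that
            # slipped in between the RAT read and the ROB write.
            entry["ready"] = psrc in self.completed
        return entry


def scenario(recheck):
    rob = Rob()
    ready_at_rename = 7 in rob.completed  # RAT read: PSrc 7 not ready
    rob.write_back(7)                     # producer writes back "too early"
    entry = rob.register(7, ready_at_rename, recheck)
    return entry["ready"]


print(scenario(recheck=False))  # False: the entry would wait forever
print(scenario(recheck=True))   # True: the re-check recovers the event
```

Without the re-check the entry never sees the write-back it is waiting for; with it, the missed event is recovered as soon as registration completes.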

17 False Read After Write Hazard and Resulting Arbitration Overhead ROB Interface Arbitration

18 ROB Interface Arbitration: UARCH REQUIREMENT
[Figure: block diagram of the ROB, RRF, RAT, RS0, RS1, ALU0, ALU1 and DATA CACHE. Labelled paths: Inst from ID, BranchDecision to IFU, branches, non-branch inst, mem inst, non-mem inst.]

19 ROB Interface Arbitration: UARCH REQUIREMENT
The ROB constitutes the heart of the out-of-order execution engine. Its functions include:
– Holding all instructions currently in the execution window (from issue to retirement).
– Determining the data-readiness of each instruction by CAM-matching the 3 WB busses against each entry's PSrc pointers.
– Dispatching data-ready instructions out of order to the appropriate RS.
– Retiring the PDsts of executed instructions in order to the Real Register File (RRF).
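The CAM-match step listed above might look roughly like the following Python sketch. The `RobEntry` fields and the `cam_match` function are hypothetical names chosen for illustration; the write-back PDsts are passed as a set holding whatever the (up to three) WB busses carry at that moment.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    psrc1: int
    psrc2: int
    psrc1_ready: bool = False
    psrc2_ready: bool = False


def cam_match(entries, wb_pdsts):
    """Compare every entry's PSrc pointers with the written-back PDsts.

    Sets the matching ready bits and returns the indices of entries
    whose two sources are both ready after the comparison.
    """
    data_ready = []
    for index, entry in enumerate(entries):
        if entry.psrc1 in wb_pdsts:
            entry.psrc1_ready = True
        if entry.psrc2 in wb_pdsts:
            entry.psrc2_ready = True
        if entry.psrc1_ready and entry.psrc2_ready:
            data_ready.append(index)
    return data_ready


rob = [
    RobEntry(psrc1=4, psrc2=9, psrc1_ready=True),
    RobEntry(psrc1=9, psrc2=2),
]
ready_now = cam_match(rob, wb_pdsts={9, 17})
print(ready_now)  # [0]
```

Entry 0 becomes data-ready because its missing source matched PDst 9; entry 1 still waits on PSrc 2.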

20 ROB Interface Arbitration: ASYNCHRONOUS LOGIC DESIGN
Our initial logic design comprised 8 independent, concurrent asynchronous logical processes:
– Issue process: requests a new instruction from the RAT and registers it in the ROB head entry.
– Three RS dispatch processes that scan the ROB from tail to head, scheduling and dispatching instructions for execution:
  Branch Iterator: searches for the oldest branch instruction yet to be dispatched.
  Memory Iterator: searches for the oldest memory instruction yet to be dispatched.
  RegOp Iterator: searches for the oldest data-ready, non-branch/memory instruction yet to be dispatched.

21 ROB Interface Arbitration: ASYNCHRONOUS LOGIC DESIGN
– Three CAM-match processes per entry, comparing the entry's PDst and PSrc pointers against the 3 PDsts written back from the ALUs and the DCache.
– Retirement process: serially copies the results of completed instructions from the ROB's tail to the RRF.
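As a rough illustration of the RegOp iterator described above, the Python sketch below scans a circular ROB from tail (oldest) toward head. The entry layout, the `regop_pick` name and the convention that head equal to tail means an empty ROB are assumptions made for the example, not details confirmed by the slides.

```python
NON_REGOP = {"BEZ", "LW", "SW"}  # opcodes handled by the other two iterators


def regop_pick(rob, tail, head):
    """Return the index of the oldest data-ready, undispatched RegOp.

    Scans from tail toward head, wrapping around the array.
    Returns None when no entry qualifies.
    """
    size = len(rob)
    index = tail
    while index != head:
        entry = rob[index]
        if (entry["op"] not in NON_REGOP
                and entry["ready"]
                and not entry["dispatched"]):
            return index
        index = (index + 1) % size
    return None


rob = [
    {"op": "MOV", "ready": True, "dispatched": False},   # index 0, youngest
    None,                                                # index 1, free slot
    {"op": "BEZ", "ready": True, "dispatched": False},   # index 2, oldest
    {"op": "ADD", "ready": False, "dispatched": False},  # index 3
]
picked = regop_pick(rob, tail=2, head=1)
print(picked)  # 0
```

The BEZ is skipped because it belongs to the branch iterator, the ADD because it is not data-ready, so the younger MOV is dispatched out of order.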

22 ROB Interface Arbitration: ASYNCHRONOUS LOGIC DESIGN
These processes were conceived and designed so as to eliminate any potential read-write or write-write logical hazards:
– Issue process: writes to entries within the ROB head-to-tail region, reads from valid entries only.
– Branch RS iterator: reads/writes from/to invalid entries within the ROB tail-to-head region, whose op-code is BEZ and which have yet to be dispatched.
– Memory RS iterator: reads/writes from/to invalid entries within the ROB tail-to-head region, whose op-code is LW/SW and which have yet to be dispatched.
– RegOp RS iterator: reads/writes from/to invalid entries within the ROB tail-to-head region, whose op-code is NOT BEZ/LW/SW and which have yet to be dispatched.

23 ROB Interface Arbitration: ASYNCHRONOUS LOGIC DESIGN
– CAM-match processes: write to DIFFERENT, invalid entries within the ROB tail-to-head region, which have already been dispatched.
– Retirement process: reads from valid entries within the ROB tail-to-head region.

24 ROB Interface Arbitration: ASYNCHRONOUS LOGIC DESIGN
[Figure: ROB array layout. Entry fields: OpCode, LDst, PDst, PSrc1, PSrc1Rdy, PSrc1Val, PSrc2, PSrc2Rdy, PSrc2Val, Imm/Res, Disp, Valid (bit ranges [4:0], [3:0], [15:0] and [2:0] are shown). Entries PDst0 to PDst5 and PDst20 to PDst23 lie between the head and tail pointers and hold BEZ, LW/SW, MOV and ADD instructions. Legend: Issue Process, Branch RS Iterator, Memory RS Iterator, RegOp Iterator, CAM-Match Processes, Retirement Process.]

25 Arbitrate Tree
Problem: a function may have more than one input, and each input can arrive at any time. Two inputs might arrive at the same time, and Balsa does not allow that!
Solution: we arbitrate the inputs.
Another problem: Balsa cannot arbitrate between more than two inputs.
Solution: we build an arbitration tree.
How does it work?

26 Arbitrate Tree
The base element:
[Figure: the two-input base element, inputs labelled 1 and 0.]

27 Arbitrate Tree
[Figure: two base elements combined, each with inputs labelled 1 and 0.]
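The way two-input elements compose into a tree can be sketched in Python as below. This is a behavioural model only: a real arbiter picks non-deterministically when both requests are pending, whereas the hypothetical `arbitrate2` here simply prefers its first input. The request names in the example are made up.

```python
def arbitrate2(first, second):
    """Two-input base element: forward whichever request is pending.

    Modelling assumption: when both are pending, the first input wins.
    """
    return first if first is not None else second


def arbitrate_tree(requests):
    """Reduce N request lines (N >= 1) with layers of two-input arbiters.

    Returns (winner, arbiters_used, depth).
    """
    level = list(requests)
    arbiters = 0
    depth = 0
    while len(level) > 1:
        next_level = []
        for k in range(0, len(level) - 1, 2):
            next_level.append(arbitrate2(level[k], level[k + 1]))
            arbiters += 1
        if len(level) % 2:
            next_level.append(level[-1])  # odd line passes to the next layer
        level = next_level
        depth += 1
    return level[0], arbiters, depth


lines = [None, None, "CAM1", None, None, "Retire", None, None]
print(arbitrate_tree(lines))  # ('CAM1', 7, 3)
```

For N inputs the tree needs N - 1 two-input arbiters, which is the overhead the slides refer to when many processes share one ROB interface.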

28 Balsa Arch
– Check the tail position.
– Ask to read one of the entries.
– Every entry checks whether the call is addressed to it, who the sender is, and what the request is.
– We arbitrate the results and send them to all the iterators.
– We check where to send the command in case it is ready to go, and send it!
– Update the entry and move on!
[Figure: tree of two-input arbiters, inputs labelled 1 and 0.]

29 Test Harnesses and Dynamic Simulations

30 Test Harness
Problem:
– Balsa-mgr supplies a test environment that does not allow the different inputs to be synchronized, so we cannot control the sequence of events. For example, input A logically cannot come after input B; we know this, but the Balsa environment does not, and we cannot tell it.
– As a result the test won't work properly.
Solution:
– We had to build a test harness so that we can control the sequence of events.

31 IFU Test Harness What are we going to see?

32 IFU Test Harness
– IFU sends the command cache the first address: 0.
– IFU sends REQUEST to the command cache.
– Command cache responds with "ack" and receives the data.
– IFU sends a request, signaling that it is ready to receive data.
– Command cache sends the data with an "ack" signal.
– IFU checks the data and (in this case) sends it to the ID.
– Validation signal, external to the harness.
– The same scenario repeats from the next address.
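The ordered exchange traced above can be replayed with a short Python sketch. It is a scripted model of the event sequence, not the Balsa harness itself; the `ifu_fetch` function, the tuple layout of the log and the sample cache contents are hypothetical.

```python
def ifu_fetch(cache, start, count):
    """Replay the IFU / command-cache handshake for consecutive addresses.

    Returns the ordered event log and the list of words sent on to the ID.
    """
    log = []
    to_id = []
    for addr in range(start, start + count):
        log.append(("IFU->cache", "req+addr", addr))
        log.append(("cache->IFU", "ack", addr))
        log.append(("IFU->cache", "req", addr))  # IFU is ready for data
        word = cache[addr]
        log.append(("cache->IFU", "ack+data", word))
        to_id.append(word)                       # forwarded to the ID
    return log, to_id


cache = {0: "ADD", 1: "LW"}  # made-up contents for the example
log, to_id = ifu_fetch(cache, start=0, count=2)
print(to_id)     # ['ADD', 'LW']
print(len(log))  # 8
```

Because the harness drives every step itself, the order of events is fixed by construction, which is what the stock test environment could not guarantee.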

33 ALU Test Example
– RS sends the data to be executed; the ALU receives it.
– ALU "calls" the relevant PDst.
– ALU calculates the result and sends it to the ROB: 5+7=12!
{ADD,3,7,5,0}: Command: add, PDst: 3, Src1: 7, Src2: 5, Imm: 0
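The test vector on this slide can be checked with a tiny Python model. The field order follows the slide ({Command, PDst, Src1, Src2, Imm}); the `alu_execute` function is a hypothetical stand-in for the ALU unit and implements only the ADD case shown.

```python
def alu_execute(command):
    """Execute one RS command and return (PDst, result) for the ROB."""
    op, pdst, src1, src2, imm = command
    if op == "ADD":
        result = src1 + src2
    else:
        # Only the operation shown on the slide is modelled here.
        raise ValueError(f"unsupported op: {op}")
    return pdst, result


writeback = alu_execute(("ADD", 3, 7, 5, 0))
print(writeback)  # (3, 12)
```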

34 Conclusion and Future Plans

35 Conclusions
Asynchronous Circuit Design
– The theoretical advantages of asynchronous circuits (reduced power, clock-skew elimination, average-case performance) are negated by the huge overhead of the handshake logic.
– Large-scale systems are almost impossible to implement on standard (synchronous) development platforms (FPGA, etc.).
Project Concept
– The project scope and definition were too ambitious: OOO chip design, implementation, validation and synthesis.
– The Balsa platform was very disappointing in terms of design, simulation and synthesis capabilities, as well as support options (ManU junior faculty only…).
– The main line of the project should have been carried out on platforms that are well known in the lab (VHDL!).

36 Future Plans
– Balsa code documentation and packaging.
– Completion of light validation on all ARMOR logical units via test harnesses that check/simulate basic functionality.
– Completion of project documentation.

