Published by Stella Carls. Modified about 1 year ago.

1
VLIW-DLX Simulator
Milos Becvar and Stanislav Kahanek
Faculty of Electrical Engineering, Czech Technical University in Prague

2
Presentation Outline
- Undergraduate Comp. Arch. Course
- Experience with WinDLX
- VLIW-DLX Simulation Model
- VLIW-DLX Simulator Features
- Example of Program
- Planned use in comp. arch. Course
- Future work

3
X36APS Course Content
- Undergraduate course intended for CS/CE students; a follow-up to the digital design and basic computer organization courses.
- 90 minutes lecture + 90 minutes lab/seminar per week students per semester.
- Introduction, computer performance – 1 lecture
- ISA – 2 lectures
- Pipelining of RISC – 2 lectures
- Memory subsystem – 2 lectures
- Intro. to ILP (superscalar, VLIWs) – 2 lectures
- Data parallelism, vector computers – 1 lecture
- Multiprocessors, coherency on SMP – 2 lectures

4
X36APS Seminars and Labs
The goal is to complement the lectures with hands-on experience of the presented topics:
1. Using visualization simulators (WinDLX, HDLDLX, SMPCache)
2. Running benchmarks and evaluating various trade-offs (SPEC benchmarks, Dinero)
3. "Table and chalk" seminars on topics where simulators are not available (cache design, vector computers)
Visualization simulators have proved to be the most effective way for students to interact with the topics.

5
Good Experience with WinDLX

6
WinDLX in X36APS Course
- Used to demonstrate the correspondence between C source code and the assembly program in the DLX ISA, and the importance of GCC optimization (1 week in class)
- Used to demonstrate loop unrolling to improve the speed of execution on DLX (1 week in class)
- Matrix multiplication program (3 weeks homework)

7
Matrix Multiplication Program
Write a program for the DLX processor that computes the product of two square matrices of dimension N. Optimize the program for the given processor parameters to achieve the lowest possible execution time.
Result Rating (for N=10), Clock Cycles vs. Points: > <68006
- Competition to achieve the best result limits cheating.
- To earn the full number of points, students have to unroll the right loop and schedule instructions to eliminate stalls.
- Register constraints are necessary to prohibit a brute-force approach to the solution (e.g. completely eliminating the inner loop by unrolling it 10 times).
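The unrolling idea can be sketched outside DLX assembly. The Python below is a hypothetical illustration, not the assignment's reference solution: the slide does not say which loop is the "right" one to unroll, so the inner loop is unrolled here purely as an example, and the factor 2 is chosen only because it divides N=10. The function names are made up for this sketch.

```python
def matmul_plain(a, b):
    """Reference triple loop for square matrices given as lists of lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unrolled2(a, b):
    """Same product with the inner loop unrolled by 2 (assumes even N)."""
    n = len(a)
    assert n % 2 == 0, "this sketch only handles even dimensions"
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Two independent accumulators: the two multiply-adds in the
            # loop body do not depend on each other, which is what gives
            # a scheduler room to fill stall cycles.
            acc0 = 0.0
            acc1 = 0.0
            for k in range(0, n, 2):
                acc0 += a[i][k] * b[k][j]
                acc1 += a[i][k + 1] * b[k + 1][j]
            c[i][j] = acc0 + acc1
    return c
```

Unrolling halves the number of loop-control operations per product term; on DLX the same transformation also costs extra registers, which is why the register constraint limits how far it can be pushed.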

8
VLIW-DLX Goals
- A tool similar to WinDLX to illustrate the basics of VLIWs.
- Show the relationship between VLIWs and scalar pipelines.
- Show the relationship between software elimination of hazards (inserting NOPs into the code) and the hardware solution (pipeline interlocks and stalls).
- Show that the speedup achievable by extending pipeline width is limited, and show the sources of these limitations.
- Demonstrate the efficiency of the software pipelining algorithm for VLIW and superscalar processors.

9
Requirements for VLIW-DLX Simulator
- Similar ISA to DLX
- Similar GUI/features to WinDLX GUI
- Visualization of the pipeline similar to WinDLX
- Must run in both Windows and Linux environments (hence it is written in Java)

10
VLIW-DLX Architecture

11
VLIW-DLX Features
- Currently no forwarding; all data transfers go through a unified register file (Int. and FP registers).
- RAW and WAW hazards are possible.
- Multiple write conflicts are possible (the later operation wins).
- A single branch is allowed per VLIW instruction, in pipeline slot 1.
- The VLIW instruction following a branch, in the delay slot, is always executed (the branch is executed in the ID stage).
- The number and type of pipeline slots can be easily modified in the simulator code.
- Operations are all DLX instructions except double-precision FP instructions and division; new operations can be added easily.

12
VLIW-DLX Instructions
bnez r3, loop   lf f3,0(r2)   sf -16(r2),f10   nop   multf f2,f1,f2 ;;
subi r3,r3,4   lf f5,4(r2)   sf -12(r2),f11   nop   multf f4,f1,f4 ;;
- DLX Instruction = VLIW-DLX Operation
- VLIW-DLX Instruction = Group of 5 DLX Instructions
- ;; = VLIW-DLX Instruction delimiter
- Pipeline 1 (Integer, Branch)
- Pipeline 2 (Integer, Load)
- Pipeline 3 (Load / Store)
- Pipeline 4 (Floating Point)
- Pipeline 5 (Multiplication)
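The slot-to-operation-type mapping above can be expressed as a small checker. This is a minimal sketch, not the simulator's actual code (which is in Java): the `OP_CLASS` table is a hypothetical, partial classification of DLX mnemonics, `check_instruction` is a made-up name, and the operations are passed as a list of five strings because the transcript does not show how they are separated in the source text.

```python
# Operation classes allowed in each pipeline slot, as listed on the slide.
SLOT_CLASSES = [
    {"integer", "branch"},  # Pipeline 1
    {"integer", "load"},    # Pipeline 2
    {"load", "store"},      # Pipeline 3
    {"fp"},                 # Pipeline 4
    {"mult"},               # Pipeline 5
]

# Hypothetical, partial mnemonic table (an assumption of this sketch).
OP_CLASS = {
    "add": "integer", "addi": "integer", "addui": "integer",
    "sub": "integer", "subi": "integer", "subui": "integer",
    "bnez": "branch", "beqz": "branch",
    "lw": "load", "lf": "load",
    "sw": "store", "sf": "store",
    "addf": "fp", "subf": "fp",
    "multf": "mult",
}


def check_instruction(ops):
    """Return a list of error strings for one VLIW instruction (empty if valid)."""
    if len(ops) != len(SLOT_CLASSES):
        return [f"expected {len(SLOT_CLASSES)} operations, got {len(ops)}"]
    errors = []
    for slot, op in enumerate(ops, start=1):
        mnemonic = op.split()[0]
        if mnemonic == "nop":
            continue  # an explicit nop is legal in every slot
        op_class = OP_CLASS.get(mnemonic)
        if op_class is None:
            errors.append(f"slot {slot}: unknown operation {mnemonic!r}")
        elif op_class not in SLOT_CLASSES[slot - 1]:
            errors.append(f"slot {slot}: {mnemonic!r} ({op_class}) not allowed")
    return errors
```

Both instructions shown on the slide pass this check, while moving `multf` into slot 1 is reported as an error.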

13
VLIW-DLX Instructions
- Simple HW-oriented representation of VLIW instructions
- Position in the instruction corresponds to the pipeline slot; the allowed operation type is checked by the compiler
- Explicit nops must be included in unused instruction slots (the bundle concept or instruction compression is not used)
- Exchange of values between two registers is possible in a single VLIW instruction without an intermediate storage register
Exchange of r1 and r2:
add r1,r2,r0   add r2,r1,r0   nop   nop   nop ;;
Full 5-slot NOP:
nop   nop   nop   nop   nop ;;
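The register exchange works because every operation in a VLIW instruction reads its operands before any operation writes its result. The sketch below models only that behaviour, for `add` operations; it is an illustration under stated assumptions, not the simulator's implementation. In particular, "later operation wins" is interpreted here as "the operation in the higher-numbered slot wins", which the slides do not spell out, and `execute_vliw` is a hypothetical name.

```python
def execute_vliw(regs, ops):
    """Execute one VLIW instruction made of add operations.

    regs: dict mapping register name to value (r0 is expected to hold 0).
    ops:  list of (dest, src1, src2) tuples, or None for a nop.
    Returns a new register dict; the input is not modified.
    """
    snapshot = dict(regs)  # all operands are read from the old state
    result = dict(regs)
    for op in ops:  # slot order: a later slot overwrites an earlier one
        if op is None:
            continue
        dest, src1, src2 = op
        result[dest] = snapshot[src1] + snapshot[src2]
    return result
```

With `r1 = 5` and `r2 = 7`, the instruction `add r1,r2,r0  add r2,r1,r0  nop nop nop` leaves `r1 = 7` and `r2 = 5`. A scalar pipeline executing the two adds one after the other would instead copy `r2` into both registers.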

14
VLIW-DLX Simulator Features
- Source code editor
- Register view
- Memory view
- Pipeline view

15
VLIW-DLX Simulator Features
- Same colors as WinDLX
- Shows which stage a given operation is in

16
VLIW-DLX Simulator Features
Help shows which operations are now allowed in a given pipeline slot.

17
VLIW-DLX Simulator Features
The simulator shows the register values read by a given operation in the ID stage. This helps to track RAW and WAW dependences.
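The dependences a student has to track by hand can be listed mechanically. The following is a simplified sketch: operations are hypothetical `(name, dest, sources)` tuples in program order, latencies and pipeline slots are ignored, and the scan for a given producer stops at the first operation that overwrites its destination.

```python
def find_dependences(ops):
    """List RAW and WAW dependences in a straight-line operation sequence.

    ops: list of (name, dest, sources); dest is None for operations
    that write no register (stores, branches).
    Returns tuples (kind, producer, consumer, register).
    """
    deps = []
    for i, (name_i, dest_i, _sources_i) in enumerate(ops):
        if dest_i is None:
            continue
        for name_j, dest_j, sources_j in ops[i + 1:]:
            if dest_i in sources_j:
                deps.append(("RAW", name_i, name_j, dest_i))
            if dest_j == dest_i:
                deps.append(("WAW", name_i, name_j, dest_i))
                break  # later readers depend on the new writer instead
    return deps
```

For the sequence `lf f1,0(r1)`, `addf f1,f1,f0`, `sf 0(r1),f1`, this reports a RAW and a WAW dependence from `lf` to `addf` on `f1`, and a RAW dependence from `addf` to `sf` on `f1`.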

18
VLIW-DLX Demonstration Program
XPK (X Plus K):
float x[100], k;
for (i=0; i<100; i++) x[i]+=k;

19
XPK on Scalar DLX (main loop)
loop:
lf f1, 0(r1)
addf f1, f1, f0
addui r1,r1,4
subui r3,r3,1
sf -4(r1),f1
bnez r3, loop

20
XPK on WinDLX
Same latency as VLIW-DLX, no forwarding; 2 multicycle FP adders are used to simulate a single pipelined FP adder.
Columns: Trivial, 2x unrolled, 4x unrolled, Software Pipelined
Rows: Instruction Count, Cycles, RAW stalls, Control stalls, Structural stalls, IPC, CPI
IPC: 0.55, 0.69, 0.88
CPI: 1.83, 1.44, 1.13, 1.14
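IPC is the reciprocal of CPI, so the two rows can be cross-checked. The snippet below is a small sanity check, assuming the four CPI values belong to the four columns in the order listed; the dictionary keys are just labels for this sketch.

```python
# CPI values as they appear on the slide, assumed to be in column order.
cpi = {
    "Trivial": 1.83,
    "2x unrolled": 1.44,
    "4x unrolled": 1.13,
    "Software pipelined": 1.14,
}

# IPC = 1 / CPI, rounded to two decimals like the slide's figures.
ipc = {name: round(1 / value, 2) for name, value in cpi.items()}
```

This gives 0.55, 0.69, 0.88 and 0.88, which agrees with the three IPC values that are visible on the slide.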

21
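Trivial XPK on VLIW-DLX
 | Pipeline 1 | Pipeline 2 | Pipeline 3 | Pipeline 4
#1 loop: | nop | lf f1, 0(r1) | nop | nop
#2 | nop | nop | nop | nop
#3 | nop | nop | nop | nop
#4 | nop | nop | nop | addf f1,f1,f0
#5 | subui r3,r3,1 | nop | nop | nop
#6 | nop | addui r1,r1,4 | nop | nop
#7 | bnez r3, loop | nop | nop | nop
#8 (Delay slot) | nop | nop | sf 0(r1), f1 | nop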

22
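2x Unrolled XPK on VLIW-DLX
 | Pipeline 1 | Pipeline 2 | Pipeline 3 | Pipeline 4
#1 loop: | nop | lf f1, 0(r1) | nop | nop
#2 | nop | lf f2, 4(r1) | nop | nop
#3 | nop | nop | nop | nop
#4 | nop | nop | nop | addf f1,f1,f0
#5 | nop | nop | nop | addf f2,f2,f0
#6 | subui r3,r3,2 | nop | nop | nop
#7 | nop | addui r1,r1,8 | nop | nop
#8 | bnez r3, loop | nop | sf 0(r1), f1 | nop
#9 (Delay slot) | nop | nop | sf 4(r1), f2 | nop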

23
Soft. Pipelined XPK on VLIW-DLX
 | Pipeline 1 | Pipeline 2 | Pipeline 3 | Pipeline 4
#1 loop: | nop | lf f1,28(r1) | sf 0(r1),f1 | addf f1,f1,f0
#2 | addui r1,r1,16 | lf f1,32(r1) | sf 4(r1),f1 | addf f1,f1,f0
#3 | bnez r3,loop | lf f1,36(r1) | sf 8(r1),f1 | addf f1,f1,f0
#4 (Delay sl.) | subui r3,r3,4 | lf f1,24(r1) | sf -4(r1),f1 | addf f1,f1,f0
+ Prolog and Epilog (not shown)
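The structure of a software-pipelined loop (prolog, kernel, epilog) can be illustrated in a high-level language. The Python below is a deliberately simplified sketch, not a transcription of the schedule above: it overlaps load, add and store with a distance of one iteration between stages, whereas the VLIW-DLX kernel is also unrolled four times and keeps its stages further apart. The function names are hypothetical.

```python
def xpk_plain(x, k):
    """Reference XPK: returns a copy of x with k added to every element."""
    return [value + k for value in x]


def xpk_software_pipelined(x, k):
    """XPK with load, add and store of different elements overlapped."""
    n = len(x)
    if n < 2:
        return xpk_plain(x, k)
    out = list(x)
    # Prolog: fill the pipeline.
    loaded = out[0]        # load element 0
    summed = loaded + k    # add for element 0
    loaded = out[1]        # load element 1
    # Kernel: each pass stores element i-2, adds for element i-1,
    # and loads element i. The three steps touch different elements,
    # so on a VLIW machine they can share one instruction.
    for i in range(2, n):
        out[i - 2] = summed
        summed = loaded + k
        loaded = out[i]
    # Epilog: drain the pipeline.
    out[n - 2] = summed
    out[n - 1] = loaded + k
    return out
```

The kernel contains no dependence between its three steps within one pass, which is what lets the schedule above fill nearly every slot once the prolog has run.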

24
XPK Loop Performance

25
Pipeline Efficiency of VLIW-DLX
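One simple way to quantify efficiency is slot utilization: the fraction of issue slots that hold a non-nop operation. The figures below are derived here from the three XPK schedules shown earlier (loop body only, 4 slots per instruction, prolog, epilog and pipeline fill ignored); they are not taken from this slide's chart, and the slide may define efficiency differently. The names are hypothetical.

```python
def slot_utilization(useful_ops, vliw_instructions, slots_per_instruction=4):
    """Fraction of issue slots that hold a non-nop operation."""
    return useful_ops / (vliw_instructions * slots_per_instruction)


# (useful operations, VLIW instructions, array elements) per loop pass,
# counted by hand from the three schedules.
SCHEDULES = {
    "trivial": (6, 8, 1),
    "2x unrolled": (9, 9, 2),
    "software pipelined": (15, 4, 4),
}


def summarize(schedules):
    """Map schedule name to (slot utilization, VLIW instructions per element)."""
    summary = {}
    for name, (ops, instructions, elements) in schedules.items():
        summary[name] = (
            slot_utilization(ops, instructions),
            instructions / elements,
        )
    return summary


summary = summarize(SCHEDULES)
```

Under these assumptions the trivial loop uses 18.75% of its slots at 8 instructions per element, the 2x unrolled loop 25% at 4.5 instructions per element, and the software-pipelined kernel 93.75% at 1 instruction per element.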

26
VLIW-DLX in X36APS Course
- Students will be introduced to VLIW-DLX within a single seminar.
- They will try to implement a simple loop (similar to XPK) to learn how to use the tool and software pipelining.
- They will be assigned a slightly more complex homework (SAXPY loop, matrix mult. kernel).

27
Summary and Future Work
- VLIW-DLX is a simple tool for introducing VLIWs to undergraduate students.
- It can be easily integrated into courses based on DLX and also on MIPS processors.
- A similar tool is planned to replace the aging WinDLX simulator. It will also support vector instructions.
- Our goal is to introduce all these concepts to undergraduate students within a common ISA framework.
