1 VLIW-DLX Simulator
Milos Becvar and Stanislav Kahanek
Faculty of Electrical Engineering, Czech Technical University in Prague
2 Presentation Outline
- Undergraduate Comp. Arch. Course
- Experience with WinDLX
- VLIW-DLX Simulation Model
- VLIW-DLX Simulator Features
- Example of Program
- Planned use in comp. arch. course
- Future work
3 X36APS Course Content
Undergraduate course intended for CS/CE students, a follow-up to the digital design and basic computer organization courses. 90 minutes of lecture + 90 minutes of lab/seminar per week; ... students per semester.
- Introduction, computer performance – 1 lecture
- ISA – 2 lectures
- Pipelining of RISC – 2 lectures
- Memory subsystem – 2 lectures
- Intro. to ILP: superscalar, VLIWs – 2 lectures
- Data parallelism, vector computers – 1 lecture
- Multiprocessors, coherency on SMP – 2 lectures
4 X36APS Seminars and Labs
The goal is to complement lectures with additional experience with the presented topics:
- Using visualization simulators (WinDLX, HDLDLX, SMPCache)
- Running benchmarks and evaluating various trade-offs (SPEC benchmarks, Dinero)
- "Table and chalk" seminars about topics where simulators are not available (cache design, vector computers)
Visualization simulators prove to be the most efficient way for students to interact with the topics.
6 WinDLX in X36APS Course
- Used to demonstrate the correspondence between C source code and an assembly program in the DLX ISA, and the importance of GCC optimization (1 week in class)
- Used to demonstrate loop unrolling to improve speed of execution on DLX (1 week in class)
- Matrix multiplication program (3 weeks homework)
7 Matrix Multiplication Program
Write a program for the DLX processor that computes the product of two square matrices of dimension N. Optimize the program for the given processor parameters to achieve the lowest possible execution time.
- A competition for the best result limits cheating.
- To earn full points, students have to unroll the right loop and schedule instructions to eliminate stalls.
- Register constraints are necessary to prohibit a brute-force solution (e.g. completely eliminating the inner loop by unrolling it 10 times).
Result rating (for N=10): points are awarded by clock-cycle count on a scale of 1 to 6; fewer than 6800 cycles earns 6 points.
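The unrolling idea behind this assignment can be sketched in Python, although the actual homework is written in DLX assembly. This is only an illustration under assumptions: the function names are hypothetical, the inner (k) loop is chosen for unrolling, the factor is 2, and N is assumed to be even. The slides do not say which loop or factor is "right".

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unrolled2(a, b):
    """Same product with the inner loop unrolled by 2 (assumes even n).

    Two independent accumulators give a scheduler two multiply-add chains
    to interleave, which is what hides FP latency on a pipelined machine.
    """
    n = len(a)
    assert n % 2 == 0, "this sketch only handles even dimensions"
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc0 = 0
            acc1 = 0
            for k in range(0, n, 2):
                acc0 += a[i][k] * b[k][j]          # iteration k
                acc1 += a[i][k + 1] * b[k + 1][j]  # iteration k + 1
            c[i][j] = acc0 + acc1
    return c


if __name__ == "__main__":
    n = 10
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i - j for j in range(n)] for i in range(n)]
    print(matmul_unrolled2(a, b) == matmul_naive(a, b))
```

Unrolling halves the number of loop-control instructions per element and needs extra registers for the second accumulator, which is why the register constraints above limit how far unrolling can go.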
8 VLIW-DLX Goals
A tool similar to WinDLX to illustrate the basics of VLIWs.
- Show the relationship between VLIWs and scalar pipelines
- Show the relationship between software elimination of hazards by inserting NOPs into code and the hardware solution by pipeline interlocks and stalls
- Show that the speedup achievable by extending pipeline width is limited, and show the sources of these limitations
- Demonstrate the efficiency of the software pipelining algorithm for VLIW and superscalar processors
9 Requirements for VLIW-DLX Simulator
- Similar ISA to DLX
- Similar GUI/features to the WinDLX GUI
- Visualization of the pipeline similar to WinDLX
- Must run in both Windows and Linux environments (hence it is written in Java)
11 VLIW-DLX Features
- Currently no forwarding; all data transfers go through a unified register file (Int. and FP registers).
- RAW and WAW hazards are possible. Multiple write conflicts are possible (the later operation wins).
- A single branch is allowed per VLIW instruction, in pipeline slot 1.
- The VLIW instruction following a branch, in the delay slot, is always executed (the branch is executed in the ID stage).
- The number and type of pipeline slots can be easily modified in the simulator code.
- Operations are all DLX instructions except double-precision FP instructions and division; new operations can be added easily.
13 VLIW-DLX Instructions
- Simple HW-oriented representation of VLIW instructions
- Position in the instruction corresponds to the pipeline slot; the allowed operation type is checked by the compiler
- Explicit nops must be included in unused instruction slots (the bundle concept or instruction compression is not used)
- Exchange of values between two registers is possible in a single VLIW instruction without an intermediate storage register
Example exchanging r1 and r2:
    add r1,r2,r0   add r2,r1,r0   nop   nop   nop ;;
Full 5-slot NOP:
    nop   nop   nop   nop   nop ;;
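The register exchange works because every operation in a VLIW instruction reads its source registers before any of them writes a result. A minimal Python sketch of that read-then-write behaviour follows. It is not the simulator's code: `execute_bundle` and the tuple encoding of an `add` are hypothetical, pipeline timing is ignored, and "later operation wins" is assumed here to mean the operation in the higher-numbered slot.

```python
NOP = None  # an unused slot


def execute_bundle(regs, bundle):
    """Execute one VLIW instruction made of add operations and nops.

    regs   : dict mapping register name to value
    bundle : list of slots, each either NOP or (dest, src1, src2) for an add
    """
    # Phase 1: all operations read their sources from the old register state.
    pending = []
    for op in bundle:
        if op is NOP:
            continue
        dest, src1, src2 = op
        pending.append((dest, regs[src1] + regs[src2]))

    # Phase 2: write results in slot order, so a later slot overwrites an
    # earlier one that targets the same register (assumed conflict rule).
    new_regs = dict(regs)
    for dest, value in pending:
        if dest != "r0":  # r0 is hardwired to zero in DLX
            new_regs[dest] = value
    return new_regs


if __name__ == "__main__":
    regs = {"r0": 0, "r1": 5, "r2": 7, "r3": 0}
    swap = [("r1", "r2", "r0"), ("r2", "r1", "r0"), NOP, NOP, NOP]
    print(execute_bundle(regs, swap))
```

On a scalar pipeline the same two adds executed one after the other would leave both registers holding the old value of r2, which is why a temporary register is normally needed there.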
14 VLIW-DLX Simulator Features
- Memory view
- Pipeline view
- Register view
- Source code editor
15 VLIW-DLX Simulator Features
- Shows which stage a given operation is in
- Same colors as WinDLX
16 VLIW-DLX Simulator Features
Help shows which operations are now allowed in a given pipeline slot.
17 VLIW-DLX Simulator Features
The simulator shows the register values read by a given operation in the ID stage. This helps to track RAW and WAW dependences.
18 VLIW-DLX Demonstration Program
XPK (X Plus K):
    float x[100], k;
    for (i = 0; i < 100; i++)
        x[i] += k;
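Software pipelining of the XPK loop can be sketched as follows. The sketch is in Python for readability, not in VLIW-DLX assembly. The split into load, add, and store stages and the function names are assumptions made for illustration, not taken from the slides.

```python
def xpk_simple(x, k):
    """Reference loop: x[i] += k for every element."""
    return [xi + k for xi in x]


def xpk_software_pipelined(x, k):
    """Same computation restructured as prologue / kernel / epilogue.

    Each kernel iteration works on three different elements at once:
    it stores element i, adds for element i+1, and loads element i+2.
    On a VLIW machine these three operations could share one instruction.
    """
    n = len(x)
    if n < 2:
        return xpk_simple(x, k)
    out = list(x)

    # Prologue: fill the software pipeline.
    loaded = out[0]          # load  x[0]
    summed = loaded + k      # add   x[0]
    loaded = out[1]          # load  x[1]

    # Kernel: one store, one add and one load per iteration.
    for i in range(n - 2):
        out[i] = summed      # store x[i]
        summed = loaded + k  # add   x[i+1]
        loaded = out[i + 2]  # load  x[i+2]

    # Epilogue: drain the pipeline.
    out[n - 2] = summed      # store x[n-2]
    summed = loaded + k      # add   x[n-1]
    out[n - 1] = summed      # store x[n-1]
    return out


if __name__ == "__main__":
    x = [float(i) for i in range(100)]
    print(xpk_software_pipelined(x, 0.5) == xpk_simple(x, 0.5))
```

Within the kernel, the store, add, and load do not depend on one another, so no stall or NOP is needed between them. The dependences are carried from one kernel iteration to the next instead.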
20 XPK on WinDLX
Same latency as VLIW-DLX, no forwarding; 2 multicycle FP adders are used to simulate a single pipelined FP adder.

                     Trivial   2x unrolled   4x unrolled   Software pipelined
Instruction count    604       454           379           371
Cycles               1104      654           429           422
RAW stalls           400       150           0             0
Control stalls       99        49            24            22
Structural stalls    1         1             26            29
IPC                  0.55      0.69          0.88          0.88
CPI                  1.83      1.44          1.13          1.14
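The IPC and CPI rows follow directly from the instruction counts and cycle counts. The short Python sketch below recomputes them and adds the speedup over the trivial version, which the slide does not list; the variable and function names are illustrative only.

```python
# (instruction count, cycles) per XPK variant, taken from the WinDLX results
RESULTS = {
    "trivial": (604, 1104),
    "2x unrolled": (454, 654),
    "4x unrolled": (379, 429),
    "software pipelined": (371, 422),
}


def ipc(instructions, cycles):
    """Instructions per cycle."""
    return instructions / cycles


def cpi(instructions, cycles):
    """Cycles per instruction, the reciprocal of IPC."""
    return cycles / instructions


def speedup(variant, baseline="trivial"):
    """Ratio of baseline cycles to the variant's cycles."""
    return RESULTS[baseline][1] / RESULTS[variant][1]


if __name__ == "__main__":
    for name, (instructions, cycles) in RESULTS.items():
        print(
            f"{name:20s} IPC={ipc(instructions, cycles):.2f} "
            f"CPI={cpi(instructions, cycles):.2f} "
            f"speedup={speedup(name):.2f}"
        )
```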
26 VLIW-DLX in X36APS Course
- Students will be introduced to VLIW-DLX within a single seminar
- They will try to implement a simple loop (similar to XPK) to learn how to use the tool and software pipelining
- They will be assigned a slightly more complex homework (SAXPY loop, matrix mult. kernel)
27 Summary and Future Work
- VLIW-DLX is a simple tool for introducing VLIWs to undergraduate students
- It can be easily integrated into a course based on DLX and also MIPS processors
- A similar tool is planned to replace the aging WinDLX simulator. It will also support vector instructions.
- Our goal is to introduce all these concepts to undergraduate students within a common ISA framework