HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot

University of Southern California
http://www-pdpc.usc.edu
October 12th, 2000

USC Parallel and Distributed Processing Center

HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero overhead prefetches for maximal computation throughput

Impact
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply over an in-order issue superscalar processor
- 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

Schedule
- Timeline: April 98 (Start), April 99, April 00
- Defined benchmarks
- Completed simulator
- Performed instruction-level simulations on hand-compiled benchmarks
- Continue simulations of more benchmarks (SAR)
- Define HiDISC architecture
- Benchmark result
- Develop and test a full decoupling compiler
- Update simulator
- Generate performance statistics and evaluate design

[Figure labels: Sensor Inputs; Application (FLIR, SAR, VIDEO, ATR/SLD, Scientific); Decoupling Compiler; HiDISC Processor; Processor at each of Registers, Cache, Memory; Dynamic Database; Situational Awareness]

HiDISC: Hierarchical Decoupled Instruction Set Computer

[Figure labels: Sensor Inputs; Application (FLIR, SAR, VIDEO, ATR/SLD, Scientific); Decoupling Compiler; HiDISC Processor; Processor at each of Registers, Cache, Memory; Dynamic Database; Situational Awareness]

- Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
- Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
- Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs
- Present Solutions: Multithreading (Homogeneous), Larger Caches, Prefetching, Software Multithreading

Present Solutions

Solution: Larger Caches
Limitations:
- Slow
- Works well only if working set fits cache and there is temporal locality

Solution: Hardware Prefetching
Limitations:
- Cannot be tailored for each application
- Behavior based on past and present execution-time behavior

Solution: Software Prefetching
Limitations:
- Ensure overheads of prefetching do not outweigh the benefits -> conservative prefetching
- Adaptive software prefetching is required to change prefetch distance during run-time
- Hard to insert prefetches for irregular access patterns

Solution: Multithreading
Limitations:
- Solves the throughput problem, not the memory latency problem
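To make the software prefetching limitations concrete, here is a minimal sketch in C. It assumes a GCC- or Clang-compatible compiler, which provides the `__builtin_prefetch` builtin. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are illustrative and do not come from the slides. The prefetch is issued inside the compute loop, so it competes with the computation for issue slots, and its distance is fixed at compile time rather than adapted at run time.

```c
#include <stddef.h>

/* Hypothetical fixed prefetch distance, in elements (an assumption). */
#define PREFETCH_DISTANCE 16

/* Sums an array while prefetching a fixed distance ahead.
 * Requires GCC or Clang for __builtin_prefetch. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* The prefetch shares the instruction stream with the computation. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

The prefetch changes only timing, never the result. Tuning `PREFETCH_DISTANCE` per machine and per loop is the burden the third limitation refers to.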

The HiDISC Approach

Observation:
- Software prefetching impacts compute performance
- PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach:
- Add a processor to manage prefetching -> hide overhead
- Compiler explicitly manages the memory hierarchy
- Prefetch distance adapts to the program runtime behavior

What is HiDISC?

- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (Just in Time Fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero overhead prefetches for maximal computation throughput

[Figure labels: Program -> Compiler -> Computation Instructions, Access Instructions, Cache Mgmt. Instructions; Computation Processor (CP), Access Processor (AP), Cache Mgmt. Processor (CMP); Registers, Cache, 2nd-Level Cache and Main Memory]

Decoupled Architectures

[Figure: block diagrams of four architectures, MIPS, DEAP, CAPP, and HiDISC, labeled (Conventional), (Decoupled), and (New Decoupled). Each is built from Computation Processor (CP), Access Processor (AP), and Cache Mgmt. Processor (CMP) blocks with registers and cache above a 2nd-level cache and main memory. Issue-width labels in the figure: 8-issue, 5-issue, 3-issue, 2-issue; AP (3-issue); AP (5-issue).]

- DEAP: [Kurian, Hulina, & Coraor '94]
- PIPE: [Goodman '85]
- Other Decoupled Processors: ACRI, ZS-1, WA

Slip Control Queue

- The Slip Control Queue (SCQ) adapts dynamically
- Late prefetches = prefetched data arrived after load had been issued
- Useful prefetches = prefetched data arrived before load had been issued

    if (prefetch_buffer_full())
        Don't change size of SCQ;
    else if ((2*late_prefetches) > useful_prefetches)
        Increase size of SCQ;
    else
        Decrease size of SCQ;
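The adaptation rule above can be sketched as a small C function. The slide gives only the three-way decision. The step size of one entry, the bounds `SCQ_MIN_SIZE` and `SCQ_MAX_SIZE`, and the name `scq_adapt` are assumptions added here for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bounds on the SCQ size; the slides do not give them. */
#define SCQ_MIN_SIZE 1
#define SCQ_MAX_SIZE 64

/* Returns the new SCQ size given the prefetch counters observed so far.
 * The one-entry step is an assumption, not taken from the slides. */
size_t scq_adapt(size_t scq_size, bool prefetch_buffer_full,
                 unsigned late_prefetches, unsigned useful_prefetches)
{
    if (prefetch_buffer_full)
        return scq_size;                      /* don't change size */
    if (2u * late_prefetches > useful_prefetches)
        return scq_size < SCQ_MAX_SIZE ? scq_size + 1 : scq_size;
    return scq_size > SCQ_MIN_SIZE ? scq_size - 1 : scq_size;
}
```

With 3 late and 5 useful prefetches the test `2*3 > 5` holds, so the queue grows and the prefetching stream is allowed to run further ahead. With 2 late and 5 useful it shrinks.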

Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner Loop Convolution:

    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:

    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor Code:

    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management Code:

    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue
SDQ: Store Data Queue
SCQ: Slip Control Queue
EOD: End of Data
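The split between the access and computation streams can be imitated in plain C. This is a single-threaded sketch, not the HiDISC simulator. The access stream fills a software queue completely before the computation stream drains it. The cache management stream, the SCQ, and the SAQ are left out. All type and function names (`queue_t`, `access_stream`, `compute_stream`, and so on) and the capacity `QUEUE_CAPACITY` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity; must hold 2*i operands plus the EOD token,
 * so callers must keep i <= 127. */
#define QUEUE_CAPACITY 256

typedef struct {
    double value;
    bool   eod;               /* true marks the End of Data token */
} entry_t;

typedef struct {
    entry_t items[QUEUE_CAPACITY];
    size_t  head, tail;
} queue_t;

static void put(queue_t *q, double value, bool eod)
{
    q->items[q->tail].value = value;
    q->items[q->tail].eod = eod;
    q->tail++;
}

static entry_t get(queue_t *q)
{
    return q->items[q->head++];
}

/* Access stream: performs the loads for y[i] and forwards the operands. */
static void access_stream(queue_t *ldq, const double *x, const double *h,
                          size_t i)
{
    for (size_t j = 0; j < i; ++j) {
        put(ldq, x[j], false);
        put(ldq, h[i - j - 1], false);
    }
    put(ldq, 0.0, true);      /* send (EOD token) */
}

/* Computation stream: consumes operand pairs until EOD.
 * It does no address arithmetic of its own. */
static double compute_stream(queue_t *ldq, double y)
{
    for (;;) {
        entry_t a = get(ldq);
        if (a.eod)
            break;
        entry_t b = get(ldq);
        y = y + a.value * b.value;
    }
    return y;                 /* stands in for "send y to SDQ" */
}

/* Runs both streams back to back for one output element y[i]. */
double convolve_decoupled(const double *x, const double *h, double y_init,
                          size_t i)
{
    queue_t ldq = { .head = 0, .tail = 0 };
    access_stream(&ldq, x, h, i);
    return compute_stream(&ldq, y_init);
}

/* Reference: the original inner loop. */
double convolve_sequential(const double *x, const double *h, double y_init,
                           size_t i)
{
    double y = y_init;
    for (size_t j = 0; j < i; ++j)
        y = y + x[j] * h[i - j - 1];
    return y;
}
```

Both functions accumulate in the same order, so they return identical results. What the sketch shows is that the computation stream contains only the multiply-add and the EOD test, while all indexing lives in the access stream.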

Benchmarks

Benchmark | Source of Benchmark              | Lines of Source Code | Description                                | Data Set Size
LLL1      | Livermore Loops [45]             | 20                   | 1024-element arrays, 100 iterations        | 24 KB
LLL2      | Livermore Loops                  | 24                   | 1024-element arrays, 100 iterations        | 16 KB
LLL3      | Livermore Loops                  | 18                   | 1024-element arrays, 100 iterations        | 16 KB
LLL4      | Livermore Loops                  | 25                   | 1024-element arrays, 100 iterations        | 16 KB
LLL5      | Livermore Loops                  | 17                   | 1024-element arrays, 100 iterations        | 24 KB
Tomcatv   | SPECfp95 [68]                    | 190                  | 33x33-element matrices, 5 iterations       | <64 KB
MXM       | NAS kernels [5]                  | 113                  | Unrolled matrix multiply, 2 iterations     | 448 KB
CHOLSKY   | NAS kernels                      | 156                  | Cholesky matrix decomposition              | 724 KB
VPENTA    | NAS kernels                      | 199                  | Invert three pentadiagonals simultaneously | 128 KB
Qsort     | Quicksort sorting algorithm [14] | 58                   | Quicksort                                  | 128 KB

Simulation

Parameter              | Value                     | Parameter                | Value
L1 cache size          | 4 KB                      | L2 cache size            | 16 KB
L1 cache associativity | 2                         | L2 cache associativity   | 2
L1 cache block size    | 32 B                      | L2 cache block size      | 32 B
Memory latency         | Variable (0-200 cycles)   | Memory contention time   | Variable
Victim cache size      | 32 entries                | Prefetch buffer size     | 8 entries
Load queue size        | 128                       | Store address queue size | 128
Store data queue size  | 128                       | Total issue width        | 8
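As a quick check on the cache parameters above, the number of sets in a set-associative cache is its size divided by the product of block size and associativity. The helper below is a small illustrative sketch, and the name `cache_sets` is not from the slides. For the simulated configuration it gives 64 sets for the 4 KB L1 and 256 sets for the 16 KB L2.

```c
#include <stddef.h>

/* Number of sets in a set-associative cache:
 * sets = size / (block size * associativity). */
size_t cache_sets(size_t size_bytes, size_t block_bytes, size_t ways)
{
    return size_bytes / (block_bytes * ways);
}
```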

Simulation Results

Accomplishments

- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
- 2.6x speedup for matrix decomposition/substitution (CHOLSKY) over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

Work in Progress

- Compiler design
- Data Intensive Systems (DIS) benchmarks analysis
- Simulator update
- Parameterization of silicon space for VLSI implementation

Compiler Requirements

- Source language flexibility
- Sequential assembly code for streaming
- Ease of implementation
- Optimality of sequential code
- Source level language flexibility
- Portability
- Ease of implementation
- Portability and upgradability

Gcc-2.95 Features

- Localized register spilling, global common subexpression elimination using lazy code motion algorithms
- There is also an enhancement made in the control flow graph analysis. The new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
+ Provision to add modules for instruction scheduling and delayed branch execution
+ Front-ends for C, C++ and Fortran available
+ Support for different environments and platforms
+ Cross compilation

Compiler Organization

[Figure: HiDISC Compilation Overview. Source Program -> GCC -> Assembly Code -> Stream Separator -> Computation Assembly Code, Access Assembly Code, Cache Management Assembly Code -> Assembler -> Object Code]

HiDISC Stream Separator

[Figure: flowchart of the stream separator, with a legend marking each step as Current Work or Future Work. Steps in the order extracted:]
- Sequential Source
- Program Flow Graph
- Classify Address Registers
- Allocate Instruction to streams (Access Stream, Computation Stream)
- Fix Conditional Statements
- Move Queue Access into Instructions
- Move Loop Invariants out of the loop
- Add Slip Control Queue Instructions
- Substitute Prefetches for Loads, Remove global Stores, and Reverse SCQ Direction
- Add global data Communication and Synchronization
- Produce Assembly code (Computation Assembly Code, Access Assembly Code, Cache Management Assembly Code)

Compiler Front End Optimizations

- Jump Optimization: simplify jumps to the following instruction, jumps across jumps and jumps to jumps
- Jump Threading: detect a conditional jump that branches to an identical or inverse test
- Delayed Branch Execution: find instructions that can go into the delay slots of other instructions
- Constant Propagation: propagate constants into a conditional loop

Compiler Front End Optimizations (contd.)

- Instruction Combination: combine groups of two or three instructions that are related by data flow into a single instruction
- Instruction Scheduling: look for instructions whose output will not be available by the time it is used by subsequent instructions
- Loop Optimizations: move constant expressions out of loops, and do strength reduction
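The loop optimizations named above can be illustrated by applying them by hand to a small C loop. This is a sketch of the transformations' effect, not output from gcc, and the function names `sum_before` and `sum_after` are hypothetical. The first version recomputes `a * b` and multiplies `i * stride` on every iteration. The second hoists the invariant expression out of the loop and replaces the index multiplication with a running addition.

```c
#include <stddef.h>

/* Before: a*b is recomputed and i*stride is multiplied every iteration. */
long sum_before(const long *v, size_t n, size_t stride, long a, long b)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += v[i * stride] + a * b;
    return sum;
}

/* After: the invariant is hoisted and the multiply becomes an addition. */
long sum_after(const long *v, size_t n, size_t stride, long a, long b)
{
    long sum = 0;
    const long ab = a * b;    /* constant expression moved out of the loop */
    size_t idx = 0;           /* strength reduction: idx tracks i * stride */
    for (size_t i = 0; i < n; ++i) {
        sum += v[idx] + ab;
        idx += stride;
    }
    return sum;
}
```

Both versions return the same value for the same inputs. The second simply does less work per iteration.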

Example of Stressmarks

Pointer Stressmark
- Basic idea: repeatedly follow pointers to randomized locations in memory
- Memory access pattern is unpredictable
- Randomized memory access pattern:
  - Not sufficient temporal and spatial locality for conventional cache architectures
- HiDISC architecture provides lower memory access latency
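The basic idea of following pointers to randomized locations can be shown with a few lines of C. This is a simplified, hypothetical sketch and not the DIS Pointer Stressmark itself. The function name `chase` and the use of an index array in place of real pointers are assumptions made for brevity.

```c
#include <stddef.h>

/* Follows next[] links for `hops` steps starting from `start`.
 * Each load address depends on the value returned by the previous load,
 * so the hardware cannot know the next address in advance. */
size_t chase(const size_t *next, size_t start, size_t hops)
{
    size_t index = start;
    for (size_t k = 0; k < hops; ++k)
        index = next[index];
    return index;
}
```

When `next` holds a random permutation over a region much larger than the cache, consecutive accesses rarely share a cache block, which is the lack of locality the slide describes.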

Decoupling of Pointer Stressmarks

Inner loop for the next indexing:

    for (i=j+1; i<w; i++) {
        if (field[index+i] > partition) balance++;
    }
    if (balance+high == w/2)
        break;
    else if (balance+high > w/2) {
        min = partition;
    }
    else {
        max = partition;
        high++;
    }

Computation Processor Code:

    while (not EOD)
        if (field > partition) balance++;
    if (balance+high == w/2)
        break;
    else if (balance+high > w/2) {
        min = partition;
    }
    else {
        max = partition;
        high++;
    }

Access Processor Code:

    for (i=j+1; i<w; i++) {
        load (field[index+i]);
        GET_SCQ;
    }
    send (EOD token)

Cache Management Code:

    for (i=j+1; i<w; i++) {
        prefetch (field[index+i]);
        PUT_SCQ;
    }

Stressmarks

- Hand-compile the 7 individual benchmarks
  - Use gcc as front-end
  - Manually partition each of the three instruction streams and insert synchronizing instructions
- Evaluate architectural trade-offs
  - Updated simulator characteristics such as out-of-order issue
  - Large L2 cache and enhanced main memory system such as Rambus and DDR

Simulator Update

- Survey the current processor architecture
  - Focus on commercial leading edge technology for implementation
- Analyze the current simulator and previous benchmark results
  - Enhance memory hierarchy configurations
  - Add Out-of-Order issue

Memory Hierarchy

- Current modern processors have increasingly large on-chip L2 caches
  - E.g., the 256 KB L2 cache on Pentium and Athlon processors reduces the L1 cache miss penalty
- Also, development of new mechanisms in the architecture of the main memory (e.g., RAMBUS) reduces the L2 cache miss penalty
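The two effects described above can be read off the standard two-level average memory access time model, in which the L1 miss penalty is the L2 hit time plus the L2 miss rate times the main memory latency. The sketch below is illustrative only. The function name `amat_two_level` is hypothetical, and any numbers passed to it are assumptions, not measurements from the HiDISC simulations.

```c
/* Average memory access time, in cycles, for a two-level cache hierarchy.
 * Miss rates are local fractions in the range 0.0 to 1.0. */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_miss_rate,
                      double mem_latency)
{
    /* A larger L2 lowers l2_miss_rate; faster DRAM lowers mem_latency. */
    double l1_miss_penalty = l2_hit + l2_miss_rate * mem_latency;
    return l1_hit + l1_miss_rate * l1_miss_penalty;
}
```

For example, with a 1-cycle L1 hit, a 10-cycle L2 hit, a 200-cycle memory, an L1 miss rate of 0.5 and an L2 miss rate of 0.25, the L1 miss penalty is 60 cycles and the average access time is 31 cycles.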

Out-of-Order multiple issue

- Most of the current advanced processors are based on the Superscalar and Multiple Issue paradigm
  - MIPS-R10000, Power-PC, Ultra-Sparc, Alpha and Pentium family
- Compare HiDISC architecture and modern superscalar processors
  - Out-of-Order instruction issue
  - For precise exception handling, include in-order completion
- New access decoupling paradigm for out-of-order issue

HiDISC with Modern DRAM Architecture

- RAMBUS and DDR DRAM improve the memory bandwidth
  - Latency does not improve significantly
- Decoupled access processor can fully utilize the enhanced memory bandwidth
  - More requests caused by access processor
  - Prefetching mechanism hides memory access latency

HiDISC / SMT

- Reduced memory latency of HiDISC can
  - decrease the number of threads for SMT architecture
  - relieve memory burden of SMT architecture
  - lessen complex issue logic of multithreading
- The functional unit utilization can increase with multithreading features on HiDISC
  - More instruction level parallelism is possible

The McDISC System: Memory-Centered Distributed Instruction Set Computer

Summary

- Designing a compiler
  - Porting gcc to HiDISC
- Benchmark simulation with new parameters and updated simulator
- Analysis of architectural trade-offs for equal silicon area
- Hand-compilation of Stressmarks suites and simulation
- DIS benchmarks simulation
