HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing PIs: Alvin M. Despain and Jean-Luc Gaudiot USC UNIVERSITY OF SOUTHERN CALIFORNIA.


HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
USC UNIVERSITY OF SOUTHERN CALIFORNIA
University of Southern California, October 12th, 2000

USC Parallel and Distributed Processing Center 2
HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

Impact
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply over an in-order issue superscalar processor
- 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

Schedule
- April 98 (start): Defined benchmarks; completed simulator; performed instruction-level simulations on hand-compiled benchmarks
- April 99: Continue simulations of more benchmarks (SAR); define HiDISC architecture; benchmark results
- April 00: Develop and test a full decoupling compiler; update simulator; generate performance statistics and evaluate design

[Figure: HiDISC system overview: applications (FLIR, SAR, VIDEO, ATR/SLD, scientific) and sensor inputs are compiled by the decoupling compiler onto HiDISC processors, one per level of the memory hierarchy (registers, cache, memory, dynamic database), supporting situational awareness]

USC Parallel and Distributed Processing Center 3
HiDISC: Hierarchical Decoupled Instruction Set Computer

[Figure: HiDISC system overview, as on the previous slide]

- Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
- Domain: benchmarks with large data sets: symbolic, signal processing, and scientific programs
- Present solutions: multithreading (homogeneous), larger caches, prefetching, software multithreading

USC Parallel and Distributed Processing Center 4
Present Solutions

Larger Caches
- Slow
- Works well only if the working set fits the cache and there is temporal locality

Hardware Prefetching
- Cannot be tailored for each application
- Behavior based on past and present execution-time behavior

Software Prefetching
- Must ensure that the overheads of prefetching do not outweigh the benefits -> conservative prefetching
- Adaptive software prefetching is required to change the prefetch distance at run-time
- Hard to insert prefetches for irregular access patterns

Multithreading
- Solves the throughput problem, not the memory latency problem
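
To make the software prefetching trade-offs above concrete, here is a minimal sketch of compiler-inserted prefetching for a regular loop (an illustration, not from the original slides; it uses GCC's __builtin_prefetch, and the fixed distance PF_DIST is exactly the knob that adaptive prefetching would tune at run-time):

    #include <stddef.h>

    /* Prefetch distance in elements -- an assumed, fixed value.
     * Too small: data arrives late. Too large: it may be evicted
     * before use or waste bandwidth (hence conservative choices). */
    #define PF_DIST 16

    double sum_array(const double *x, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&x[i + PF_DIST], /*rw=*/0, /*locality=*/1);
            sum += x[i];
        }
        return sum;
    }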

USC Parallel and Distributed Processing Center 5
The HiDISC Approach

Observations:
- Software prefetching impacts compute performance
- PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach:
- Add a processor to manage prefetching -> hide overhead
- Compiler explicitly manages the memory hierarchy
- Prefetch distance adapts to the program's run-time behavior

USC Parallel and Distributed Processing Center 6
What is HiDISC?

- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

[Figure: the compiler splits the program into computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Mgmt. Processor (CMP); the CP works out of the registers, the AP out of the cache, and the CMP manages the 2nd-level cache and main memory]

USC Parallel and Distributed Processing Center 7
Decoupled Architectures

[Figure: four pipelines compared side by side: MIPS (conventional: one 8-issue processor with registers, cache, and 2nd-level cache and main memory), DEAP (decoupled: a Computation Processor (CP) and an Access Processor (AP) sharing the memory hierarchy), CAPP (a processor paired with a Cache Mgmt. Processor (CMP)), and HiDISC (new decoupled: CP, AP, and CMP, one per level of the memory hierarchy); the issue widths within each design (combinations of 5-, 3-, and 2-issue units) sum to 8]

DEAP: [Kurian, Hulina, & Coraor '94]
PIPE: [Goodman '85]
Other decoupled processors: ACRI, ZS-1, WA

USC Parallel and Distributed Processing Center 8
Slip Control Queue

- The Slip Control Queue (SCQ) adapts dynamically
  - Late prefetches = prefetched data arrived after the load had been issued
  - Useful prefetches = prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        increase size of SCQ;
    else
        decrease size of SCQ;
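
Rendered as runnable C, the policy might look like the sketch below (the identifiers, bounds, and starting size are assumptions, not HiDISC specifics):

    /* Sketch of the SCQ sizing policy from the slide; the MIN/MAX
     * bounds and counter plumbing are illustrative assumptions. */
    enum { SCQ_MIN = 1, SCQ_MAX = 128 };

    static int scq_size = 8;   /* current slip distance (assumed start) */

    void adapt_scq(int prefetch_buffer_full,
                   int late_prefetches, int useful_prefetches)
    {
        if (prefetch_buffer_full)
            return;                          /* don't change size of SCQ */
        if (2 * late_prefetches > useful_prefetches) {
            if (scq_size < SCQ_MAX)
                scq_size++;                  /* increase size of SCQ */
        } else if (scq_size > SCQ_MIN) {
            scq_size--;                      /* decrease size of SCQ */
        }
    }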

USC Parallel and Distributed Processing Center 9
Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop convolution:

    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:

    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management code:

    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data
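
For intuition about how the streams cooperate, the sketch below emulates the AP/CP split in plain C, with a software buffer standing in for the load data queue (a model of the idea only; the real HiDISC queues and ISA are hardware constructs):

    #include <stdlib.h>

    double convolve_point(const double *x, const double *h, int i)
    {
        /* "Access processor" phase: all address arithmetic and
         * loads, depositing operands into a software load queue. */
        double *ldq = malloc(2 * (size_t)i * sizeof *ldq);
        int tail = 0;
        for (int j = 0; j < i; ++j) {
            ldq[tail++] = x[j];
            ldq[tail++] = h[i - j - 1];
        }

        /* "Computation processor" phase: pure arithmetic that
         * consumes operands in order and never forms an address. */
        double y = 0.0;
        for (int head = 0; head < tail; head += 2)
            y += ldq[head] * ldq[head + 1];

        free(ldq);
        return y;
    }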

USC Parallel and Distributed Processing Center 10
Benchmarks

Benchmark  Source of Benchmark             Lines of Source Code  Description                                 Data Set Size
LLL1       Livermore Loops [45]                                  element arrays, 100 iterations              24 KB
LLL2       Livermore Loops                                       element arrays, 100 iterations              16 KB
LLL3       Livermore Loops                                       element arrays, 100 iterations              16 KB
LLL4       Livermore Loops                                       element arrays, 100 iterations              16 KB
LLL5       Livermore Loops                                       element arrays, 100 iterations              24 KB
Tomcatv    SPECfp95 [68]                                         x33-element matrices, 5 iterations          <64 KB
MXM        NAS kernels [5]                 113                   Unrolled matrix multiply, 2 iterations      448 KB
CHOLSKY    NAS kernels                     156                   Cholesky matrix decomposition               724 KB
VPENTA     NAS kernels                     199                   Invert three pentadiagonals simultaneously  128 KB
Qsort      Quicksort sorting algorithm [14]  58                  Quicksort                                   128 KB

USC Parallel and Distributed Processing Center 11
Simulation

Parameter                Value                    Parameter                 Value
L1 cache size            4 KB                     L2 cache size             16 KB
L1 cache associativity   2                        L2 cache associativity    2
L1 cache block size      32 B                     L2 cache block size       32 B
Memory latency           Variable (0-200 cycles)  Memory contention time    Variable
Victim cache size        32 entries               Prefetch buffer size      8 entries
Load queue size          128                      Store address queue size  128
Store data queue size    128                      Total issue width         8

USC Parallel and Distributed Processing Center 12
Simulation Results

USC Parallel and Distributed Processing Center 13
Accomplishments

- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
- 2.6x speedup for matrix decomposition/substitution (Cholesky) over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

USC Parallel and Distributed Processing Center 14
Work in Progress

- Compiler design
- Data Intensive Systems (DIS) benchmarks analysis
- Simulator update
- Parameterization of silicon space for VLSI implementation

USC Parallel and Distributed Processing Center 15
Compiler Requirements

- Source language flexibility
- Sequential assembly code for streaming
- Optimality of sequential code
- Ease of implementation
- Portability and upgradability

USC Parallel and Distributed Processing Center 16
Gcc-2.95 Features

- Localized register spilling; global common subexpression elimination using lazy code motion algorithms
- Enhanced control flow graph analysis: the new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
+ Provision to add modules for instruction scheduling and delayed branch execution
+ Front-ends for C, C++, and Fortran available
+ Support for different environments and platforms
+ Cross-compilation
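
As a generic illustration of what global common subexpression elimination buys (not code from the HiDISC compiler):

    /* Before GCSE: b * c is evaluated on both paths. */
    int before(int cond, int b, int c) {
        int x;
        if (cond) x = b * c + 1;
        else      x = b * c - 1;
        return x;
    }

    /* After GCSE via lazy code motion (conceptually): the common
     * subexpression is computed once and reused. */
    int after(int cond, int b, int c) {
        int t = b * c;
        return cond ? t + 1 : t - 1;
    }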

USC Parallel and Distributed Processing Center 17
Compiler Organization

HiDISC compilation overview:

[Figure: Source Program -> GCC -> Assembly Code -> Stream Separator -> Computation / Access / Cache Management Assembly Code -> Assembler -> Object Code]

USC Parallel and Distributed Processing Center 18
HiDISC Stream Separator

[Figure: stream-separation pipeline: Sequential Source Program -> Flow Graph -> Classify Address Registers -> Allocate Instructions to Streams -> Access Stream and Computation Stream. Subsequent passes per stream: Fix Conditional Statements; Move Queue Accesses into Instructions; Move Loop Invariants out of the Loop; Add Slip Control Queue Instructions; Substitute Prefetches for Loads, Remove Global Stores, and Reverse SCQ Direction; Add Global Data Communication and Synchronization; Produce Assembly Code -> Computation, Access, and Cache Management Assembly Code. The figure marks some passes as current work and the rest as future work.]

USC Parallel and Distributed Processing Center 19
Compiler Front-End Optimizations

- Jump optimization: simplify jumps to the following instruction, jumps across jumps, and jumps to jumps
- Jump threading: detect a conditional jump that branches to an identical or inverse test
- Delayed branch execution: find instructions that can go into the delay slots of other instructions
- Constant propagation: propagate constants into a conditional loop

USC Parallel and Distributed Processing Center 20
Compiler Front-End Optimizations (contd.)

- Instruction combination: combine groups of two or three instructions that are related by data flow into a single instruction
- Instruction scheduling: look for instructions whose outputs will not be available by the time they are used in subsequent instructions
- Loop optimizations: move constant expressions out of loops and perform strength reduction
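
A generic before/after sketch of the loop optimizations just listed (illustrative, not HiDISC-specific):

    /* Before: a + b is loop-invariant, and the index multiply
     * i * stride is redone every iteration. */
    void before(double *out, int n, int stride, double a, double b) {
        for (int i = 0; i < n; i++)
            out[i * stride] = a + b;
    }

    /* After loop-invariant code motion and strength reduction:
     * the invariant is hoisted and the multiply becomes a
     * running add. */
    void after(double *out, int n, int stride, double a, double b) {
        double t = a + b;
        for (int i = 0, k = 0; i < n; i++, k += stride)
            out[k] = t;
    }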

USC Parallel and Distributed Processing Center 21
Example of Stressmarks

Pointer Stressmark
- Basic idea: repeatedly follow pointers to randomized locations in memory
- Memory access pattern is unpredictable
- Randomized memory access pattern: insufficient temporal and spatial locality for conventional cache architectures
- The HiDISC architecture provides lower memory access latency
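
A minimal pointer-chasing kernel in the spirit of the Pointer Stressmark (a generic sketch; the actual DIS stressmark differs in detail):

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1u << 20)   /* illustrative working set: 1M nodes */

    /* Each load depends on the previous one, so neither a cache
     * nor a conventional prefetcher sees a predictable pattern. */
    static size_t chase(const size_t *next, size_t hops)
    {
        size_t p = 0;
        while (hops--)
            p = next[p];
        return p;
    }

    int main(void)
    {
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++) next[i] = i;
        /* Sattolo's shuffle: turns the identity into one random cycle */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        printf("%zu\n", chase(next, 10 * (size_t)N));
        free(next);
        return 0;
    }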

USC Parallel and Distributed Processing Center 22
Decoupling of Pointer Stressmarks

Inner loop for the next indexing:

    for (i = j+1; i < w; i++) {
        if (field[index+i] > partition) balance++;
    }
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Computation Processor code:

    while (not EOD)
        if (field > partition) balance++;
    if (balance+high == w/2) break;
    else if (balance+high > w/2) {
        min = partition;
    } else {
        max = partition;
        high++;
    }

Access Processor code:

    for (i = j+1; i < w; i++) {
        load (field[index+i]);
        GET_SCQ;
    }
    send (EOD token)

Cache Management code:

    for (i = j+1; i < w; i++) {
        prefetch (field[index+i]);
        PUT_SCQ;
    }

USC Parallel and Distributed Processing Center 23
Stressmarks

- Hand-compile the 7 individual benchmarks
  - Use gcc as the front-end
  - Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
- Evaluate architectural trade-offs
  - Updated simulator characteristics, such as out-of-order issue
  - Large L2 cache and enhanced main memory systems such as RAMBUS and DDR

USC Parallel and Distributed Processing Center 24
Simulator Update

- Survey current processor architectures
  - Focus on commercial leading-edge technology for implementation
- Analyze the current simulator and previous benchmark results
- Enhance memory hierarchy configurations
- Add out-of-order issue

USC Parallel and Distributed Processing Center 25
Memory Hierarchy

- Modern processors have increasingly large on-chip L2 caches
  - E.g., the 256 KB L2 cache on Pentium and Athlon processors reduces the L1 cache miss penalty
- New main memory architectures (e.g., RAMBUS) reduce the L2 cache miss penalty
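
For intuition, the standard average memory access time (AMAT) model shows how each level cuts the penalty of the one above it. With illustrative numbers (not from the slides): a 1-cycle L1 hit, 10% L1 miss rate, 10-cycle L2 hit, 25% L2 miss rate, and 200-cycle memory latency:

    AMAT = L1 hit time + L1 miss rate x (L2 hit time + L2 miss rate x memory latency)
         = 1 + 0.10 x (10 + 0.25 x 200)
         = 7 cycles

versus 1 + 0.10 x 200 = 21 cycles with no L2; a faster main memory (e.g., RAMBUS) attacks the 200-cycle term directly.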

USC Parallel and Distributed Processing Center 26
Out-of-Order Multiple Issue

- Most current advanced processors are based on the superscalar, multiple-issue paradigm
  - MIPS R10000, PowerPC, UltraSPARC, Alpha, and the Pentium family
- Compare the HiDISC architecture with modern superscalar processors
  - Out-of-order instruction issue
  - For precise exception handling, include in-order completion
- New access decoupling paradigm for out-of-order issue

USC Parallel and Distributed Processing Center 27
HiDISC with Modern DRAM Architecture

- RAMBUS and DDR DRAM improve memory bandwidth
  - Latency does not improve significantly
- The decoupled access processor can fully utilize the enhanced memory bandwidth
  - More requests are generated by the access processor
  - The prefetching mechanism hides memory access latency

USC Parallel and Distributed Processing Center 28
HiDISC / SMT

- The reduced memory latency of HiDISC can
  - decrease the number of threads needed in an SMT architecture
  - relieve the memory burden of an SMT architecture
  - lessen the complex issue logic of multithreading
- Functional unit utilization can increase with multithreading features on HiDISC
  - More instruction-level parallelism is possible

USC Parallel and Distributed Processing Center 29
The McDISC System: Memory-Centered Distributed Instruction Set Computer

USC Parallel and Distributed Processing Center 30
Summary

- Designing a compiler
  - Porting gcc to HiDISC
- Benchmark simulation with new parameters and an updated simulator
- Analysis of architectural trade-offs for equal silicon area
- Hand-compilation of the Stressmark suite and simulation
- DIS benchmark simulation