The Memory Gap: to Tolerate or to Reduce?
Jean-Luc Gaudiot, Professor, University of California, Irvine
April 2nd, 2002


Outline
- The problem: the Memory Gap
- Simultaneous Multithreading
- Decoupled Architectures
- Memory Technology
- Processor-In-Memory

The Memory Latency Problem
- Technological trend: memory latency is growing relative to microprocessor speed (by roughly 40% per year)
- Problem: the conventional memory hierarchy is insufficient. Many applications have large data sets that are accessed non-contiguously, and some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
- Domain: benchmarks with large data sets (symbolic, signal processing, and scientific programs)
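As an illustrative calculation based on the 40%-per-year figure above (not a number from the slides): a memory access that costs the equivalent of 50 processor cycles today would, if the relative gap keeps compounding at 40% per year, cost roughly 50 x 1.4^5, about 270 cycles, five years later. This is why a conventional cache hierarchy alone cannot keep up.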

Some Solutions
- Larger Caches: Slow. Works well only if the working set fits in the cache and there is temporal locality.
- Hardware Prefetching: Cannot be tailored for each application; behavior is based on past and present execution-time behavior.
- Software Prefetching: Must ensure the overheads of prefetching do not outweigh the benefits, which leads to conservative prefetching; adaptive software prefetching is required to change the prefetch distance at run-time; hard to insert prefetches for irregular access patterns (see the sketch below).
- Multithreading: Solves the throughput problem, not the memory latency problem.
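The prefetch-distance issue mentioned for software prefetching can be made concrete with a small C sketch. The loop, the fixed compile-time distance, and the function name are illustrative assumptions; a prefetching compiler would tune the distance per application:

    #include <stddef.h>

    /* Sum an array while software-prefetching DIST elements ahead.
       DIST is a fixed prefetch distance: too small and the data arrives
       late, too large and it may be evicted before it is used. */
    #define DIST 16

    double sum_with_prefetch(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&x[i + DIST]);  /* GCC/Clang builtin */
            s += x[i];
        }
        return s;
    }

For irregular, pointer-chasing access patterns the address of the element DIST iterations ahead is not known in advance, which is exactly why the slide calls such prefetches hard to insert.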

Limitations of Present Solutions
- Huge cache: slow, and works well only if the working set fits in the cache and there is some kind of locality
- Prefetching
  Hardware prefetching
  – Cannot be tailored for each application
  – Behavior based on past and present execution-time behavior
  Software prefetching
  – Must ensure the overheads of prefetching do not outweigh the benefits
  – Hard to insert prefetches for irregular access patterns
- SMT: enhances utilization and throughput at the thread level

Outline
- The problem: the memory gap
- Simultaneous Multithreading
- Decoupled Architectures
- Memory Technology
- Processor-In-Memory

Simultaneous Multi-Threading (SMT)
- Horizontal and vertical sharing
- Hardware support for multiple threads
- Functional resources shared by multiple threads
- Shared caches
- Highest utilization with multi-program or parallel workloads

SMT Compared to Superscalar (SS)
- Superscalar processors execute multiple instructions per cycle
- Superscalar functional units sit idle due to I-fetch stalls, conditional branches, and data dependencies
- SMT dispatches instructions from multiple instruction streams, allowing efficient execution and latency tolerance
- Vertical sharing (TLP and block multi-threading)
- Horizontal sharing (ILP and simultaneous dispatch of instructions from multiple threads)

CMP Compared to SS
- CMP uses thread-level parallelism to increase throughput
- CMP has layout efficiency: more functional units, faster clock rate
- CMP hardware partitioning limits performance: smaller level-1 resources cause increased miss rates, and execution resources are not available across partitions

Wide-Issue SS Inefficiencies
- Architecture and software limitations
  – Limited program ILP => idle functional units
  – Increased waste from speculative execution
- Technology issues
  – Area grows as O(d^3), where d is the issue (dispatch) width
  – Area grows an additional O(t log2(t)), where t is the number of SMT threads
  – Increased wire delays (increased area, tighter spacings, thinner oxides, thinner metal)
  – Increased memory access delays relative to the processor clock
  – Larger pipeline penalties
- Problems addressed through:
  – CMP: localizes processor resources
  – SMT: efficient use of functional units, latency tolerance
  – Both CMP and SMT: thread-level parallelism
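Under this area model (an illustrative calculation using the exponents quoted above), doubling the dispatch width from d = 4 to d = 8 alone grows the issue-logic area by a factor of (8/4)^3 = 8, and going from t = 2 to t = 8 SMT threads grows the thread-dependent term by (8 log2 8)/(2 log2 2) = 12. This is the quantitative case for splitting resources into several smaller cores (CMP) rather than widening a single core.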

POSM Configurations
- All architectures above have eight threads
- Which configuration has the highest performance for an average workload?
- Run benchmarks on the various configurations and find the optimal performance point

Superscalar, SMT, CMP, and POSM Processors
- Experimental results:
  – CMP and SMT both have higher throughput than a superscalar
  – The combination of CMP and SMT has the highest throughput

Equivalent Functional Units
- SMT.p1 has the highest performance through vertical and horizontal sharing
- CMP.p8 shows a linear increase in performance

Equivalent Silicon Area and System Clock Effects
- SMT.p1 throughput is limited
- SMT.p1 and POSM.p2 have equivalent single-thread performance
- POSM.p4 and CMP.p8 have the highest throughput

Synthesis
- Comparable silicon resources are required for a fair processor evaluation
- POSM.p4 has 56% more throughput than the wide-issue SMT.p1
- Future wide-issue processors are difficult to implement, increasing the POSM advantage
  – Smaller technology spacings have higher routing delays due to parasitic resistance and capacitance
  – The larger the processor, the larger the O(d^2 t log2(t)) and O(d^3 t) impact on area and delays
- SMT works well with deep pipelines
- The ISA and micro-architecture affect SMT overhead
  – A 4-thread x86 SMT would have 1/8th the SMT overhead
  – Layout and micro-architecture techniques reduce SMT overhead

Outline
- The problem: the memory gap
- Simultaneous Multithreading
- Decoupled Architectures
- Memory Technology
- Processor-In-Memory

The HiDISC Approach
- Observation:
  – Software prefetching impacts compute performance
  – PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
- Approach:
  – Add a processor to manage prefetching -> hide the overhead
  – The compiler explicitly manages the memory hierarchy
  – The prefetch distance adapts to the program's runtime behavior

Decoupled Architectures
[Figure: side-by-side comparison of MIPS (conventional, a single 8-issue processor), DEAP (decoupled Computation Processor + Access Processor) [Kurian, Hulina, & Caraor 94], CAPP, and HiDISC (new decoupled: Computation Processor, Access Processor, and Cache Management Processor, one per level of the memory hierarchy), each with its registers and cache above a 2nd-level cache and main memory. Other decoupled processors: PIPE [Goodman 85], ACRI, ZS-1, WA.]

What is HiDISC?
A dedicated processor for each level of the memory hierarchy:
- Explicitly manages each level of the memory hierarchy using instructions generated by the compiler
- Hides memory latency by converting data-access predictability into data-access locality (Just-in-Time Fetch)
- Exploits instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput
[Figure: HiDISC organization. A Computation Processor (CP) with registers and an L1 cache, an Access Processor (AP), and a Cache Management Processor (CMP) in front of the L2 cache and higher levels, connected through the Slip Control Queue, Load Data Queue, Store Data Queue, and Store Address Queue.]

Slip Control Queue
- The Slip Control Queue (SCQ) adapts dynamically (see the sketch below)
  – Late prefetches: prefetched data arrived after the load had been issued
  – Useful prefetches: prefetched data arrived before the load had been issued

  if (prefetch_buffer_full())
      don't change size of SCQ;
  else if ((2 * late_prefetches) > useful_prefetches)
      increase size of SCQ;
  else
      decrease size of SCQ;
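A minimal C sketch of this adaptation policy (the structure layout, counter handling, and size bounds are assumptions for illustration; the slide only gives the decision rule):

    /* Adapt the Slip Control Queue size once per sampling interval. */
    typedef struct {
        int size;               /* current SCQ size (slip distance) */
        int late_prefetches;    /* data arrived after the load issued */
        int useful_prefetches;  /* data arrived before the load issued */
    } scq_t;

    void scq_adapt(scq_t *q, int prefetch_buffer_full) {
        if (prefetch_buffer_full) {
            /* prefetch buffer full: leave the slip distance alone */
        } else if (2 * q->late_prefetches > q->useful_prefetches) {
            if (q->size < 128) q->size++;   /* prefetches are late: slip further ahead */
        } else {
            if (q->size > 1) q->size--;     /* prefetches are early enough: pull back */
        }
        q->late_prefetches = 0;             /* restart counting for the next interval */
        q->useful_prefetches = 0;
    }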

Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop of the convolution:
  for (j = 0; j < i; ++j)
      y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
  while (not EOD)
      y = y + (x * h);
  send y to SDQ

Access Processor code:
  for (j = 0; j < i; ++j) {
      load (x[j]);
      load (h[i-j-1]);
      GET_SCQ;
  }
  send (EOD token)
  send address of y[i] to SAQ

Cache Management Processor code:
  for (j = 0; j < i; ++j) {
      prefetch (x[j]);
      prefetch (h[i-j-1]);
      PUT_SCQ;
  }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
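The same decoupling can be mimicked in ordinary C with an explicit software FIFO standing in for the Load Data Queue. The sketch below is illustrative only: the queue capacity, helper names, and test data are assumptions, not part of the HiDISC ISA. It shows the access stream filling the queue that the compute stream drains:

    #include <stdio.h>

    #define QCAP 1024   /* illustrative queue capacity */

    /* Software stand-in for the Load Data Queue. */
    static double ldq[QCAP];
    static int ldq_head, ldq_tail;
    static void ldq_push(double v) { ldq[ldq_tail++ % QCAP] = v; }
    static double ldq_pop(void)    { return ldq[ldq_head++ % QCAP]; }

    /* "Access processor": issue the loads for iteration i and enqueue them. */
    static void access_stream(const double *x, const double *h, int i) {
        for (int j = 0; j < i; ++j) {
            ldq_push(x[j]);          /* load (x[j]) */
            ldq_push(h[i - j - 1]);  /* load (h[i-j-1]) */
        }
    }

    /* "Computation processor": consume operand pairs until end of data. */
    static double compute_stream(int i) {
        double y = 0.0;
        for (int j = 0; j < i; ++j) {
            double xv = ldq_pop();
            double hv = ldq_pop();
            y += xv * hv;            /* y = y + (x * h) */
        }
        return y;
    }

    int main(void) {
        double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double h[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        int i = 8;
        access_stream(x, h, i);
        printf("y[%d] = %g\n", i, compute_stream(i));  /* prints 36 */
        return 0;
    }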

Benchmarks (source, lines of source code, description, data set size)
- LLL1: Livermore Loops [45]; element arrays, 100 iterations; 24 KB
- LLL2: Livermore Loops; element arrays, 100 iterations; 16 KB
- LLL3: Livermore Loops; element arrays, 100 iterations; 16 KB
- LLL4: Livermore Loops; element arrays, 100 iterations; 16 KB
- LLL5: Livermore Loops; element arrays, 100 iterations; 24 KB
- Tomcatv: SPECfp95 [68]; x33-element matrices, 5 iterations; <64 KB
- MXM: NAS kernels [5]; 113 lines; unrolled matrix multiply, 2 iterations; 448 KB
- CHOLSKY: NAS kernels; 156 lines; Cholesky matrix decomposition; 724 KB
- VPENTA: NAS kernels; 199 lines; invert three pentadiagonals simultaneously; 128 KB
- Qsort: Quicksort sorting algorithm [14]; 58 lines; Quicksort; 128 KB

Simulation Parameters
- L1 cache size: 4 KB; L2 cache size: 16 KB
- L1 cache associativity: 2; L2 cache associativity: 2
- L1 cache block size: 32 B; L2 cache block size: 32 B
- Memory latency: variable (0-200 cycles); memory contention time: variable
- Victim cache size: 32 entries; prefetch buffer size: 8 entries
- Load queue size: 128; store address queue size: 128
- Store data queue size: 128; total issue width: 8

Simulation Results

VLSI Layout Overhead (I)
- Goal: assess the cost effectiveness of the HiDISC architecture
- Cache has become a major portion of the chip area
- Methodology: extrapolate a HiDISC VLSI layout from the MIPS R10000 processor (0.35 μm, 1996)
- The space overhead of HiDISC is extrapolated to be 11.3% more than a comparable MIPS processor
- The benchmarks should be run again using these parameters and new memory architectures

VLSI Layout Overhead (II) ComponentOriginal MIPS R10K(0.35 m) Extrapolation (0.15 m) HiDISC (0.15 m) D-Cache (32KB)26 mm mm 2 I-Cache (32KB)28 mm 2 7 mm 2 14 mm 2 TLB Part10 mm mm 2 External Interface Unit27 mm mm 2 Instruction Fetch Unit and BTB18 mm mm mm 2 Instruction Decode Section21 mm mm 2 Instruction Queue28 mm 2 7 mm 2 0 mm 2 Reorder Buffer17 mm mm 2 0 mm 2 Integer Functional Unit20 mm 2 5 mm 2 15 mm 2 FP Functional Units24 mm 2 6 mm 2 Clocking & Overhead73 mm mm 2 Total Size without L2 Cache292 mm mm mm 2 Total Size with on chip L2 Cache129.2 mm mm 2

The Flexi-DISC
- Fundamental characteristic: inherently highly dynamic at execution time
- Dynamically reconfigurable central computational kernel (CK)
- Multiple levels of caching and processing around the CK, with adjustable prefetching
- Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

The Flexi-DISC
- Partitioning of the computation kernel: it can be allocated to different portions of one application or to different applications
- The CK requires a separate next ring to feed it with data
- The variety of target applications makes the memory accesses unpredictable
- Identical processing units for the outer rings: highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved

Multiple HiDISC: McDISC
- Problem: all extant large-scale multiprocessors perform poorly when faced with a tightly-coupled parallel program.
- Reason: extant machines have a long latency when communication is needed between nodes, and this latency kills performance on tightly-coupled programs. (Multi-threading à la Tera does not help when there are dependencies.)
- The McDISC solution: provide the network interface processor (NIP) with a programmable processor that executes not only OS code (as in the Stanford FLASH) but also user code generated by the compiler.
- Advantage: the NIP, executing user code, fetches data before it is needed by the node processors, eliminating the network fetch latency most of the time.
- Result: fast execution (speedup) of tightly-coupled parallel programs.

The McDISC System: Memory-Centered Distributed Instruction Set Computer

Summary
- A processor for each level of the memory hierarchy
- Adaptive memory hierarchy management
- Reduces memory latency for systems with high memory bandwidth (PIMs, RAMBUS)
- 2x speedup for scientific benchmarks
- 3x speedup for matrix decomposition/substitution (CHOLSKY)
- 7x speedup for matrix multiply (MXM); similar results expected for ATR/SLD

Outline
- The problem: the memory gap
- Simultaneous Multithreading
- Decoupled Architectures
- Memory Technology
- Processor-In-Memory

Memory Technology
- New DRAM technologies: DDR DRAM, SLDRAM, and DRDRAM
  – Most of these DRAM technologies achieve higher bandwidth
- Integrating memory and processor on a single chip (PIM and IRAM)
  – Bandwidth and memory access latency improve sharply

New Memory Technologies (Cont.)
- Rambus DRAM (RDRAM)
  – A memory-interleaving system integrated onto a single memory chip
  – Four outstanding requests with a pipelined microarchitecture
  – Operates at much higher frequencies than SDRAM
- Direct Rambus DRAM (DRDRAM)
  – Direct control of all row and column resources concurrently with data transfer operations
  – Current DRDRAM can achieve 1.6 GB/s of bandwidth by transferring on both clock edges
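As a back-of-the-envelope check, the 1.6 GB/s figure matches a 16-bit (2-byte) Rambus channel clocked at 400 MHz and transferring on both clock edges: 2 bytes x 400 MHz x 2 edges = 1.6 GB/s.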

Intelligent RAM (IRAM)
- Merges processor and memory technology
- All memory accesses remain within a single chip
  – Bandwidth can be as high as 100 to 200 GB/s
  – Access latency is less than 20 ns
- A good solution for data-intensive streaming applications
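Such figures follow from the very wide interfaces possible on-chip. As one illustrative configuration (not one specified on the slide), a 2048-bit internal memory interface clocked at 400 MHz delivers 2048/8 bytes x 400 MHz, roughly 102 GB/s, which is at the low end of the quoted range.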

Vector IRAM
- A cost-effective system: incorporates vector processing units and the memory system on a single chip
- Beneficial for multimedia applications with critical DSP features
- Good energy efficiency
- Attractive for future mobile computing processors

Outline
- The problem: the memory gap
- Simultaneous Multithreading
- Decoupled Architectures
- Memory Technology
- Processor-In-Memory

Overview of the System
- Proposed DCS (Data-intensive Computing System) architecture

DCS System (Cont'd)
- Programming
  – Different from the conventional programming model
  – Applications are divided into two separate sections: software, executed by the host processor, and hardware, executed by the CMP
  – The programmer must use CMP instructions
- CMP
  – Several CMPs can be connected to the system bus
  – Variable CMP size and configuration, depending on the amount and complexity of the job it has to handle
  – Variable size, function, and location of the logic inside the CMP, to better handle the application
- Memory, coprocessors, I/O

CMP Architecture
- CMP (Computational Memory Processor)
  – The heart of our work
  – Responsible for executing the core operations of data-intensive applications
  – Attached to the system bus; CMP instructions are encapsulated in normal memory operations
  – Consists of many ACME (Application-specific Computational Memory Element) cells interconnected through dedicated communication links
- CMC (Computing Memory Cluster)
  – A small number of ACME cells are put together to form a CMC
  – The network connecting the CMCs is separate from the memory decoder

CMP Architecture

CMC Architecture

ACME Architecture
- ACME (Application-specific Computational Memory Element)
  – Consists of ACME memory, a configuration cache, a CE (Computing Element), and an FSM
  – The CE is the reconfigurable computing unit and consists of many CCs (Computing Cells)
  – The FSM governs the overall execution of the ACME

Inside the Computing Elements

Synchronization and Interface
- Three different kinds of communication:
  – Host processor with the CMP (and eventually with each ACME)
    - Done through synchronization variables (specific memory locations) located inside the memory of each ACME cell
    - Examples: start and end signals for operations, and CMP instructions for each ACME (see the sketch below)
  – ACME to ACME, with two different approaches:
    - Host-mediated: simple, but not practical for frequent communication
    - Distributed approach: expensive and complex, but efficient
  – CMP to CMP
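A minimal C sketch of that host-to-ACME handshake. The field layout, names, and polling loop are assumptions made for illustration; the slides only say that the handshake uses synchronization variables at specific memory locations:

    #include <stdint.h>

    /* Hypothetical layout of the synchronization variables the host would
       find at a fixed offset inside each ACME cell's memory. */
    typedef struct {
        volatile uint32_t start;    /* host sets to 1 to launch an operation */
        volatile uint32_t done;     /* ACME sets to 1 when the operation ends */
        volatile uint32_t command;  /* encoded CMP instruction for this ACME */
        volatile uint32_t operand;  /* e.g., base address of the data block */
    } acme_sync_t;

    /* Host side: issue one CMP instruction to an ACME and wait for completion.
       Because the sync variables live in ordinary memory, the "instruction"
       is just a sequence of normal stores and loads, as the slides describe. */
    void acme_run(acme_sync_t *sync, uint32_t cmd, uint32_t operand) {
        sync->command = cmd;
        sync->operand = operand;
        sync->done    = 0;
        sync->start   = 1;          /* start signal */
        while (!sync->done)         /* poll the end signal */
            ;                       /* a real host could block or do other work */
        sync->start = 0;
    }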

Benefits of the Paradigm
- All the benefits of being a PIM: increased bandwidth and reduced latency
- Faster computation
  – Parallel execution among many ACMEs
- Effective use of the full memory bandwidth
- Efficient co-existence of software and hardware
- More parallel execution inside each ACME by configuring its structure with the application in mind
- Scalability

Implementation of the CMP
- Projection of how our CMP could be implemented, according to the 2000 edition of the ITRS (International Technology Roadmap for Semiconductors), for the year 2008:
  – A high-end MPU with billions of transistors will be in production in a 0.06 µm technology on a 427 mm² die
  – If half of the die is allocated to memory, 8.13 Gbit of storage and 690 million transistors for logic will be available
  – There can be 2048 ACME cells, each with 512 KB of memory and 315K transistors for logic and control inside the ACME, with the rest of the resources (36M transistors) used for the internal interconnect
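These figures are broadly self-consistent: 2048 cells x 512 KB gives about 1 GB of on-chip memory, and 2048 x 315K is roughly 645 million logic transistors, which together with the 36 million interconnect transistors accounts for approximately the 690-million-transistor logic budget.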

Motion Estimation in MPEG
- Finds the motion vector for each macro block in the frame
- Absorbs about 70% of the total execution time of MPEG encoding
- A huge amount of simple additions, subtractions, and comparisons (see the reference sketch below)
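For reference, a plain-C sketch of the computation each ACME would accelerate: a full-search sum-of-absolute-differences (SAD) match for an 8x8 block over an 8-pixel displacement window. The frame dimensions and data layout are illustrative assumptions, not taken from the slides:

    #include <stdlib.h>

    #define W 352          /* illustrative frame width (CIF) */
    #define B 8            /* macro-block size */
    #define R 8            /* search range: +/- 8 pixels */

    /* Sum of absolute differences between one 8x8 block of the current
       frame and a candidate block of the reference frame. */
    static int sad8x8(const unsigned char *cur, const unsigned char *ref) {
        int sad = 0;
        for (int y = 0; y < B; y++)
            for (int x = 0; x < B; x++)
                sad += abs(cur[y * W + x] - ref[y * W + x]);
        return sad;
    }

    /* Full search: try every displacement in [-R, R] and keep the best.
       (bx, by) is the top-left corner of the macro block; the caller must
       ensure the search window stays inside the frame. */
    void motion_vector(const unsigned char *cur, const unsigned char *ref,
                       int bx, int by, int *mvx, int *mvy) {
        int best = 1 << 30;
        for (int dy = -R; dy <= R; dy++) {
            for (int dx = -R; dx <= R; dx++) {
                int sad = sad8x8(cur + by * W + bx,
                                 ref + (by + dy) * W + (bx + dx));
                if (sad < best) { best = sad; *mvx = dx; *mvy = dy; }
            }
        }
    }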

Example ME Execution
- One ACME structure finds the motion vector for a macro block
- Executes in a pipelined fashion, reusing the data

Example ME Execution
- Performance: for an 8x8 macro block with an 8-pixel displacement, 276 clock cycles to find the motion vector for one macro block
- Performance comparison with other architectures