Reconfigurable Caches and their Application to Media Processing
Parthasarathy (Partha) Ranganathan, Dept. of Electrical and Computer Engineering, Rice University


Reconfigurable Caches and their Application to Media Processing

Parthasarathy (Partha) Ranganathan, Dept. of Electrical and Computer Engineering, Rice University, Houston, Texas
Sarita Adve, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
Norman P. Jouppi, Western Research Laboratory, Compaq Computer Corporation, Palo Alto, California

Motivation (1 of 2)
Different workloads run on general-purpose processors: scientific/engineering, databases, media processing, ...
These workloads have widely different characteristics.
Challenge for future general-purpose systems: use most transistors effectively for all workloads.

Motivation (2 of 2)
Challenge for future general-purpose systems: use most transistors effectively for all workloads.
50% to 80% of processor transistors are devoted to caches.
Caches are very effective for engineering and database workloads, BUT large caches are often ineffective for media workloads, which have streaming data and large working sets [ISCA 1999].
Can we reuse cache transistors for other useful work?

Contributions
Reconfigurable caches: flexibility to reuse cache SRAM for other activities.
Several applications possible.
Simple organization and design changes; small impact on cache access time.
Application for media processing, e.g., instruction reuse (reuse memory for computation): 1.04X to 1.20X performance improvement.

Outline for Talk
Motivation
Reconfigurable caches: key idea; organization; implementation and timing analysis
Application for media processing
Summary and future work

Reconfigurable Caches: Key Idea
Dynamically divide the SRAM into multiple partitions; use the partitions for other useful activities.
[Figure: on-chip SRAM, currently one monolithic cache, proposed to be split into Partition A (cache) and Partition B (lookup table).]
Key idea: reuse cache transistors! The cache SRAM becomes useful for both conventional and media workloads.
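To make the idea concrete, here is a minimal C sketch of the bookkeeping such a design implies. All names (struct tags, fields, the 4-partition limit) are invented for illustration and are not taken from the paper.

```c
#include <stddef.h>

/* Hypothetical descriptor for a dynamically partitioned on-chip SRAM. */
enum partition_use { USE_CACHE, USE_LOOKUP_TABLE, USE_SCRATCH };

struct sram_partition {
    size_t bytes;           /* capacity assigned to this partition */
    enum partition_use use; /* what the partition currently holds */
};

struct onchip_sram {
    size_t total_bytes;     /* e.g., 128 KB of L1 SRAM */
    int nparts;             /* current number of partitions */
    struct sram_partition part[4];
};

/* Repartition: e.g., split one 128 KB cache into a 64 KB cache plus a
 * 64 KB lookup table, the configuration used later in the talk. */
static void split_cache(struct onchip_sram *s)
{
    s->nparts = 2;
    s->part[0] = (struct sram_partition){ s->total_bytes / 2, USE_CACHE };
    s->part[1] = (struct sram_partition){ s->total_bytes / 2, USE_LOOKUP_TABLE };
}
```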

Reconfigurable Cache Uses
A number of different uses for reconfigurable caches:
Optimizations using lookup tables to store patterns: instruction reuse, value prediction, address prediction, ...
Hardware and software prefetching: caching of prefetched lines.
Software-controlled memory: QoS guarantees, scratch memory areas.
Cache SRAM becomes useful for both conventional and media workloads.

Key Challenges
How to partition the SRAM?
How to address the different partitions as they change?
How to minimize the impact on cache access (clock cycle) time?
Answer: associativity-based partitioning.

Conventional Cache Organization
[Figure: two-way set-associative cache. The address is decomposed into tag, index, and block offset; the index selects a set, the stored tags of Way 1 and Way 2 (with their state bits) are compared against the address tag, and the compare/select logic drives the data out and hit/miss signals.]
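The figure's datapath can be mirrored in a short functional model. The following C sketch (illustrative geometry and names, not from the paper) splits an address into tag/index/offset and probes both ways of the selected set:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS   512  /* illustrative geometry: 64 KB, 2-way, 64 B lines */
#define NUM_WAYS   2
#define BLOCK_BITS 6    /* 64-byte blocks */
#define INDEX_BITS 9    /* log2(NUM_SETS) */

struct line {
    bool     valid;     /* the "State" field from the slide */
    uint32_t tag;
    uint8_t  data[1 << BLOCK_BITS];
};

static struct line cache[NUM_SETS][NUM_WAYS];

/* Probe the cache: split the address, compare tags in every way of
 * the selected set, and report hit/miss plus the matching way. */
static bool lookup(uint32_t addr, int *way_out)
{
    uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);

    for (int w = 0; w < NUM_WAYS; w++) {
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *way_out = w;   /* "select" stage: this way drives data out */
            return true;    /* hit */
        }
    }
    return false;           /* miss */
}
```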

Associativity-Based Partitioning
[Figure: the same two-way cache, with Way 1 assigned to Partition 1 and Way 2 to Partition 2; each partition applies its own tag/index/block decomposition to its addresses.]
Choose partitions at the granularity of cache "ways".
Requires multiple data paths and additional state/logic.
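Continuing the previous sketch (and reusing its `cache` array and address-split constants), a way-partitioned lookup probes only the ways owned by the requesting partition. The way assignments below are invented for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Each partition owns a contiguous range of ways. With 2 ways and 2
 * partitions, partition 0 (the cache) gets way 0 and partition 1
 * (e.g., lookup-table storage) gets way 1. Illustrative only. */
struct partition {
    int first_way;
    int num_ways;
};

static struct partition parts[2] = {
    { 0, 1 },   /* partition 0: conventional cache */
    { 1, 1 },   /* partition 1: lookup-table storage */
};

static bool lookup_partitioned(int pid, uint32_t addr, int *way_out)
{
    uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
    struct partition *p = &parts[pid];

    /* Only the ways belonging to this partition are probed; the other
     * partition's ways are never consulted, so its contents are safe. */
    for (int w = p->first_way; w < p->first_way + p->num_ways; w++) {
        if (cache[index][w].valid && cache[index][w].tag == tag) {
            *way_out = w;
            return true;
        }
    }
    return false;
}
```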

Reconfigurable Cache Organization
Associativity-based partitioning is simple, requiring only small changes to conventional caches.
But the number and granularity of partitions depend on the associativity.
Alternate approach: overlapped-wide-tag partitioning. More general, but slightly more complex. Details in the paper.

Other Organizational Choices (1 of 2)
Ensuring consistency of data at repartitioning:
Cache scrubbing: flush data at repartitioning intervals.
Lazy transitioning: augment the state bits with partition information.
Addressing of partitions: software (ISA) vs. hardware.
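One way to picture the two consistency options is the hedged C sketch below. The epoch counter is my own reading of "augment state with partition information", not a mechanism spelled out in the talk:

```c
#include <stdint.h>
#include <stdbool.h>

static int current_epoch = 0;   /* bumped on every repartitioning */

struct line_lazy {
    bool     valid;
    int      epoch;             /* partition configuration that wrote this line */
    uint32_t tag;
};

/* Cache scrubbing: eagerly invalidate every line at the repartitioning
 * interval (write-back of dirty data omitted for brevity). */
static void scrub(struct line_lazy *lines, int n)
{
    for (int i = 0; i < n; i++)
        lines[i].valid = false;
}

/* Lazy transitioning: no flush at repartitioning; a line written under
 * an old configuration simply misses the first time it is touched. */
static bool is_usable(const struct line_lazy *l, uint32_t tag)
{
    return l->valid && l->epoch == current_epoch && l->tag == tag;
}
```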

Other Organizational Choices (2 of 2)
Method of partitioning: hardware vs. software control.
Frequency of partitioning: frequent vs. infrequent.
Level of partitioning: L1, L2, or lower levels.
The tradeoffs depend on application requirements.

Conventional Cache Implementation
Tag and data arrays are split into multiple sub-arrays to reduce and balance the length of word lines and bit lines.
[Figure: cache floorplan with decoders, word lines, bit lines, column muxes, sense amps, comparators, mux drivers, and output drivers feeding the tag and data arrays, the valid output, and the data output.]

Changes for Reconfigurable Cache
Associate sub-arrays with partitions.
This places a constraint on the minimum number of sub-arrays.
Additional multiplexors, drivers, and wiring are needed.
[Figure: the same floorplan, with the address and output paths replicated for partitions 1 to NP.]

Impact on Cache Access Time
Sub-array-based partitioning supports multiple simultaneous accesses to the SRAM array with no additional data ports.
Timing analysis methodology: the CACTI analytical timing model for cache access time (Compaq WRL), extended to model reconfigurable caches.
Experiments varied cache sizes, partitions, technology, ...

Impact on Cache Access Time
Access time is comparable to the base cache (within 1-4%) for a few partitions (2).
It is higher for more partitions, especially with small caches, but still within 6% for large caches.
The impact on clock frequency is likely to be even lower.

Outline for Talk
Motivation
Reconfigurable caches
Application for media processing: instruction reuse with media processing; simulation results
Summary and future work

Application for Media Processing
Instruction reuse/memoization [Sodani and Sohi, ISCA 1997] exploits value redundancy in programs.
Store an instruction's operands and result in a reuse buffer.
If a later instance of the instruction matches its operands in the reuse buffer, skip execution and read the answer from the reuse buffer.
Few changes are needed to implement this with reconfigurable caches: the reuse buffer lives in one cache partition.
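A functional sketch of the memoization step in C; the buffer geometry, PC-based indexing, and two-operand entry format are illustrative assumptions layered on the Sodani and Sohi scheme:

```c
#include <stdint.h>
#include <stdbool.h>

#define RB_ENTRIES 2048   /* ~64 KB of 32-byte entries: one cache partition */

struct reuse_entry {
    bool     valid;
    uint32_t pc;          /* instruction address */
    uint64_t op1, op2;    /* source operand values */
    uint64_t result;      /* memoized result */
};

static struct reuse_entry rb[RB_ENTRIES];

/* Try to reuse: on a PC+operand match, skip execution entirely. */
static bool reuse_lookup(uint32_t pc, uint64_t op1, uint64_t op2,
                         uint64_t *result)
{
    struct reuse_entry *e = &rb[(pc >> 2) % RB_ENTRIES];
    if (e->valid && e->pc == pc && e->op1 == op1 && e->op2 == op2) {
        *result = e->result;  /* read the answer from the reuse buffer */
        return true;
    }
    return false;
}

/* After a normal execution, record the outcome for future reuse. */
static void reuse_update(uint32_t pc, uint64_t op1, uint64_t op2,
                         uint64_t result)
{
    struct reuse_entry *e = &rb[(pc >> 2) % RB_ENTRIES];
    *e = (struct reuse_entry){ true, pc, op1, op2, result };
}
```

In a pipeline, reuse_lookup would be consulted before issuing an instruction to a functional unit, and reuse_update would run when a normally executed result retires.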

Simulation Methodology
Detailed simulation using RSIM (Rice), a user-level execution-driven simulator.
Media processing benchmarks: JPEG image encoding/decoding; MPEG video encoding/decoding; GSM speech decoding and MPEG audio decoding; speech recognition and synthesis.

System Parameters
Modern general-purpose processor with ILP and media extensions: 1 GHz, 8-way issue, out-of-order, VIS, prefetching.
Multi-level memory hierarchy: 128 KB 4-way associative 2-cycle L1 data cache; 1 MB 4-way associative 20-cycle L2 cache.
Simple reconfigurable cache organization: 2 partitions at the L1 data cache (64 KB data cache + 64 KB instruction reuse buffer), partitioned in software at the start of the application.

Impact of Instruction Reuse
Performance improvements for all applications (1.04X to 1.20X): memory is used to reduce the compute bottleneck.
Greater potential with a more aggressive design [details in paper].
[Figure: speedup bars, including JPEG decode, MPEG decode, and speech synthesis.]

Summary
Goal: use cache transistors effectively for all workloads.
Reconfigurable caches provide the flexibility to reuse cache SRAM: simple organization and design changes, small impact on cache access time, and several possible applications.
Instruction reuse (reusing memory for computation) yields a 1.04X to 1.20X performance improvement.
More aggressive reconfiguration is currently under investigation.

More information available at