Variable-Based Multi-Module Data Caches for Clustered VLIW Processors Enric Gibert 1,2, Jaume Abella 1,2, Jesús Sánchez 1, Xavier Vera 1, Antonio González.


Variable-Based Multi-Module Data Caches for Clustered VLIW Processors
Enric Gibert 1,2, Jaume Abella 1,2, Jesús Sánchez 1, Xavier Vera 1, Antonio González 1,2
1 Intel Barcelona Research Center, Intel Labs, Barcelona
2 Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona

Variable-Based Multi-Module Data Caches for Clustered VLIW Processors (PACT’05)

Issue #1: Energy Consumption
– Energy is a first-class design goal
– Heterogeneity: lower the supply voltage and/or raise the threshold voltage, trading performance for energy (higher performance / higher energy vs. lower performance / lower energy)
– Cache memory in the ARM10: the D-cache accounts for 24% of dynamic energy and the I-cache for 22%
– Heterogeneity can be exploited in the D-cache of VLIW processors

Issue #2: Wire Delays
– From capacity-bound to communication-bound designs
– One possible solution: clustering
– Unified-cache clustered VLIW processor, used as the baseline throughout this work
[Diagram: clusters 1..n, each with a register file and functional units, joined by global communication buses and connected through memory buses to a shared cache]

Contributions
GOAL: exploit heterogeneity in the L1 D-cache of clustered VLIW processors
– Power-efficient distributed L1 data cache: divide the data cache into two modules and assign each to a cluster; modules may be heterogeneous
– Map variables statically between cache modules
– Develop instruction scheduling techniques
Results summary:
– A heterogeneous distributed data cache is a good design point
– Distributed vs. unified data cache: distributed caches outperform unified schemes in EDD and ED
– No single distributed cache configuration is best; a reconfigurable distributed cache allows additional improvements

Talk Outline
– Variable-Based Multi-Module Data Cache
– Distributed Cache Configurations
– Instruction Scheduling
– Results
– Conclusions

Variable-Based Multi-Module Cache
– The logical address space is split into two spaces (FIRST and SECOND), each holding stack, heap, and global data; stack frames are distributed across two stack pointers (SP1, SP2)
– Each space is cached by one module, so a variable lives in the module attached to one cluster (e.g. var X with cluster 1, var Y with cluster 2)
– Memory instructions therefore have a preferred cluster: their cluster affinity
– A "wrong" cluster assignment costs performance, not correctness: stall the clusters, empty the communication buses, send the request, access memory, send the reply back, and resume execution
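The variable-to-module mapping above can be sketched as a tiny model. This is a hedged illustration: the space names, the variable table, and the `access` helper are invented for the sketch, not the paper's interface.

```python
# Toy model of variable-based mapping: each variable is statically
# assigned to one address space, and each space is cached by the
# module attached to one cluster.
MODULE_OF_SPACE = {"FIRST": 1, "SECOND": 2}   # space -> cache module/cluster

VAR_SPACE = {"X": "FIRST", "Y": "SECOND"}     # static per-variable mapping

def access(var, issuing_cluster):
    """Return (module, is_local). A 'wrong' cluster assignment still
    works: the request is forwarded to the other module, costing extra
    latency but not correctness."""
    module = MODULE_OF_SPACE[VAR_SPACE[var]]
    return module, module == issuing_cluster
```

Here `access("Y", 1)` models a load of var Y issued on cluster 1: it resolves to module 2, i.e. a remote (performance-only) access.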


Distributed Cache Configurations
– Each module is 8KB with 1 R/W port; a SLOW module has higher latency and lower energy than a FAST one
– A cluster may have a FAST module, a SLOW module, or no module at all
– Configurations evaluated: FAST+NONE, FAST+FAST, SLOW+NONE, SLOW+SLOW, FAST+SLOW


Instructions-to-Variables Graph
– Built with profiling information
– Variables = global, local, heap
– A bipartite graph linking each memory instruction (loads and stores) to the variables it accesses; mapping a variable to the FIRST or SECOND space pulls its instructions toward cluster 1 or cluster 2
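A minimal sketch of building such a graph from profile samples; the sample format and function name are assumptions for illustration, not the paper's tooling.

```python
from collections import defaultdict

def build_ivg(samples):
    """Build an instructions-to-variables graph from profiled
    (memory_instruction, variable) pairs: inst -> set of variables."""
    ivg = defaultdict(set)
    for inst, var in samples:
        ivg[inst].add(var)
    return ivg

profile = [("LD1", "V1"), ("LD2", "V1"), ("ST1", "V2"),
           ("LD3", "V2"), ("LD3", "V3")]
ivg = build_ivg(profile)   # LD3 ends up linked to two variables, V2 and V3
```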

Greedy Mapping / Scheduling Algorithm
– Initial mapping: all variables to one space
– Assign affinities to instructions: express a preferred cluster for memory instructions as a value in [0,1], then propagate affinities from memory instructions to other instructions
– Schedule code + refine the mapping
– Iterate: compute IVG → compute mapping → compute affinities using the IVG and propagate them → schedule code
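The loop structure can be sketched as follows. This is a hedged outline, not the paper's implementation: the scheduling and mapping-refinement passes are stubbed as caller-supplied functions, and the affinity rule is a simplified stand-in.

```python
def compute_ivg(profile):
    """Instruction -> set of accessed variables."""
    ivg = {}
    for inst, var in profile:
        ivg.setdefault(inst, set()).add(var)
    return ivg

def compute_affinities(ivg, mapping):
    """Mean space of accessed variables: 0 -> first cluster, 1 -> second."""
    return {i: sum(mapping[v] for v in vs) / len(vs) for i, vs in ivg.items()}

def greedy_map_schedule(profile, schedule_fn, refine_fn, rounds=2):
    mapping = {v: 0 for _, v in profile}        # initial: all in one space
    schedule = None
    for _ in range(rounds):
        ivg = compute_ivg(profile)
        affinities = compute_affinities(ivg, mapping)
        schedule = schedule_fn(affinities)      # schedule code
        mapping = refine_fn(mapping, schedule)  # refine mapping
    return mapping, schedule
```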

Computing and Propagating Affinity
– A memory instruction's affinity follows from the variables it accesses: instructions touching only FIRST-space variables get affinity 0, instructions touching only SECOND-space variables get affinity 1, and a mix yields an intermediate value (e.g. 0.4)
– Affinities are propagated from memory instructions to the rest of the data-dependence graph, taking instruction latencies and slack into account
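One way to realize slack-aware propagation is a weighted average over data-dependence-graph neighbors, where low-slack (critical) edges count more. The graph encoding and the 1/(1+slack) weighting are assumptions for illustration, not the paper's exact rule.

```python
def propagate_affinity(ddg, mem_affinity):
    """ddg: inst -> list of (neighbor, slack) edges. Memory instructions
    keep their own affinity; other instructions take a slack-weighted
    average over their already-labelled neighbors."""
    aff = dict(mem_affinity)
    for inst, edges in ddg.items():
        if inst in aff:
            continue
        weighted = [(1.0 / (1 + slack), aff[n]) for n, slack in edges if n in aff]
        if weighted:
            aff[inst] = (sum(w * a for w, a in weighted) /
                         sum(w for w, _ in weighted))
    return aff

# add1 depends on LD1 (slack 0, affinity 0) and LD3 (slack 2, affinity 1),
# so the critical LD1 edge dominates and pulls add1 toward cluster 1.
aff = propagate_affinity({"add1": [("LD1", 0), ("LD3", 2)]},
                         {"LD1": 0.0, "LD3": 1.0})
```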

Cluster Assignment
– Cluster affinity plus an affinity range, e.g. (0.3, 0.7), are used to define a preferred cluster and to guide the instruction-to-cluster assignment process
– Affinity ≤ 0.3 or ≥ 0.7: strongly preferred cluster → schedule the instruction in that cluster
– Affinity inside the range: weakly preferred cluster → schedule the instruction where global communications are minimized
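The assignment rule reduces to a small decision function. The (0.3, 0.7) range comes from the slide; the per-cluster communication-cost table is a hypothetical input used only when the preference is weak.

```python
def assign_cluster(affinity, comm_cost, low=0.3, high=0.7):
    """Pick a cluster for an instruction. comm_cost maps cluster -> the
    global-communication cost of placing the instruction there; it is
    consulted only for weakly preferred instructions."""
    if affinity <= low:
        return 1                                 # strongly prefers cluster 1
    if affinity >= high:
        return 2                                 # strongly prefers cluster 2
    return min(comm_cost, key=comm_cost.get)     # weak: minimize communications
```

For example, an instruction with affinity 0.9 goes to cluster 2 even if that costs communications, while one with affinity 0.4 goes wherever communication is cheapest.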


Evaluation Framework
– IMPACT compiler infrastructure + 16 Mediabench benchmarks
– Cache parameters from CACTI, SIA projections, and ARM10 datasheets
– The data cache consumes 1/3 of the processor energy; leakage accounts for 50% of the total energy
– Module parameters: FAST = 8KB, 1 R/W port, latency L = 2; SLOW = 8KB, 1 R/W port, latency L = 4 (latency ×2, energy ×1/3)
Results outline:
– Distributed cache schemes: F+Ø, F+F, F+S, S+S, S+Ø; affinity range; EDD and ED comparison (the lower, the better); F+Ø used as the baseline throughout the presentation
– Comparison with a unified cache scheme: FAST and SLOW unified schemes, with state-of-the-art scheduling techniques for these schemes
– Reconfigurable distributed cache
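EDD and ED are energy-delay figures of merit. Assuming the usual definitions (ED = energy times delay, EDD = energy times delay squared; the slides do not spell them out), a normalized comparison against the F+Ø baseline looks like:

```python
def ed(energy, delay):
    """Energy-delay product; lower is better."""
    return energy * delay

def edd(energy, delay):
    """Energy-delay-squared product; weights performance more heavily."""
    return energy * delay ** 2

def normalized_edd(e, d, e_base, d_base):
    """EDD of a candidate relative to a baseline configuration."""
    return edd(e, d) / edd(e_base, d_base)
```

A candidate with the baseline's delay but half its energy thus reports a normalized EDD of 0.5.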

Affinity Range
– Affinity plays a key role in cluster assignment: 36%-44% better in EDD and 32% better in ED than no-affinity
– The (0,1) affinity range is best: ~92% of memory instructions access a single variable, so memory instructions effectively get a binary affinity
[Chart: EDD with and without affinity for FAST+FAST, FAST+SLOW, and SLOW+SLOW]

EDD Results
Best configuration by application sensitivity:

                               Memory ports sensitive    Memory ports insensitive
Memory latency sensitive       FAST+FAST                 FAST+NONE
Memory latency insensitive     SLOW+SLOW                 SLOW+NONE

ED Results

Comparison With Unified Cache
– Baselines: a unified FAST cache and a unified SLOW cache shared by all clusters, scheduled with state-of-the-art techniques (Aletà et al., PACT’02)
– Distributed schemes are better than unified schemes: 29-31% better in EDD and 19-29% better in ED
– Best distributed results (normalized): EDD 0.89 (FAST+SLOW), ED 0.89 (SLOW+SLOW)

Reconfigurable Distributed Cache
– The OS can set each module in one state: FAST mode, SLOW mode, or turned off
– The OS reconfigures the cache on a context switch, depending on the applications scheduled in and scheduled out
– Two different VDD and VTH for the cache; reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]
– Simple heuristic to show the potential: for each application, choose the estimated best cache configuration

          Best distributed      Reconfigurable scheme
EDD       0.89 (FAST+SLOW)      0.86
ED        0.89 (SLOW+SLOW)      0.86
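The per-application heuristic amounts to a table lookup at context-switch time. The profile table below is invented for illustration; only the selection rule reflects the slide.

```python
def pick_configuration(app, estimated_edd):
    """On a context switch, choose the configuration with the best
    (lowest) estimated EDD for the incoming application.
    estimated_edd: {app: {config: edd}}."""
    table = estimated_edd[app]
    return min(table, key=table.get)

# Hypothetical per-application estimates (not measured data):
profiles = {"mpeg2dec": {"FAST+FAST": 0.95, "FAST+SLOW": 0.88, "SLOW+SLOW": 0.91}}
print(pick_configuration("mpeg2dec", profiles))  # -> FAST+SLOW
```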


Conclusions
Distributed variable-based multi-module cache:
– Affinity is crucial for achieving good performance: 36-44% better in EDD and 32% in ED than no-affinity
– Heterogeneity (FAST+SLOW) is a good design point: 4-11% better in EDD, and from 6% worse to 10% better in ED
– No single cache configuration is best; reconfigurable cache modules exploit an additional 3-4%
Distributed schemes vs. unified schemes:
– All distributed schemes outperform unified ones: 29-31% better in EDD, 19-29% better in ED

Q&A