UPC Power and Complexity Aware Microarchitectures Jaume Abella 1 Ramon Canal 1

Presentation transcript:

UPC Power and Complexity Aware Microarchitectures
Jaume Abella 1, Ramon Canal 1, Antonio González 1,2
1 Computer Architecture Dept., UPC-Barcelona
2 Intel Barcelona Research Center, Intel Labs – UPC, Barcelona

UPC Issue Logic (I)
Adaptive IQ
– Dynamically resize the ROB and the issue queue according to their occupancy
Dependence-Based IQ
– Keep direct relationships between producers and consumers
Prescheduling IQ
– Schedule instruction issue according to the latencies of the functional units
“Reducing the Complexity of the Issue Logic”, ICS 2001
“A Low Complexity Issue Logic”, ICS 2000
“Power-Aware Adaptive Issue Queue and Register File”, HiPC 2003
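The occupancy-driven resizing of the adaptive IQ could be sketched roughly as follows. This is only an illustration: the function name `resize_queue`, the bank size `step`, the size limits, and both thresholds are assumptions, not values from the slides or the cited papers.

```python
def resize_queue(current_size, avg_occupancy, step=8, min_size=8, max_size=64):
    """Return the queue size to use in the next interval (hypothetical policy)."""
    # Grow by one bank when the queue was nearly full during the last interval.
    if avg_occupancy > current_size - step // 2:
        return min(current_size + step, max_size)
    # Shrink by one bank when, on average, more than a whole bank stayed unused.
    if avg_occupancy < current_size - step:
        return max(current_size - step, min_size)
    # Otherwise keep the current size.
    return current_size
```

The same policy could be applied independently to the ROB and to the issue queue, each with its own occupancy counter.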

UPC Issue Logic (II)
FP distributed issue queue without CAM cells
Dispatch
– Instructions belonging to a dependence chain are sent to the same queue
– Multiple dependence chains may share a queue
Issue
– A small table keeps track of how many cycles the first instruction of a chain must wait before it can be issued
– First, select the oldest instruction that will become ready next cycle; second, the oldest ready instruction
“Low-Complexity Distributed Issue Queue”, HPCA 2004
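The dispatch rule above (a dependence chain stays in one queue, and chains may share a queue) can be illustrated with a small sketch. The class name, the register names, and the fallback of placing a new chain in the least occupied queue are assumptions made for the example; the slide does not say how a queue is chosen for a new chain, and the sketch never retires instructions.

```python
from collections import deque


class DistributedIssueQueue:
    """Toy model of chain-based dispatch into several FIFO queues."""

    def __init__(self, num_queues=4):
        self.queues = [deque() for _ in range(num_queues)]
        # Destination register -> index of the queue holding its producer.
        self.producer_queue = {}

    def dispatch(self, dest, sources):
        """Place an instruction and return the index of the chosen queue."""
        for src in sources:
            if src in self.producer_queue:
                # Follow the chain: reuse the queue of the first known producer.
                q = self.producer_queue[src]
                break
        else:
            # New chain (assumed policy): pick the least occupied queue.
            q = min(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        self.queues[q].append((dest, tuple(sources)))
        self.producer_queue[dest] = q
        return q
```

For instance, with two queues, an instruction producing `f1` and a consumer of `f1` land in the same queue, while an unrelated instruction starts a chain in the other one.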

UPC Memory Hierarchy
Heterogeneous L1 Dcache banks
[Figure: each LOAD passes an “Is critical?” test (YES / NO) that selects between the fast cache and the slow cache; the diagram also shows the L2 cache]
Adaptive L2 Cache
– Deactivate cache lines
– Current predictors are L1 cache oriented
– L1 and L2 behave quite differently
– Use access counts and inter-access time to compute decay intervals
[Figure: access timelines for L1 and L2, labeled with first access, hits, and replacement]
“Power Efficient Data Cache Designs”, ICCD 2003
“Smart Predictors to Turn-off L2 Cache Lines”, under submission
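As a rough illustration of computing a decay interval from access counts and inter-access time, the sketch below keeps both statistics per cache line and turns the line off once it has been idle for longer than a multiple of its largest observed gap. The slides do not give the actual predictor, so the formula, the `default` interval, the `margin` factor, and all names are hypothetical.

```python
class LineState:
    """Per-line bookkeeping: access count and largest inter-access gap."""

    def __init__(self, now):
        self.last_access = now
        self.accesses = 1
        self.max_gap = 0

    def touch(self, now):
        # Record the gap since the previous access, then update the counters.
        self.max_gap = max(self.max_gap, now - self.last_access)
        self.last_access = now
        self.accesses += 1


def decay_interval(line, default=1024, margin=2):
    # With a single access there is no gap history yet: use a default.
    if line.accesses < 2:
        return default
    # Assumed heuristic: wait a multiple of the longest gap seen so far.
    return margin * line.max_gap


def should_turn_off(line, now):
    return now - line.last_access > decay_interval(line)
```

Times are in arbitrary cycle units; a real design would use coarse-grained counters rather than full timestamps.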

UPC Hw Value Compression
– Dynamically compress values flowing through the pipeline
– Good for embedded and high performance processors!!
[Figure: 32-bit embedded processor pipeline with value compression; blocks labeled Cache fill, Instruction Cache, Register File, ALU, Data Cache, and Writeback]
“Very Low Power Pipelines using Significance Compression”, MICRO-33
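One simple way to see how much of a 32-bit value actually carries information is to count the low-order bytes that cannot be recovered by sign extension. The sketch below does that; it is an illustration of the general idea only, and the function name and the byte granularity are assumptions rather than the encoding used in the cited paper.

```python
def significant_bytes(value):
    """Smallest number of low-order bytes whose sign extension equals the 32-bit value."""
    value &= 0xFFFFFFFF  # view the input as a 32-bit two's complement word
    for n in range(1, 4):
        bits = 8 * n
        low = value & ((1 << bits) - 1)
        if low & (1 << (bits - 1)):
            # Negative in n bytes: fill the upper bits with ones.
            extended = low | (0xFFFFFFFF & ~((1 << bits) - 1))
        else:
            # Non-negative in n bytes: upper bits are zero.
            extended = low
        if extended == value:
            return n
    return 4
```

Small constants and small negative numbers need a single byte, so the remaining bytes would not need to be read, operated on, or written.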

UPC Compiler-Directed Value Compression
[Figure: code shown in three stages: Original Code, After Value Range Propagation, After Value Range Specialization (with a CMP/BR pair)]
– Narrow operations according to the compression of their operands
– Duplicate and specialize certain regions of code
“Software-Controlled Operand-Gating”, CGO 2004
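The narrowing step can be pictured with interval arithmetic: propagate a (low, high) range through an addition, then pick the narrowest signed width that holds the result. This is a minimal sketch, not the compiler pass from the cited paper; the helper names and the candidate widths are assumptions.

```python
def add_range(a, b):
    """Range of x + y when x lies in range a and y lies in range b."""
    return (a[0] + b[0], a[1] + b[1])


def narrowest_width(rng, widths=(8, 16, 32, 64)):
    """Narrowest signed width (in bits) that can represent every value in rng."""
    lo, hi = rng
    for w in widths:
        if -(1 << (w - 1)) <= lo and hi <= (1 << (w - 1)) - 1:
            return w
    # Fall back to the widest available operation.
    return widths[-1]
```

For example, adding a value in (0, 100) to a value in (0, 20) gives the range (0, 120), which fits an 8-bit operation, while (0, 100) plus (0, 100) needs 16 bits.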