Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.

Presentation transcript:


Overview
- Previous Architectures
- New Hybrid Architecture
- Possible Benefits
- Scrutiny
- Experimental Results
- Relation to Project

Something Old
- CMP (single-Chip Multi-core Processors)
  - Two or more independent cores
- Single-ISA heterogeneous multiprocessors
  - Cores of varying size and performance
  - Same ISA
- Improve throughput for multi-threaded applications
- Single-threaded?

Superscalar
- Increase performance without recompiling
- Efficiently handle runtime events
  - Branch direction
  - Target address
  - Load latency
  - Memory dependency
- Limited ILP: hardware instruction window

VLIW
- Very Long Instruction Word
- Shift hardware complexity to the compiler
- High clock frequency
- Energy-efficient
- No need to analyze data dependencies in hardware
- No hardware scheduling of independent instructions

Something New
- Dual-core architecture [1]
- Bus-based snooping
- Communicate using L2
- In future: interconnections
- Small operand transfer buffer

Potential Benefits
- VLIW core can operate at a high clock rate
- Simple superscalar core
- More aggressive compiler optimization
  - Due to the superscalar speculative operations
- Simple hardware
- Energy efficient
- Scalable

Hybrid Compiler
- At the TLP level, aware of:
  - Execution bandwidth
  - Frequencies
- At the ILP level:
  - Architectural details of superscalar?
  - Number of functional units and latencies of VLIW
- Helper threads

Optimization Phases
- Phase 1
  - Exploit speculative threads (helper threads)
- Phase 2
  - Extract non-speculative multi-grain parallelism
- Partition source code
  - Predictable (static analysis or profiling)
  - Unpredictable (suitable for superscalar core)
- A lot more …
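The source-partitioning step can be illustrated with a minimal sketch. It assumes each code region already carries a predictability score between 0 and 1 obtained from static analysis or profiling; the score, the 0.9 cutoff, the region names, and the `partition` function are all hypothetical and not taken from the paper. Sending predictable regions to the VLIW core is a reading of the slide, which only states that unpredictable code suits the superscalar core.

```python
def partition(regions, min_predictability=0.9):
    """Assign each code region to a core by its predictability score.

    regions: dict mapping region name -> score in [0, 1] (hypothetical metric).
    Regions at or above the cutoff go to the VLIW core, whose compiler
    schedule relies on predictable behavior; the rest go to the superscalar
    core, which handles runtime events in hardware.
    """
    plan = {"vliw": [], "superscalar": []}
    for name, score in regions.items():
        core = "vliw" if score >= min_predictability else "superscalar"
        plan[core].append(name)
    return plan


# Hypothetical profile of three regions.
regions = {"loop_a": 0.98, "parse": 0.60, "loop_b": 0.90}
plan = partition(regions)
print(plan)
```

A real compiler would presumably weigh more than one score, for example communication cost between the cores, so the single cutoff here is only a placeholder.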

Did that sound right?
- Will the data be in the L2 cache when the VLIW core needs it?

What if?

Pre-Execution
- Not a new idea
- Using the superscalar core to minimize L2 miss stalls
- Stalling VLIW pipelines
- Predictable load latencies?
- Cache profiling

Definitions
- Delinquent Loads
  - A small number of load operations are responsible for the majority of data cache misses.
- Delinquent Loads Threshold
  - A pre-set threshold for the number of allowable stall cycles caused by a static load instruction
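The two definitions can be made concrete with a small sketch. It assumes a cache profile that records, per static load, its miss count and attributed stall cycles, and it treats a load as delinquent when its stall cycles exceed the threshold. The `LoadProfile` record, the addresses, and the numbers are hypothetical illustrations, not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadProfile:
    pc: int            # address of the static load instruction (hypothetical)
    misses: int        # profiled data cache misses
    stall_cycles: int  # profiled stall cycles attributed to this load


def delinquent_loads(profile, threshold):
    """Return the loads whose profiled stall cycles exceed the threshold."""
    return [p for p in profile if p.stall_cycles > threshold]


def miss_coverage(profile, selected):
    """Fraction of all profiled misses caused by the selected loads."""
    total = sum(p.misses for p in profile)
    return sum(p.misses for p in selected) / total if total else 0.0


# Made-up profile: two of four loads dominate the misses.
profile = [
    LoadProfile(pc=0x400100, misses=9000, stall_cycles=1_800_000),
    LoadProfile(pc=0x400140, misses=50, stall_cycles=10_000),
    LoadProfile(pc=0x400180, misses=700, stall_cycles=140_000),
    LoadProfile(pc=0x4001C0, misses=250, stall_cycles=50_000),
]
hot = delinquent_loads(profile, threshold=100_000)
print([hex(p.pc) for p in hot], f"{miss_coverage(profile, hot):.0%}")
```

Raising the threshold selects fewer loads for pre-execution and lowering it selects more, which is the knob the later "Delinquent Loads Threshold" experiment varies.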

Pre-Execution Thread
- Make load operations non-faulting
- Remove all store operations
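The two rules on this slide can be sketched as a transformation over a toy instruction list. The `Instr` record and `make_pre_execution_thread` function are hypothetical; a real implementation would work on the compiler's intermediate representation and would likely also prune instructions that do not feed a delinquent load, which this sketch does not attempt.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    op: str                     # "load", "store", "add", ...
    dst: str = ""               # destination register, if any
    srcs: tuple = ()            # source registers
    non_faulting: bool = False  # True if a bad address must not raise a fault


def make_pre_execution_thread(instrs):
    """Derive a helper-thread copy of a code sequence.

    Stores are dropped so the helper thread cannot modify memory, and every
    load is marked non-faulting so a speculative bad address is harmless.
    """
    helper = []
    for ins in instrs:
        if ins.op == "store":
            continue
        if ins.op == "load":
            ins = replace(ins, non_faulting=True)
        helper.append(ins)
    return helper


# Hypothetical main-thread sequence.
main_thread = [
    Instr("add", dst="r1", srcs=("r1", "r2")),
    Instr("load", dst="r3", srcs=("r1",)),
    Instr("store", srcs=("r3", "r4")),
    Instr("load", dst="r5", srcs=("r3",)),
]
helper_thread = make_pre_execution_thread(main_thread)
print([(i.op, i.non_faulting) for i in helper_thread])
```

The original sequence is left untouched because the records are immutable, so the main thread still executes its stores and ordinary faulting loads.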

Evaluation
- Simulated cores [1]

Evaluation (2)
- Hybrid compiler built upon the Trimaran compiler
- A cycle-accurate model
  - Based on integration of:
    - VLIW simulator from Trimaran
    - Superscalar simulator: SimpleScalar

Evaluation (3)
- Seven single-threaded applications from:
  - SPEC 2000 INT
  - SPEC 92 FP

Base, Pre-Execution, Prefetch

L2 Miss Latency

Delinquent Loads Threshold

Relation?
- Relation to course project
  - Project focuses on scalability of optimization techniques
- Relation to course
  - How multi-cores can help single-threaded applications

Reference
[1] Yan J., Zhang W., "Hybrid multi-core architecture for boosting single-threaded performance", ACM SIGARCH Computer Architecture News 35(1), 2007.