Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.

1 Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007

2 Overview: Previous Architectures; New Hybrid Architecture; Possible Benefits; Scrutiny; Experimental Results; Relation to Project

3 Something Old: CMP (single-Chip Multi-core Processors): two or more independent cores; Single-ISA heterogeneous multiprocessors: cores of varying size and performance, same ISA; Improve throughput for multi-threaded workloads; Single-threaded?

4 Superscalar: Increase performance without recompiling; Efficiently handle runtime events: branch direction, target address, load latency, memory dependency; Limited ILP: hardware instruction window

5 VLIW (Very Long Instruction Word): Shift hardware complexity to the compiler; High clock frequency; Energy-efficient; No need for hardware to analyze data dependencies; No hardware scheduling of independent instructions

6 Something New: Dual-Core Architecture [1]; Bus-based snooping; Communicate using L2; In future: interconnections; Small operand transfer buffer

7 Potential Benefits: VLIW core can operate at a high clock rate; Simple superscalar core; More aggressive compiler optimization, due to the superscalar core's speculative operations; Simple hardware; Energy-efficient; Scalable

8 Hybrid Compiler: At the TLP level, aware of: execution bandwidth, frequencies; At the ILP level: architectural details of the superscalar core?, number of functional units and latencies of the VLIW core; Helper threads

9 Optimization Phases: Phase 1: exploit speculative threads (helper threads); Phase 2: extract non-speculative multi-grain parallelism; Partition source code: predictable (static analysis or profiling), unpredictable (suitable for superscalar core); A lot more …
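The partitioning step on this slide can be illustrated with a small sketch. Everything below is an assumption made for illustration: the slide only says that predictability is judged by static analysis or profiling and that unpredictable code suits the superscalar core. The `Region` type, the two profiled rates, the cut-off values, and the choice to send predictable regions to the VLIW core are hypothetical and are not taken from [1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A code region with profiled behaviour (hypothetical metrics)."""
    name: str
    branch_mispredict_rate: float  # fraction of branches mispredicted
    l1_miss_rate: float            # fraction of loads missing in L1


def assign_core(region, max_mispredict=0.05, max_miss=0.02):
    """Send predictable regions to the VLIW core, the rest to superscalar.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    predictable = (
        region.branch_mispredict_rate <= max_mispredict
        and region.l1_miss_rate <= max_miss
    )
    return "vliw" if predictable else "superscalar"


regions = [
    Region("loop_nest", branch_mispredict_rate=0.01, l1_miss_rate=0.005),
    Region("pointer_chase", branch_mispredict_rate=0.12, l1_miss_rate=0.08),
]
placement = {r.name: assign_core(r) for r in regions}
```

Here the regular loop nest lands on the statically scheduled core and the irregular pointer-chasing region lands on the core that handles runtime events in hardware.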

10 Did that sound right? Will the data be in the L2 cache when the VLIW core needs it?

11 What if?

12 Pre-Execution: Not a new idea; Using superscalar core to minimize L2 miss stalls; Stalling VLIW pipelines; Predictable load latencies?; Cache profiling

13 Definitions: Delinquent Loads: the small number of load operations that are responsible for the majority of data cache misses; Delinquent Loads Threshold: a pre-set threshold for the number of allowable stall cycles caused by a static load instruction
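The two definitions above can be made concrete with a minimal sketch. It assumes a cache profile is available as a list of per-load records; the `LoadProfile` type, the field names, the addresses, and the numbers are all hypothetical and only illustrate how a threshold on stall cycles picks out the delinquent loads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadProfile:
    """Profile record for one static load instruction (hypothetical)."""
    pc: int            # address of the static load
    misses: int        # data cache misses attributed to this load
    stall_cycles: int  # stall cycles attributed to this load


def select_delinquent_loads(profile, threshold):
    """Return loads whose stall cycles exceed the allowable threshold."""
    return [p for p in profile if p.stall_cycles > threshold]


def miss_coverage(profile, selected):
    """Fraction of all profiled misses caused by the selected loads."""
    total = sum(p.misses for p in profile)
    return sum(p.misses for p in selected) / total if total else 0.0


# Made-up profile: a few loads dominate the misses.
profile = [
    LoadProfile(0x400A10, misses=9000, stall_cycles=1_800_000),
    LoadProfile(0x400B24, misses=600, stall_cycles=120_000),
    LoadProfile(0x400C38, misses=300, stall_cycles=60_000),
    LoadProfile(0x400D4C, misses=100, stall_cycles=20_000),
]
delinquent = select_delinquent_loads(profile, threshold=100_000)
coverage = miss_coverage(profile, delinquent)
```

Lowering the threshold marks more loads as delinquent and so asks the helper thread to cover more of the program; raising it restricts pre-execution to the worst offenders.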

14 Pre-Execution Thread: Make load operations non-faulting; Remove all store operations
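The two transformations named on this slide can be sketched over a toy instruction list. The `Instr` type and the register names are hypothetical, and the sketch applies only the two steps the slide states; whatever further slicing the compiler in [1] performs is not shown.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    """One instruction in a toy IR (hypothetical, for illustration only)."""
    op: str
    dest: str = ""
    srcs: tuple = ()
    non_faulting: bool = False


def build_pre_execution_thread(instrs):
    """Derive a helper thread: drop stores, make loads non-faulting."""
    helper = []
    for ins in instrs:
        if ins.op == "store":
            continue  # the helper thread must not change memory state
        if ins.op == "load":
            # a speculative load must not raise an exception
            ins = replace(ins, non_faulting=True)
        helper.append(ins)
    return helper


main_thread = [
    Instr("load", "r1", ("r0",)),
    Instr("add", "r2", ("r1", "r3")),
    Instr("store", "", ("r2", "r4")),
    Instr("load", "r5", ("r2",)),
]
helper_thread = build_pre_execution_thread(main_thread)
```

With stores removed and loads unable to fault, the helper thread can run ahead and touch the cache without affecting the architectural state of the main thread.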

15 Evaluation: Simulated Cores [1]

16 Evaluation (2): Hybrid compiler built upon the Trimaran compiler; A cycle-accurate model, based on integration of: the VLIW simulator from Trimaran, the superscalar simulator SimpleScalar

17 Evaluation (3): Seven single-threaded applications from: SPEC 2000 INT, SPEC 92 FP

18 Base, Pre-Execution, Prefetch

19 L2 Miss Latency

20 Delinquent Loads Threshold

21 Relation?: Relation to course project: project focuses on scalability of optimization techniques; Relation to course: how multi-cores can help single-threaded applications

22 Reference: [1] Yan J., Zhang W., "Hybrid multi-core architecture for boosting single-threaded performance", ACM SIGARCH Computer Architecture News 35(1): 141-148, 2007

