
From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance Without ILP or Speculation
Sami Yehia and Olivier Temam, LRI, Paris South University, France

2/18 Scaling Up Processors
- Larger pipelines, caches, instruction windows, and reservation stations
- Aggressive speculation mechanisms: branch prediction, value prediction, data prefetching, ...
- These techniques all rely on exploiting ILP
- What about scaling when there is little ILP?

3/18 Concept
A sequence of instructions is a set of combinational functions: each output register is a function of the program's input registers.

    ...
    addq r1,r2,r3
    subq r3,10,r4
    ...
    sll  r5,6,r6
    addq r5,r5,r4
    ...

For this program, r6 = f1(r1, r2, ..., rn) and r4 = f2(r1, r2, ..., rn). In principle each output could be computed by a logic circuit taking the individual register bits (r1_63 ... r1_0, ...) as inputs and producing the result bits (f1_63 ... f1_0) directly; theoretically, such a function has 2^(64 * num_registers) possible inputs.
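To make this concrete, here is a minimal Python sketch (with hypothetical register values, and treating the elided "..." instructions as absent): however deep the dependent chain, each live output is a pure function of the initial register values.

    MASK = (1 << 64) - 1  # model 64-bit Alpha registers

    def sequential(r1, r2, r5):
        r3 = (r1 + r2) & MASK    # addq r1,r2,r3
        r4 = (r3 - 10) & MASK    # subq r3,10,r4
        r6 = (r5 << 6) & MASK    # sll  r5,6,r6
        r4 = (r5 + r5) & MASK    # addq r5,r5,r4 (overwrites r4)
        return r6, r4

    # Collapsed forms: r6 = f1(...) and r4 = f2(...) depend only on the
    # live-in registers; intermediate steps and dead writes disappear.
    def f1(r1, r2, r5):
        return (r5 << 6) & MASK

    def f2(r1, r2, r5):
        return (r5 + r5) & MASK

    assert sequential(3, 4, 7) == (f1(3, 4, 7), f2(3, 4, 7))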

4/18 Principles
An "independent" function for each output:

    f_r3(r9, r10) = r9 + r10 - 1
    f_r4(r9, r10) = sign_extension((r9 + r10 - 1)[31:0])
    f_r5(r9, r10) = (r9 + r10 - 1) >> 1
    f_br(r9, r10) = a comparison of (r9 + r10 - 1) with ((r9 + r10 - 1) >> 1)

[Figure: DFG of the instruction sequence]
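A sketch of how such per-output functions can be derived mechanically by forward substitution through the DFG: each source register is replaced by its defining expression, so every live output becomes a self-contained expression over the live-in registers. The trace encoding and the example sequence below are illustrative, not the paper's toolchain.

    def collapse(trace, live_ins):
        """Forward-substitute definitions so each output register maps
        to a single expression over the live-in registers only."""
        expr = {r: r for r in live_ins}            # r9 -> "r9", ...

        def src(x):
            return expr.get(x, x)                  # literals pass through

        for op, a, b, dst in trace:
            expr[dst] = f"({src(a)} {op} {src(b)})"
        return expr

    # A sequence in the spirit of the slide's example:
    trace = [
        ("+",  "r9", "r10", "r3"),     # r3 = r9 + r10
        ("-",  "r3", "1",   "r3"),     # r3 = r3 - 1
        (">>", "r3", "1",   "r5"),     # r5 = r3 >> 1
        ("<",  "r5", "r3",  "br"),     # br = r5 < r3
    ]
    fns = collapse(trace, ["r9", "r10"])
    print(fns["r3"])   # ((r9 + r10) - 1)
    print(fns["r5"])   # (((r9 + r10) - 1) >> 1)
    print(fns["br"])   # independent of r3 and r5: only r9 and r10 appear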

5/18 Hardware Operator
- Map r10 + r9 - 1 onto hardware operators
- Eliminate dependencies to calculate a + b + c

For two cascaded adders (inputs a, b, c; intermediate result f1; final output out), bit i obeys:

    f1_i    = f'(a_i, b_i, cout1_{i-1})
    cout1_i = f'_c(a_i, b_i, cout1_{i-1})
    out_i   = f''(f1_i, c_i, cout2_{i-1}) = f''(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
    cout2_i = f''_c(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})

so each output bit can be expressed directly over the primary inputs and two carry bits, removing the dependency between the two adders.
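These equations say the second adder's output bit can be written directly over the primary inputs plus two carry bits. A classic way to exploit this is carry-save addition: a 3:2 compressor reduces a + b + c to two values whose bit i depends only on bit i of the inputs, leaving a single carry-propagating add. A minimal sketch of that principle, not the paper's exact operator:

    MASK = (1 << 64) - 1

    def add3_carry_save(a, b, c):
        # 3:2 compressor: each bit of s and k is a function of
        # a_i, b_i, c_i alone -- no carry chain between the inputs.
        s = a ^ b ^ c                              # partial sum bits
        k = ((a & b) | (a & c) | (b & c)) << 1     # saved carry bits
        return (s + k) & MASK                      # one final adder

    assert add3_carry_save(12345, 67890, 1 << 40) == (12345 + 67890 + (1 << 40)) & MASK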

6/18 Complexity Effectiveness
Scalability of ILP exploitation vs. functions:

[Figure: Performance vs. Complexity, with one curve for ILP exploitation and one for functions]

7/18 Related Work
- ASICs
- General-purpose context:
  - 3-1 Interlock Collapsing ALU [Y. Sazeides, S. Vassiliadis and J. Smith, MICRO-29, 1996]
  - Chimaera [Z. Ye et al., ISCA-27, 2000]
  - Grid Processors [R. Nagarajan et al., MICRO-34, 2001]
- These approaches cascade one or more hardware operators to execute specific functions

[Figure: an AND/OR/XOR stage multiplexed into an adder]

8/18 Building Functions
From traces of instructions to configuration macros: a compilation toolchain to study
- the potential of the approach
- performance on a superscalar processor

[Figure: toolchain from traces to configuration macros]

9/18 Potential of the Approach
- Cuts: limits to DFG collapsing (height); see the sketch below
  - number of inputs
  - non-collapsible instructions
  - load instructions (27.7% of instructions)
  - carries from upper significant bits
- Theoretical speedup: the lower the ILP, the higher the speedup

[Figure: a load instruction cutting a DFG, splitting it into a function before the memory access and a function F2 after it]
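A sketch of how these cut rules might be applied while growing a function from a trace; the input budget and opcode sets below are illustrative assumptions, not the paper's parameters.

    from dataclasses import dataclass

    @dataclass
    class Instr:
        op: str
        srcs: tuple
        dst: str

    MAX_INPUTS = 4                       # hypothetical operator input budget
    NON_COLLAPSIBLE = {"mul", "div"}     # illustrative complex opcodes

    def grow_function(trace, live_ins):
        """Collapse instructions into one function until a cut is hit:
        a load, a non-collapsible opcode, or too many live inputs."""
        inputs, taken = set(), []
        for instr in trace:
            if instr.op == "load":                  # load cut
                break
            if instr.op in NON_COLLAPSIBLE:         # non-collapsible cut
                break
            new_inputs = inputs | (set(instr.srcs) & live_ins)
            if len(new_inputs) > MAX_INPUTS:        # input-count cut
                break
            inputs, taken = new_inputs, taken + [instr]
        return taken, inputs

    # Example: the load cuts the DFG, ending the first function early.
    trace = [Instr("add", ("r9", "r10"), "r3"),
             Instr("load", ("r3",), "r4"),
             Instr("sub", ("r4", "r9"), "r5")]
    f1, _ = grow_function(trace, {"r9", "r10"})
    assert len(f1) == 1                             # only the add collapses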

10/18 Theoretical Speedup

11/18 Number of Inputs

12/18 Non-Collapsible Instructions

13/18 Implementation: the rePlay Framework

14/18 Performance Evaluation

15/18 rePlay Optimization Engine Delay
- Functions are built “offline”

16/18 Latency of Function Units

17/18 Future Work
- Address prediction to overcome load cuts
- Address prediction & cache preloading

[Figure: DFGs with a load cut, before and after address prediction and cache preloading]
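A hedged sketch of the direction named here, assuming a simple per-PC stride predictor (the table layout and names are illustrative): predict the load's next address and preload that line into the cache, so the load need no longer cut the function.

    class StridePredictor:
        """Per-PC last-address + stride table (illustrative)."""
        def __init__(self):
            self.table = {}                     # pc -> (last_addr, stride)

        def predict_and_update(self, pc, addr):
            last, _ = self.table.get(pc, (addr, 0))
            stride = addr - last
            self.table[pc] = (addr, stride)
            return addr + stride                # predicted next address

    pred = StridePredictor()
    for addr in (0x1000, 0x1008, 0x1010):       # a strided load stream
        nxt = pred.predict_and_update(0x400, addr)
    # Preloading line 0x1018 would let the next instance of the
    # function execute without waiting on the load.
    assert nxt == 0x1018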

18/18 Q & A

Carries from Upper Significant Bits

Optimization Engine Delay

Latency of Function Units