Superscalar Architecture Design Framework for DSP Operations Rehan Ahmed.


Overview: An optimization tool that alters superscalar architectural configuration parameters to suit a given DSP application. The parameters describe the architectural blocks (number of ALUs, cache size, etc.).

Motivation: Gives designers an initial idea of what their design should look like. Particularly useful for software-defined radio applications.

Optimizations can target both power consumption and speed.
Target function: SimpleScalar, Wattch
Stage 1: Search and optimization algorithm (simulated annealing)
Stage 2: Heuristic approach

Simulated Annealing

Simulated Annealing: Parameter Set

Sr No  Parameter      Configuration
1      IFQ            1, 2, 4, 16, 32
2      Branch Table   16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, …
3      RAS            16, 32, 64, 128, 256, 512, 1024, 2048, 4096, …
4      BTB            16 4, 32 4, 64 4, 128 4, 256 4, 512 4, …
5      Decode Width   1, 2, 4, 16, 32
6      Issue Width    1, 2, 4, 16, 32
7      Commit Width   1, 2, 4, 16, 32
8      RUU            8, 16, 32, 64, 128, 256, 512, …
9      LSQ            8, 16, 32, 64, 128, 256, 512, …
10     I Cache        4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l, 8192:32:4:l
11     D Cache        4:32:4:l, 8:32:4:l, 16:32:4:l, 32:32:4:l, 64:32:4:l, 128:32:4:l, 256:32:4:l, 1024:32:4:l, 2048:32:4:l
12     Bus Width      4, 8, 16, 32, 64
13     I TLB          1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
14     D TLB          1:1024:4:l, 2:1024:4:l, 4:1024:4:l, 8:1024:4:l, 16:1024:4:l, 32:1024:4:l, 64:1024:4:l, 128:1024:4:l
15     I ALU          1, 2, 4, 8
16     I Mul/Div      1, 2, 4, 8
17     Memory Ports   1, 2, 4, 8
18     FP ALU         1, 2, 4, 8
19     FP Mul/Div     1, 2, 4, 8
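To make Stage 1 concrete, the following is a minimal sketch of simulated annealing over a discrete parameter set like the one above. Only a few parameters are included, with values copied from the table. The cost function `toy_cost`, the cooling schedule (`t_start`, `t_end`, `alpha`, `moves_per_temp`) and the neighbour move are assumptions made for illustration; the slides use a SimpleScalar/Wattch simulation as the target function and do not give these details.

```python
import math
import random

# A few rows of the parameter set; values copied from the table above.
PARAMETER_SET = {
    "decode_width": [1, 2, 4, 16, 32],
    "issue_width": [1, 2, 4, 16, 32],
    "commit_width": [1, 2, 4, 16, 32],
    "int_alu": [1, 2, 4, 8],
    "fp_alu": [1, 2, 4, 8],
    "bus_width": [4, 8, 16, 32, 64],
}


def toy_cost(config):
    """HYPOTHETICAL stand-in for the simulator-based target function.

    Small values are penalised as "slow", large values as "power hungry".
    A real run would invoke the simulator and read back its statistics.
    """
    speed_penalty = sum(1.0 / v for v in config.values())
    power_penalty = 0.01 * sum(config.values())
    return speed_penalty + power_penalty


def neighbour(config, rng):
    """Move one randomly chosen parameter to an adjacent allowed value."""
    name = rng.choice(sorted(config))
    options = PARAMETER_SET[name]
    i = options.index(config[name])
    j = min(max(i + rng.choice((-1, 1)), 0), len(options) - 1)
    new_config = dict(config)
    new_config[name] = options[j]
    return new_config


def simulated_annealing(cost, rng, t_start=1.0, t_end=1e-3, alpha=0.95,
                        moves_per_temp=20):
    """Return (best_config, best_cost) found under a geometric cooling schedule."""
    current = {k: rng.choice(v) for k, v in PARAMETER_SET.items()}
    current_cost = cost(current)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            candidate = neighbour(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha
    return best, best_cost


if __name__ == "__main__":
    best, best_cost = simulated_annealing(toy_cost, random.Random(0))
    print(best, round(best_cost, 3))
```

Restricting each move to an adjacent allowed value keeps candidates close to the current configuration, which is one common choice for ordered discrete parameters; other neighbourhoods would work as well.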

The final configuration from simulated annealing is further optimized using the heuristic approach. The heuristic approach is based on the operating principle of superscalar architecture.

No  Configuration Change     Monitored Result          dir=0 / dir=1
1   Branch Table             Branch_Misses             Incr / Dec
2   BTB                      Gain                      Incr / Dec
3   Return Address Stack     Gain                      Incr / Dec
4   IFQ, Exec Win, I ALU     IFQ_full, Eff_Gain, IPB   Incr / Dec
5   I ALU                    Gain                      Incr / Dec
6   I Mul/Div                Gain                      Incr / Dec
7   FP ALU                   Gain                      Incr / Dec
8   FP Mul/Div               Gain                      Incr / Dec
9   RUU                      Gain                      Dec / Incr
10  LSQ                      Gain                      Dec / Incr
11  I-Compress               Gain                      En
12  I-Cache                  Gain                      Dec / Incr
13  D-Cache                  Gain                      Dec / Incr
14  Instruction TLB          Gain                      Dec / Incr
15  Data TLB                 Gain                      Dec / Incr
16  Bus Width                Gain                      Incr / Dec
17  Memory To System Ports   Gain                      Incr / Dec
18  Exit Stage               Gain                      Nil
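One way to read the table above is as a table-driven refinement pass: each row names a parameter, a monitored result, and the action taken for each direction. The sketch below is an interpretation, not the author's procedure. It assumes that "Incr" and "Dec" move a parameter to the next larger or smaller allowed value, that dir=0 is tried before dir=1, and that a change is kept only when a caller-supplied `gain` function reports an improvement. The names `step_value`, `heuristic_pass` and `gain` are hypothetical.

```python
# Two rows from the heuristic table: (parameter, action for dir=0, action for dir=1).
HEURISTIC_STEPS = [
    ("I ALU", "Incr", "Dec"),
    ("RUU", "Dec", "Incr"),
]

# Allowed values, copied from the simulated annealing parameter set.
OPTIONS = {
    "I ALU": [1, 2, 4, 8],
    "RUU": [8, 16, 32, 64, 128, 256, 512],
}


def step_value(options, value, action):
    """Return the adjacent allowed value for "Incr" or "Dec", clamped at the ends."""
    i = options.index(value)
    if action == "Incr":
        i = min(i + 1, len(options) - 1)
    elif action == "Dec":
        i = max(i - 1, 0)
    return options[i]


def heuristic_pass(config, options, steps, gain):
    """Apply each table row once, keeping a change only if gain(old, new) > 0.

    ASSUMPTION: dir=0 is tried first and dir=1 only if dir=0 did not help.
    The slides do not state how the direction is chosen.
    """
    config = dict(config)
    for name, action_dir0, action_dir1 in steps:
        for action in (action_dir0, action_dir1):
            trial = dict(config)
            trial[name] = step_value(options[name], config[name], action)
            if trial != config and gain(config, trial) > 0:
                config = trial
                break
    return config


if __name__ == "__main__":
    # HYPOTHETICAL gain: pretend 4 integer ALUs and a 16-entry RUU are ideal.
    def cost(cfg):
        return abs(cfg["I ALU"] - 4) + abs(cfg["RUU"] - 16) / 8

    def gain(old, new):
        return cost(old) - cost(new)

    start = {"I ALU": 2, "RUU": 32}
    print(heuristic_pass(start, OPTIONS, HEURISTIC_STEPS, gain))
```

In the tool described by the slides, the gain would come from re-running the simulator after each change rather than from a closed-form cost.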

Optimization Results: IFFT operation, Scale=40 (high precedence given to efficiency)

Results Summary: Optimized configuration performance measures
Instructions per Cycle: …
Average Power per Instruction: …
Instructions per Second (1 GHz): … G
Transistor Count: 10,645,929
Transistor Count for Pentium III: 9,500,000
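The measures in this summary are related by simple arithmetic: instructions per second is instructions per cycle multiplied by the clock frequency, and the two transistor counts can be compared as a ratio. The short sketch below shows both. The transistor counts and the 1 GHz clock come from the summary; the IPC value is a made-up placeholder, because the measured figure is not available in this transcript.

```python
CLOCK_HZ = 1e9                       # 1 GHz, as stated in the summary
OPTIMIZED_TRANSISTORS = 10_645_929   # from the summary
PENTIUM_III_TRANSISTORS = 9_500_000  # from the summary


def instructions_per_second(ipc, clock_hz):
    """Throughput = instructions per cycle x cycles per second."""
    return ipc * clock_hz


def transistor_ratio(design, reference):
    """How many times larger the design is than the reference."""
    return design / reference


if __name__ == "__main__":
    example_ipc = 2.0  # HYPOTHETICAL value for illustration only
    giga_ips = instructions_per_second(example_ipc, CLOCK_HZ) / 1e9
    ratio = transistor_ratio(OPTIMIZED_TRANSISTORS, PENTIUM_III_TRANSISTORS)
    print(f"{giga_ips:.1f} G instructions/s at IPC {example_ipc}")
    print(f"transistor ratio vs Pentium III: {ratio:.3f}")
```

By this ratio the optimized configuration uses roughly 12% more transistors than the Pentium III figure quoted in the summary.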

IFFT Configuration

Parameter                   Value
Instruction Fetch Queue     32
Branch Table Size           32768
Return Address Stack        16
Branch Target Buffer        1024
Instruction Decode Width    32
Instruction Issue Width     2
Instruction Commit Width    32
Register Update Unit        16
Load Store Queue            8
D Cache                     2 KB
I Cache                     4 KB
Memory Bus Width            64 bytes
Instruction TLB             32 KB
Data TLB                    16 KB
Integer ALUs                4
Integer Mul/Div             1
Memory to System Ports      2
Floating Point ALU          1
Floating Point Mul/Div      4
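The cache sizes in this configuration can be related to the colon-separated cache strings in the simulated annealing parameter set. The sketch below assumes those strings follow the SimpleScalar-style field order of number of sets, block size in bytes, associativity, and replacement policy, so that capacity is the product of the first three fields. The slides do not say which strings were selected; under this assumption, 16:32:4:l gives 2 KB and 32:32:4:l gives 4 KB, which match the D Cache and I Cache sizes listed. The function names are hypothetical.

```python
def parse_cache_config(spec):
    """Split "nsets:bsize:assoc:repl" into (nsets, bsize, assoc, repl).

    ASSUMPTION: the field order matches the SimpleScalar convention.
    """
    nsets, bsize, assoc, repl = spec.split(":")
    return int(nsets), int(bsize), int(assoc), repl


def capacity_bytes(spec):
    """Capacity = number of sets x block size x associativity."""
    nsets, bsize, assoc, _ = parse_cache_config(spec)
    return nsets * bsize * assoc


if __name__ == "__main__":
    for spec in ("4:32:4:l", "16:32:4:l", "32:32:4:l"):
        print(spec, capacity_bytes(spec), "bytes")
```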