Application of Instruction Analysis/Synthesis Tools to x86’s Functional Unit Allocation Ing-Jer Huang and Ping-Huei Xie Institute of Computer & Information.

Presentation transcript:

Application of Instruction Analysis/Synthesis Tools to x86’s Functional Unit Allocation Ing-Jer Huang and Ping-Huei Xie Institute of Computer & Information Engineering National Sun Yat-sen University Kaohsiung, Taiwan R. O. C.

Superscalar Model under Investigation Decoupled superscalar architecture –register renaming –branch prediction Assumptions –no cache misses –fast instruction fetcher and decoder –100% correct branch prediction –load/store unit: 2 cycles; others: 1 cycle –large reservation stations (RS) and reorder buffer (ROB)

The Problem Q: How many functional units are needed in an x86-compatible superscalar core? A: The distribution of functional unit usage in typical x86 programs. [Figure: frequency plotted against FU usage; example usage categories: (1A, 1M, 1B), (2A, 1M, 0B), (2A, 2M, 1B), (3A, 0M, 0B), (4A, 2M, 1B)]
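A minimal sketch of how such a usage distribution could be tabulated, assuming the scheduled code is available as a list of time steps, each listing the unit class (A, M, or B) of every micro-operation issued in that step. The function names and the toy schedule are hypothetical and are not taken from the slides.

```python
from collections import Counter


def fu_usage_label(cycle_ops, classes="AMB"):
    """Turn the unit classes used in one time step into a label like '2A, 1M, 0B'."""
    counts = Counter(cycle_ops)
    return ", ".join(f"{counts[c]}{c}" for c in classes)


def fu_usage_distribution(schedule, classes="AMB"):
    """Relative frequency of each FU usage pattern over all time steps."""
    labels = Counter(fu_usage_label(ops, classes) for ops in schedule)
    total = sum(labels.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in labels.items()}


# Toy schedule (made up for illustration): one inner list per time step.
schedule = [["A", "M", "B"], ["A", "A", "M"], ["A", "M", "B"], ["A", "A", "A"]]
dist = fu_usage_distribution(schedule)
for label, freq in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {freq:.0%}")
```

Each distinct label corresponds to one category of the kind shown in the figure, and the frequencies sum to one.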

How to Obtain FU Distribution? Simulation-based approaches [Shinatani, 1995], [Davidson, 1995], [Hara et al., 1996], etc. –Run on different CPU platforms –Slow, but can explore many configurations Monitoring-based approaches [Adams et al., 1989], [Bhandarkar et al., 1997], [Huang, 1997], etc. –Run directly on the same CPU platform –Fast, but work only for the configuration of the underlying CPU platform

A Fast Performance/Cost Approximation Environment

ASIA: Automatic Synthesis of Instruction Set Architecture GOAL: analyze and synthesize application-specific instruction sets for pipelined uniprocessors. APPROACH: a micro-operation scheduling engine based on a simulated annealing algorithm. The superscalar core is an application-specific RISC core for x86 emulation.

ASIA-II: Extensions for Superscalar Architecture Register renaming –Temporary registers are used on the fly to resolve anti and output dependencies. Execution window –Instructions are dispatched sequentially. Branch prediction –Effective sizes of basic blocks are enlarged.

Register Renaming In ASIA-II: output and anti dependencies are ignored during scheduling
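The effect of ignoring output and anti dependencies can be sketched as a filter on the dependency edges passed to the scheduler. The micro-operation encoding below, a `(destination, sources)` pair, and both function names are assumptions made for illustration; the slides do not describe ASIA-II's internal representation. The pairwise check is deliberately simple and does not account for intervening redefinitions of a register.

```python
def classify_dependencies(mops):
    """Classify register dependencies between every earlier/later pair of micro-ops.

    Each micro-op is a (destination, sources) pair; destination may be None.
    """
    edges = []
    for j, (dst_j, srcs_j) in enumerate(mops):
        for i in range(j):
            dst_i, srcs_i = mops[i]
            if dst_i is not None and dst_i in srcs_j:
                edges.append((i, j, "true"))    # read after write
            if dst_j is not None and dst_j in srcs_i:
                edges.append((i, j, "anti"))    # write after read
            if dst_i is not None and dst_i == dst_j:
                edges.append((i, j, "output"))  # write after write
    return edges


def scheduling_edges(mops, renaming=True):
    """Edges the scheduler must respect; with renaming only true dependencies remain."""
    kept = {"true"} if renaming else {"true", "anti", "output"}
    return sorted({(i, j) for i, j, kind in classify_dependencies(mops) if kind in kept})


# Hypothetical micro-op sequence.
mops = [
    ("r1", ("r2", "r3")),  # 0: r1 = r2 op r3
    ("r4", ("r1",)),       # 1: r4 = f(r1)   true dependency on 0
    ("r2", ("r5",)),       # 2: r2 = f(r5)   anti dependency on 0
    ("r1", ("r6",)),       # 3: r1 = f(r6)   output dep. on 0, anti dep. on 1
]
print(scheduling_edges(mops, renaming=True))
print(scheduling_edges(mops, renaming=False))
```

With renaming enabled only the edge from micro-op 0 to micro-op 1 constrains the schedule, which is what gives the scheduler more freedom to reorder.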

Realistic Patterns in the Execution Window Balanced distribution: the objective function includes both time steps and H/W counts. Window effect: MOPs are displaced over a limited distance; a long-distance move is possible through many iterations of displacement, as long as performance is improved.
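The two ideas on this slide can be sketched with a generic simulated annealing loop. This is not the ASIA-II engine: the cost weights, the window size, the cooling schedule, and the acceptance rule (a standard Metropolis test) are all assumptions, as are the function names. The cost adds the number of time steps to the peak number of units needed per class, and each move displaces one micro-operation by at most `window` steps.

```python
import math
import random


def cost(steps, units, w_time=1.0, w_hw=1.0):
    """Weighted sum of time steps used and peak unit count per class."""
    n_steps = max(steps) + 1
    peak = {}
    for t in set(steps):
        per_class = {}
        for s, u in zip(steps, units):
            if s == t:
                per_class[u] = per_class.get(u, 0) + 1
        for u, n in per_class.items():
            peak[u] = max(peak.get(u, 0), n)
    return w_time * n_steps + w_hw * sum(peak.values())


def is_legal(steps, deps, latency):
    """Every consumer must start after its producer's latency has elapsed."""
    return all(s >= 0 for s in steps) and all(
        steps[j] >= steps[i] + latency[i] for i, j in deps)


def anneal(units, deps, latency, window=2, iters=2000,
           temp=2.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    # Start from a fully serial schedule, which is legal when deps follow program order.
    steps = [sum(latency[:k]) for k in range(len(units))]
    current = cost(steps, units)
    best, best_cost = steps[:], current
    moves = [d for d in range(-window, window + 1) if d != 0]
    for _ in range(iters):
        trial = steps[:]
        trial[rng.randrange(len(trial))] += rng.choice(moves)  # limited displacement
        if is_legal(trial, deps, latency):
            c = cost(trial, units)
            if c <= current or rng.random() < math.exp((current - c) / temp):
                steps, current = trial, c
                if c < best_cost:
                    best, best_cost = steps[:], c
        temp *= cooling
    return best, best_cost


# Hypothetical block: memory op takes 2 cycles, others 1 (per the stated assumptions).
units = ["A", "M", "A", "B"]
latency = [1, 2, 1, 1]
deps = [(0, 2), (1, 3)]  # true dependencies only
best, best_cost = anneal(units, deps, latency)
print(best, best_cost)
```

Because a single move is bounded by the window, a micro-operation can only travel far through a chain of accepted moves, which mirrors the window effect described above. Including the hardware term in the cost discourages schedules that pile many operations of one class into a single step.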

Basic Block Expansion (Eblocks) Due to Branch Prediction

A Small Example from Word97

Extended Basic Blocks

Scheduled Eblocks

A Small Example - FU Usage

Description of Benchmark

Micro-operation Level Parallelism (MLP)

Functional Unit Usage Notation: A - Integer unit M - Memory unit B - Branch unit F - Floating-point unit "Others" is the sum of the usage patterns whose frequency is less than 1.0%

Accumulated Coverage of Functional Unit Allocation [Figure: accumulated coverage, annotated with the allocations of AMD K6, IA-64, NSC 98, Pentium Pro, and the base machine]
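One plausible reading of accumulated coverage is the share of time steps whose functional unit usage fits within a given allocation. The sketch below follows that reading, which is an assumption; the slide does not give the definition. The distribution values are made-up placeholders, not results from the presentation, and `coverage` is a hypothetical helper name.

```python
def coverage(distribution, allocation):
    """Fraction of time steps whose (A, M, B) usage fits within the allocation."""
    return sum(
        freq
        for usage, freq in distribution.items()
        if all(used <= available for used, available in zip(usage, allocation))
    )


# Placeholder usage distribution: (A, M, B) counts -> relative frequency.
distribution = {
    (1, 1, 1): 0.4,
    (2, 1, 0): 0.3,
    (2, 2, 1): 0.2,
    (4, 2, 1): 0.1,
}

for allocation in [(1, 1, 1), (2, 1, 1), (2, 2, 1), (4, 2, 1)]:
    print(allocation, round(coverage(distribution, allocation), 3))
```

Under this reading, coverage grows monotonically as units are added, so a designer can look for the smallest allocation beyond which extra units bring little additional coverage.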

Conclusions Synthesis/analysis tools have been used to observe the functional unit usage and MLP in a superscalar core. The speedup over simulation is more than 600 times. FUTURE WORK: investigate various microarchitecture features –register renaming vs. branch prediction –functional unit optimization