Data-path Synthesis of VLIW Video Signal Processor Zhao Wu and Wayne Wolf Dept. of Electrical Engineering, Princeton University.

Outline
Introduction
Architectural paradigm
Trace-driven simulation
Performance estimation
Conclusions

Introduction
Why programmable VSP?
– intense computation
– complex and diverse video applications
– increased development cost
– time-to-market pressure
Why VLIW?
– easy to implement in hardware
– high speed
– high degree of ILP available in video applications

Architecture Paradigm

Architectural Parameters
Register file
– number of registers
Functional unit
– number and type of functional units
Interconnect
– number of clusters
– interconnect mechanism

Impact on MPEG-2 Encoder

Trace-Driven Scheduling
[Flow diagram]
Binary program prog → Run pixie -idtrace → Instrumented program prog.pixie → Dynamic trace
Binary program prog → Run dis -h → Disassembled program prog.asm
Dynamic trace + Disassembled program + Resource description → Scheduler → Result & statistics

Block Diagram of the Scheduler
[Block diagram]
Inputs: program trace, disassembled program, resource description
Output: result & statistics
Components: assembly code parser, dependency analyzer, register manager, memory manager, functional unit manager, resource manager, register scoreboard, memory scoreboard, reservation station, scheduling record, VLIW scheduler

Features of the Scheduler
(Relatively) fast
– instrumentation rather than interpretation
– run time linear in trace length
Moderate memory requirement
– pipelining saves storage
Large scheduling window
– up to 10^9 instructions
– simulates both a VLIW compiler & a VLIW processor
Realistic model
– limited resources
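As a rough illustration of scheduling a dynamic trace under limited resources, the sketch below places each traced operation in the earliest cycle where its source registers are ready and a functional unit of the right type is free. It is not the authors' scheduler: the trace tuple format, the function name `schedule_trace`, the uniform one-cycle latency, and the omission of memory dependencies and anti/output dependencies are all assumptions made to keep the example short.

```python
from collections import defaultdict

def schedule_trace(trace, fu_limits, latency=1):
    """Greedy resource-constrained scheduling of a dynamic trace.

    trace:     iterable of (fu_type, dest_reg, src_regs) tuples
               (hypothetical format, chosen for this sketch).
    fu_limits: dict mapping fu_type -> units available per cycle.
    Returns (cycle_count, used), where used maps (cycle, fu_type)
    to the number of units reserved in that cycle.
    """
    ready_at = defaultdict(int)  # register scoreboard: cycle when a value is ready
    used = {}                    # functional-unit reservations per cycle
    last_cycle = -1
    for fu_type, dest, srcs in trace:
        # Earliest cycle allowed by true (read-after-write) dependencies.
        cycle = max((ready_at[r] for r in srcs), default=0)
        # Slide forward until a unit of the required type is free.
        while used.get((cycle, fu_type), 0) >= fu_limits[fu_type]:
            cycle += 1
        used[(cycle, fu_type)] = used.get((cycle, fu_type), 0) + 1
        if dest is not None:
            ready_at[dest] = cycle + latency
        last_cycle = max(last_cycle, cycle)
    return last_cycle + 1, used

# Made-up four-operation trace, not taken from the presentation.
trace = [
    ("alu", "r1", []),
    ("alu", "r2", []),
    ("alu", "r3", ["r1", "r2"]),
    ("alu", "r4", []),
]
print(schedule_trace(trace, {"alu": 2})[0])  # 2
print(schedule_trace(trace, {"alu": 1})[0])  # 4
```

The work per operation is roughly constant when resources are rarely saturated, which is consistent with the linear-in-trace-length behavior listed above, although this sketch keeps every reservation in memory rather than retiring old cycles.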

Performance Estimation
Why do we need performance estimation?
– trace-driven simulation too slow (trace too long)
– design space too big
How do we estimate?
– start from full-length trace simulation results
– increase resource: lower bound on cycle count
– decrease resource: upper bound on cycle count
[Figure labels: target design, bigger design, smaller design]

IPC Histogram of ALU
Average IPC_ALU = 13.24
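An average IPC of this kind can be computed directly from a histogram of operations issued per cycle. The sketch below assumes the histogram is stored as a mapping from "ALU operations issued in a cycle" to "number of cycles with that count"; the function name and the sample numbers are illustrative and are not the data behind the figure.

```python
def average_ipc(histogram):
    """Average operations per cycle from an issue histogram.

    histogram: dict mapping ops issued in a cycle -> number of such cycles.
    """
    cycles = sum(histogram.values())
    ops = sum(k * n for k, n in histogram.items())
    return ops / cycles

# Made-up histogram: 1 idle cycle, 2 cycles with 8 ops, 1 cycle with 16 ops.
print(average_ipc({0: 1, 8: 2, 16: 1}))  # 8.0
```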

Increase and Decrease Resources

Decrease resource
Split cycles that issue more FU ops and retime
– 16 → 8+8, 15 → 8+7, 14 → 8+6, 13 → 8+5, 12 → 8+4, …
Why upper bound of cycle count
– 7, 6, 5, 4, … could be combined with 1, 2, 3, 4, …
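The splitting rule can be sketched on an issue histogram: every cycle that issues more operations than the reduced limit allows is split into as many cycles as needed, and no leftover pieces are merged with other cycles, which is why the result is an upper bound. The histogram representation and the generalization from the 16-to-8 case to an arbitrary `new_limit` via ceiling division are assumptions of this sketch.

```python
def upper_bound_cycles(histogram, new_limit):
    """Upper bound on cycle count after reducing a resource to new_limit.

    histogram: dict mapping ops issued in a cycle -> number of such cycles,
               taken from the simulation of the larger design.
    """
    total = 0
    for ops, count in histogram.items():
        # Ceiling division; a cycle with zero ops of this type still counts once.
        pieces = max(1, -(-ops // new_limit))
        total += count * pieces
    return total

# Made-up histogram for a 16-unit design: 7 cycles in total.
hist = {16: 1, 12: 2, 4: 3, 0: 1}
print(upper_bound_cycles(hist, 8))  # 10
```

In this example the two 12-op cycles each leave a 4-op remainder that a real scheduler might pack into one of the 4-op cycles, so the true cycle count for the smaller design can be below 10.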

Increase resource
T_new = T_old - T_8
– 16 → 8+8, 15 → 8+7, 14 → 8+6, 13 → 8+5, 12 → 8+4, …
Why lower bound of cycle count
– sometimes can’t merge (e.g. increase from 8 to 12)
– sometimes no parallelism
[Figure label: This cycle removed]

Change More Than One Resource
Have to take into account resource inter-correlation
– {dep_res1,res2,n}: # of cycles when at least one res1-instruction depends on n res2-instructions
Combine several bounds into one semi-bound
Increase resource (m>n): (equation not preserved in transcript)
Decrease resource (m<n): (equation not preserved in transcript)

Results

Conclusions
Trace-driven simulation
– quantitative evaluation of an architecture
– too slow to be applied for every possible design
Performance estimation
– based on simulated results
– automated procedure
– accurate enough