Christopher Han-Yu Chou Supervisor: Dr. Guy Lemieux


VIPERS II: A Soft-core Vector Processor with Single-copy Data Scratchpad Memory Christopher Han-Yu Chou Supervisor: Dr. Guy Lemieux

Outline Motivation New Pipeline Structure VIPERS II Architecture Results Conclusion

Motivation The VIPERS soft vector processor provides scalable performance for data-parallel applications on FPGAs. The original VIPERS has a few shortcomings:
- High latency for copying data from memory to the register file
- Duplicate copies of data in precious on-chip memory
- Scalar core is not pipelined and has no debug core

Duplicate Copies of Data VIPERS uses a dual read-port vector register file:
- 2 identical copies of the register file
- Plus an original copy of the data in on-chip memory
These duplicates are wasteful given the limited on-chip memory capacity. Today's FPGAs offer fast on-chip memories, so why not access the memory directly?

Contribution Use address registers and a scratchpad memory to replace the vector register file:
- Eliminates slow load/store operations
- More efficient on-chip memory usage
Auto-increment/decrement and circular buffer features:
- Reduce the need for loop unrolling
- Lower loop overhead

Outline Motivation New Pipeline Structure VIPERS II Architecture Results Conclusion

New Pipeline Structure Starting from the classic 5-stage pipeline, the execution stage is swapped with the memory access stage. Note that the names of the stages change to memory read and memory write.

Implementation The "data" register file is replaced by address registers and a scratchpad memory. This eliminates load/store operations when the data set fits in the scratchpad memory.
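The idea above can be sketched behaviorally. This is a toy model, not the VIPERS II RTL: the scratchpad size, address-register names, and instruction shape are all hypothetical, but it shows instructions operating on memory through address registers with no separate vector load/store step.

```python
# Behavioral sketch of scratchpad-resident vector execution.
# All names and sizes here are illustrative, not from the actual ISA.
SCRATCHPAD = [0] * 64          # hypothetical word-addressed scratchpad

def vadd(dst_addr, src_a, src_b, vl):
    """Add vl elements directly in the scratchpad; no register file,
    no vector load/store instructions in between."""
    for i in range(vl):
        SCRATCHPAD[dst_addr + i] = SCRATCHPAD[src_a + i] + SCRATCHPAD[src_b + i]

# Place two input vectors in the scratchpad and add them in place.
SCRATCHPAD[0:4] = [1, 2, 3, 4]      # vector at address register vA0 = 0
SCRATCHPAD[4:8] = [10, 20, 30, 40]  # vector at vA1 = 4
vadd(8, 0, 4, 4)                    # result written starting at vA2 = 8
```

In a register-file design, the same operation would cost a vector load of each operand and a vector store of the result on top of the add.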

VIPERS II ISA

Outline Motivation New Pipeline Structure VIPERS II Architecture Results Conclusion

VIPERS II Architecture

Architectural Changes
- Vector address registers
- Vector scratchpad memory
- Data alignment crossbar network (DACN)
- Fracturable ALUs

Vector Address Registers
- Feature auto post-increment, pre-decrement, and circular buffer modes
- Reduce loop overhead
- Fewer address registers than data registers are required to implement an application
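A minimal sketch of the post-increment and circular buffer update modes (the function name and interface are invented for illustration). Auto-increment removes the explicit pointer bump from the loop body, which is the loop overhead the slide refers to.

```python
# Sketch of address-register update modes: post-increment, optionally
# wrapping inside a circular buffer. Names/interface are hypothetical.
def post_increment(addr, step, base=0, size=None, circular=False):
    """Return (element address to use now, updated address register)."""
    nxt = addr + step
    if circular and size is not None:
        nxt = base + (nxt - base) % size   # wrap inside the circular buffer
    return addr, nxt

# Walk a 4-entry circular buffer starting at base address 16.
addr, trace = 16, []
for _ in range(6):
    use, addr = post_increment(addr, 1, base=16, size=4, circular=True)
    trace.append(use)
# trace -> [16, 17, 18, 19, 16, 17]
```

The wrap-around makes sliding-window kernels (e.g. filters) address their taps without any per-iteration address arithmetic in software.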

Vector Address Register

Vector Scratchpad Memory
- Reduced load/store latencies with a simpler memory interface
- Operates at 2X clock

Vector Scratchpad Memory
- Efficient data storage
- Fewer restrictions on data set size, e.g. the median filter benchmark with byte-size data

Data Alignment Crossbar Network With vector lanes coupled directly to memory, input vectors must be aligned. For misaligned operands, the vector move instruction (vmov) is used to move data into alignment.
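The alignment condition can be stated compactly. This is a simplified model under the assumption that an operand's lane is its scratchpad address modulo the number of lanes; the helper name is invented.

```python
# Sketch: with lanes wired directly to memory, two operands can feed
# the ALUs element-by-element only if they start at the same lane
# offset. Otherwise a vmov must first copy one operand into alignment.
NUM_LANES = 4   # hypothetical lane count

def needs_vmov(addr_a, addr_b):
    """Operands are aligned when they map to the same lane offset."""
    return (addr_a % NUM_LANES) != (addr_b % NUM_LANES)

# Addresses 0 and 8 both start at lane 0: no move needed.
# Addresses 0 and 5 start at lanes 0 and 1: vmov required first.
```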

Example

Data Alignment Crossbar Network Implemented as a multistage switching network to trade off performance for area:
- Crossbar: quadratic area growth
- DACN: N log(N) area growth, but slower

Fracturable ALUs Data elements are stored at their natural length; fracturable ALUs execute on operands of varying widths.

Fracturable ALUs

Fracturable ALUs Increased processing power: a 4-lane VIPERS II operating on byte-size data is equivalent to having 16 lanes. If VL is increased to 64 to fully utilize the pipeline, it takes as little as 70 cycles per pixel.
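The lane math can be illustrated with a software model of a fracturable adder: one 32-bit lane computing four independent 8-bit additions with the carry chain cut at each byte boundary. This is only a behavioral sketch of the general technique, not the actual VIPERS II ALU implementation.

```python
# Model of a 32-bit lane fractured into four 8-bit sub-ALUs:
# each byte adds independently, with no carry into the next byte.
def frac_add8(a, b):
    result = 0
    for byte in range(4):
        shift = 8 * byte
        sa = (a >> shift) & 0xFF
        sb = (b >> shift) & 0xFF
        result |= ((sa + sb) & 0xFF) << shift   # carry cut per byte
    return result

# 0x01020304 + 0x10203040 = 0x11223344, computed byte by byte.
```

This is why a 4-lane configuration delivers 16 byte-wide results per cycle: each 32-bit lane holds four packed byte elements.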

Outline Motivation New Pipeline Structure VIPERS II Architecture Results Conclusion

Resource Usage (with DSP breakdown)

Simulated Performance

Hardware Performance

Future Work
- Increase operating frequency
- Implement strided and indexed moves
- Implement the DACN with an Omega network
- Alternative implementation of the address registers

Related Work VESPA (Rose, CASES'08) and VIPERS (Lemieux, FPGA'08) are two previous soft vector processors; VIPERS II uses a vector scratchpad memory instead of a register file. IBM's Cell processor (Pham, ISSCC'05) features an SRAM scratchpad memory populated by DMA; VIPERS II does not require load/store operations. The register pointer architecture (Dally, DATE'07) reduces the need for loop unrolling by dynamically changing register pointers; VIPERS II is the first vector processor to utilize this technique.

Conclusion The VIPERS II architecture provides many advantages:
- Improved performance by eliminating slow load/store operations
- Unrolled performance without unrolling
- Efficient usage of on-chip memory
- Increased processing power when executing smaller operands

Thank you

Vector Scratchpad Memory e.g. Largest median filter that can be realized given a 64kb memory budget

Implementation

Strided/Indexed Access Strided/indexed loads are replaced by strided/indexed move operations. Similar to 'vmov', the strided move 'vmovs' simply moves scattered elements to contiguous locations in memory, e.g. vmovs vA1, vA0, vstride0;
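The strided move described above can be modeled as a gather within the scratchpad. This is a behavioral sketch only: the function mirrors the 'vmovs' semantics described on the slide, but the argument order and memory model are simplified assumptions.

```python
# Sketch of 'vmovs': copy every stride-th element from a source region
# of the scratchpad to contiguous destination locations.
def vmovs(mem, dst, src, stride, vl):
    for i in range(vl):
        mem[dst + i] = mem[src + i * stride]

mem = list(range(16))     # scratchpad contents 0..15
vmovs(mem, 0, 1, 3, 4)    # gather elements at 1, 4, 7, 10 into 0..3
# mem[0:4] -> [1, 4, 7, 10]
```

After the move, the now-contiguous data can feed the lanes directly, which is why a separate strided load instruction is unnecessary.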

Permutation Requirement Shown by figure: offset, stride, and index accesses