VEGAS: A Soft Vector Processor Aaron Severance Some slides from Prof. Guy Lemieux and Chris Chou 1.

Presentation transcript:

VEGAS: A Soft Vector Processor
Aaron Severance
Some slides from Prof. Guy Lemieux and Chris Chou

Outline
- Motivation
- Vector Processing Overview
- VEGAS Architecture
- Example programs
- Advanced Features

Motivation
- DE1/DE2 audio/video processing options
  - Nios: easy but slow
  - Customized system: fast but hard
  - VEGAS: pretty fast, pretty easy
- VEGAS processor is in the v4 build of UBC's DE1 media computer
  - Speed up applications yet still write C code

Overview of Vector Processing

Acceleration with Vector Processing
- Organize data as long vectors
- Data-level parallelism
- Vector instruction execution
  - Multiple vector lanes (SIMD)
  - Repeated SIMD operation over length of vector
[Figure: source vector registers, vector lanes, destination vector register]
Scalar loop:
    for (i = 0; i < NELEM; i++)
        a[i] = b[i] * c[i];
Equivalent vector instruction:
    vmult a, b, c
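The lane behaviour on this slide can be traced in plain C. What follows is a minimal sketch, not VEGAS code: `NUM_LANES` and `vmult_emulated` are hypothetical names, and the inner loop only stands in for lanes that would run in parallel in hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lane count for illustration; the real design is configurable. */
#define NUM_LANES 4

/* Emulates "vmult a, b, c". The outer loop is the SIMD step repeated over
 * the vector length; the inner loop stands in for the parallel lanes. */
static void vmult_emulated(int32_t *a, const int32_t *b, const int32_t *c,
                           size_t nelem)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t base = 0; base < nelem; base += NUM_LANES) {
        for (size_t lane = 0; lane < NUM_LANES; lane++) {
            size_t i = base + lane;
            if (i < nelem) {            /* the last step may be partly filled */
                a[i] = b[i] * c[i];
            }
        }
    }
}
```

With 4 lanes and 6 elements the outer loop runs twice: one full step and one step that uses only two lanes. Changing `NUM_LANES` changes the number of steps but not the result, which is the sense in which the same code scales with lane count.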

Advantages of Vector Processing
- Simple programming model
  - Short to long vector data parallelism
  - Regular, easy to accelerate
- Scalable performance and area
  - DE1 only has room for one vector lane, but removing other components could make room for more
  - Larger FPGAs can support multiple lanes
- The exact same code runs faster

Hybrid vector-SIMD
    for (i = 0; i < NELEM; i++) {
        C[i] = A[i] + B[i];
        E[i] = C[i] * D[i];
    }
[Figure: execution order of the C and E operations]

VEGAS Architecture

[Figure: VEGAS system block diagram]
- Scalar core: 200 MHz
- DMA engine and external DDR2
- Vector core: 120 MHz
- Concurrent execution, FIFO synchronized

Key Features of VEGAS
- Configurable vector processor
  - Selectable performance/area tradeoff
    - Working in FPGA: 1 lane … 128 lanes
    - More lanes possible
  - Fracturable ALUs: 1x32, 2x16, 4x8
- Scratchpad-based "register file"
  - Very long vectors
  - Explicitly managed memory communication
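The slides do not describe how the fracturable ALUs are built. As a software analogy only, the sketch below shows one 32-bit word being added as 1x32, 2x16 or 4x8 by stopping carries at sub-word boundaries. The function names are hypothetical and this is a standard bit-masking trick, not the VEGAS hardware design.

```c
#include <assert.h>
#include <stdint.h>

/* 1x32: one ordinary 32-bit add, carries propagate across the whole word. */
static uint32_t add_1x32(uint32_t x, uint32_t y)
{
    return x + y;
}

/* 2x16: add the low 15 bits of each half, then patch in the top bit of each
 * half with XOR so that no carry crosses the 16-bit boundary. */
static uint32_t add_2x16(uint32_t x, uint32_t y)
{
    uint32_t low = (x & 0x7FFF7FFFu) + (y & 0x7FFF7FFFu);
    return low ^ ((x ^ y) & 0x80008000u);
}

/* 4x8: same idea with four independent byte-wide sums. */
static uint32_t add_4x8(uint32_t x, uint32_t y)
{
    uint32_t low = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return low ^ ((x ^ y) & 0x80808080u);
}

/* Small self-check that the three modes really differ on the same inputs. */
static void check_fracture_modes(void)
{
    assert(add_1x32(0x0001FFFFu, 0x00010001u) == 0x00030000u);
    assert(add_2x16(0x0001FFFFu, 0x00010001u) == 0x00020000u);
}
```

The same operands give `0x00030000` as one word but `0x00020000` as two halfwords, because the overflow of the low half is discarded rather than carried into the high half.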

Scratchpad Memory + AF
- One vector (e.g., V0)
- No vector length restrictions
- No address alignment (starting offset) restrictions
- Distributed vector data

Scratchpad Memory in Action
[Figure: vector scratchpad memory and vector lanes 0 to 3; operand labels srcA, srcB, Dest]

Scratchpad Memory in Action
[Figure: operand labels srcA, Dest]

Performance
[Table: columns Benchmark, NiosII/f, VEGAS V1, VEGAS V32, NiosII/V32 Speedup; rows fir, motest, median, autocor, conven, imgblend, filt3x; the numeric values did not survive in this transcript]

Example Problems

Overall Process
1. Allocate vectors in scratchpad
2. Move data from memory -> scratchpad
3. Point vector address registers to data in scratchpad
4. Perform vector operation
5. Move data from scratchpad -> memory
6. Check result using Nios

Example #1: Vector * Constant
    int data[128] = { 0, 1, 2, 3, 4, 5, ..., 127 };
    int multiplier = 3;
- Allocate vectors in scratchpad
    int *vector_data;
    vector_data = vegas_malloc( 128*4 );               // 128 words long, in scratchpad
- Move data from memory -> scratchpad
    vegas_dma_to_vector( vector_data, data, 128*4 );   // copy from 'data'
- Point vector address registers to data in scratchpad
    vegas_set( VADDR, V1, vector_data );               // can use V1..V7 address reg.
    vegas_set( VCTRL, VL, 128 );                       // # of elements
- Perform vector operation
    vegas_wait_for_dma();                              // wait for DMA copy to finish
    vegas_vsw( VMULLO, V1, V1, multiplier );           // only 1 VEGAS instruction
- Move data from scratchpad -> memory
    vegas_instr_sync();                                // wait for all VEGAS instr
    vegas_dma_to_host( data, vector_data, 128*4 );     // copy results back from the scratchpad
    vegas_wait_for_dma();                              // wait for DMA copy to finish
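To follow the five steps without the board, the flow can be mimicked on a host machine. The sketch below is an assumption-laden stand-in: `scratchpad`, `sp_alloc_words`, `dma_copy`, `vec_mul_scalar` and `scale_in_scratchpad` are hypothetical helpers, none of them part of the VEGAS library, and the "DMA" is a synchronous `memcpy`, so no wait calls are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of the scratchpad; sizes are arbitrary. */
#define SP_WORDS 1024
static int32_t scratchpad[SP_WORDS];
static size_t  sp_used = 0;

/* Step 1 stand-in: bump allocator inside the modelled scratchpad. */
static int32_t *sp_alloc_words(size_t words)
{
    assert(sp_used + words <= SP_WORDS);
    int32_t *p = &scratchpad[sp_used];
    sp_used += words;
    return p;
}

/* Steps 2 and 5 stand-in: a blocking copy instead of an asynchronous DMA. */
static void dma_copy(int32_t *dst, const int32_t *src, size_t words)
{
    memcpy(dst, src, words * sizeof(int32_t));
}

/* Step 4 stand-in: one vector-scalar multiply over `vl` elements. */
static void vec_mul_scalar(int32_t *dst, const int32_t *src, int32_t k,
                           size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = src[i] * k;
    }
}

static void scale_in_scratchpad(int32_t *data, size_t n, int32_t multiplier)
{
    int32_t *v = sp_alloc_words(n);       /* 1. allocate in scratchpad     */
    dma_copy(v, data, n);                 /* 2. memory -> scratchpad       */
    /* 3. the address register and vector length are just `v` and `n` here */
    vec_mul_scalar(v, v, multiplier, n);  /* 4. perform the vector op      */
    dma_copy(data, v, n);                 /* 5. scratchpad -> memory       */
}
```

Step 6 of the overall process, checking the result on the scalar core, corresponds to comparing each `data[i]` against `3 * i` after the call.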


Example: Brighten Screen
- RGB packed into 16 bits (5-6-5)
    for (y = 0; y < MAX_Y_PIXELS; y++) {
        pPixel = getPixelAddr(0, y);
        for (x = 0; x < MAX_X_PIXELS; x++) {
            colour = *pPixel;
            r = (colour >> 10) & 0x3E;
            g = (colour >> 5) & 0x3F;
            b = (colour << 1) & 0x3E;
            r = min(r+2, 62);
            g = min(g+2, 63);
            b = min(b+2, 62);
            colour = (r << 10) | (g << 5) | (b >> 1);
            *pPixel++ = colour;
        }
    }
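A worked single-pixel version makes the bit arithmetic easier to check. This is a sketch under assumptions: `brighten_pixel` and `min_u16` are hypothetical names, and the final packing step follows the shifts used in the "merge back into packed RGB form" part of the deck.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t min_u16(uint16_t a, uint16_t b)
{
    return a < b ? a : b;
}

/* Brightens one 5-6-5 pixel. R and B are held as 6-bit values with a zero
 * low bit, so adding 2 steps each 5-bit channel by one level. */
static uint16_t brighten_pixel(uint16_t colour)
{
    uint16_t r = (uint16_t)((colour >> 10) & 0x3E);
    uint16_t g = (uint16_t)((colour >> 5) & 0x3F);
    uint16_t b = (uint16_t)((colour << 1) & 0x3E);

    r = min_u16((uint16_t)(r + 2), 62);   /* saturate instead of wrapping */
    g = min_u16((uint16_t)(g + 2), 63);
    b = min_u16((uint16_t)(b + 2), 62);
    assert(r <= 62 && g <= 63 && b <= 62);

    return (uint16_t)((r << 10) | (g << 5) | (b >> 1));
}
```

For a black pixel `0x0000` each channel becomes 2, which packs to `0x0800 | 0x0040 | 0x0001 = 0x0841`. A white pixel `0xFFFF` is already saturated and comes back unchanged.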

Designing for VEGAS
- Brighten one row of pixels at a time
- Move row into scratchpad
- Process data
  - Separate into R, G, and B vectors
  - Add 2 to each
  - Check for overflow
- Move data back to main memory
- See vegas_demo1.c in hw files on website

Setting up vectors/address registers
- Pointers point to vectors in scratchpad
    unsigned short *vR;
    unsigned short *vG;
    unsigned short *vB;
- Malloc allocates space for the vector
    vR = vegas_malloc(MAX_X_PIXELS*sizeof(unsigned short));
    vG = vegas_malloc(MAX_X_PIXELS*sizeof(unsigned short));
    vB = vegas_malloc(MAX_X_PIXELS*sizeof(unsigned short));
- Address registers get set to pointers
    vegas_set(VCTRL,VL,MAX_X_PIXELS);
    vegas_set(VADDR,V1,vR);
    vegas_set(VADDR,V2,vG);
    vegas_set(VADDR,V3,vB);

Transferring data to the scratchpad
    for(y = 0; y < MAX_Y_PIXELS; y++){
- DMA transfer line to scratchpad
        pLine = getPixelAddr(0,y);
        vegas_dma_to_vector(vR, pLine, MAX_X_PIXELS*sizeof(unsigned short));
- Wait until finished before processing
        vegas_wait_for_dma();

Process data (part 1)
- The packed line starts in vR (V1). Separate R, G, B
        vegas_svh(VSLL,V3,1,V1);      //b = line << 1;
        vegas_svh(VSRL,V2,5,V1);      //g = line >> 5;
        vegas_svh(VSRL,V1,10,V1);     //r = line >> 10;
        vegas_vsh(VAND,V3,V3,0x3E);   //b = b & 0x3E;
        vegas_vsh(VAND,V2,V2,0x3F);   //g = g & 0x3F;
        vegas_vsh(VAND,V1,V1,0x3E);   //r = r & 0x3E;
- svh means 'scalar-vector halfword'
  - vs means 'vector-scalar', vv means 'vector-vector'
  - h = halfword, b = byte, w = word
- VSLL/VSRL are opcodes
  - Some have an unsigned variant ending in U
- Operand order: destination, source A, source B

Process data (part 2)
- Add two and check for overflow
        vegas_vsh(VADD,V3,V3,2);      //b = b + 2;
        vegas_vsh(VADD,V2,V2,2);      //g = g + 2;
        vegas_vsh(VADD,V1,V1,2);      //r = r + 2;
        vegas_vsh(VMIN,V3,V3,62);     //b = min(b,62);
        vegas_vsh(VMIN,V2,V2,63);     //g = min(g,63);
        vegas_vsh(VMIN,V1,V1,62);     //r = min(r,62);
- Merge back into packed RGB form
        vegas_svh(VSRL,V3,1,V3);      //b = b >> 1
        vegas_svh(VSLL,V2,5,V2);      //g = g << 5
        vegas_svh(VSLL,V1,10,V1);     //r = r << 10
        vegas_vvh(VOR,V3,V3,V2);      //b = b | g
        vegas_vvh(VOR,V3,V3,V1);      //b = b | r
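The instruction sequence from parts 1 and 2 can be replayed on ordinary arrays to see that whole rows move through the same steps. In this sketch every `v_*` helper is a hypothetical loop standing in for one halfword vector instruction; the names and signatures are illustrative and are not the VEGAS macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each helper models one vector instruction applied to `vl` halfwords. */
static void v_sll(uint16_t *d, const uint16_t *s, unsigned sh, size_t vl)
{ for (size_t i = 0; i < vl; i++) d[i] = (uint16_t)(s[i] << sh); }

static void v_srl(uint16_t *d, const uint16_t *s, unsigned sh, size_t vl)
{ for (size_t i = 0; i < vl; i++) d[i] = (uint16_t)(s[i] >> sh); }

static void v_and(uint16_t *d, const uint16_t *s, uint16_t k, size_t vl)
{ for (size_t i = 0; i < vl; i++) d[i] = (uint16_t)(s[i] & k); }

static void v_add(uint16_t *d, const uint16_t *s, uint16_t k, size_t vl)
{ for (size_t i = 0; i < vl; i++) d[i] = (uint16_t)(s[i] + k); }

static void v_min(uint16_t *d, const uint16_t *s, uint16_t k, size_t vl)
{ for (size_t i = 0; i < vl; i++) d[i] = s[i] < k ? s[i] : k; }

static void v_or(uint16_t *d, const uint16_t *a, const uint16_t *b, size_t vl)
{ for (size_t i = 0; i < vl; i++) d[i] = (uint16_t)(a[i] | b[i]); }

/* On entry r holds the packed row; on exit b holds the brightened row. */
static void brighten_row(uint16_t *r, uint16_t *g, uint16_t *b, size_t vl)
{
    assert(r != NULL && g != NULL && b != NULL);
    v_sll(b, r, 1, vl);     v_srl(g, r, 5, vl);     v_srl(r, r, 10, vl);
    v_and(b, b, 0x3E, vl);  v_and(g, g, 0x3F, vl);  v_and(r, r, 0x3E, vl);
    v_add(b, b, 2, vl);     v_add(g, g, 2, vl);     v_add(r, r, 2, vl);
    v_min(b, b, 62, vl);    v_min(g, g, 63, vl);    v_min(r, r, 62, vl);
    v_srl(b, b, 1, vl);     v_sll(g, g, 5, vl);     v_sll(r, r, 10, vl);
    v_or(b, b, g, vl);      v_or(b, b, r, vl);
}
```

Note the order of the first three operations: b and g are extracted before r is overwritten, because r initially doubles as the packed input, just as V1 does on the slide.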

Transfer back to main memory
- Wait for vector core to finish
        vegas_instr_sync();
- DMA the merged line (held in vB) back to main memory
        vegas_dma_to_host(pLine, vB, MAX_X_PIXELS*sizeof(unsigned short));
    }   // closes the for(y) loop over lines
- Don't have to wait_for_dma() until you read data

Advanced: Double buffering
- Example starts DMA, immediately waits
  - But vector core and DMA can be concurrent
- Use two buffers
  - Transfer to one while processing the other
  - Switch buffers when done
- See vegas_demo2.c for an example
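The buffer schedule can be sketched independently of the hardware. The code below is not taken from vegas_demo2.c; `ROW_WIDTH`, `copy_row`, `process_row` and `process_frame` are hypothetical, and the copies are blocking `memcpy` calls, so it shows only the ordering of transfers, not real overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROW_WIDTH 8   /* hypothetical row length for the sketch */

/* Stand-in for a DMA transfer of one row in either direction. */
static void copy_row(uint16_t *dst, const uint16_t *src)
{
    memcpy(dst, src, ROW_WIDTH * sizeof(uint16_t));
}

/* Stand-in for the vector work done on one row. */
static void process_row(uint16_t *row)
{
    for (size_t x = 0; x < ROW_WIDTH; x++) {
        row[x] = (uint16_t)(row[x] + 1);
    }
}

static void process_frame(uint16_t frame[][ROW_WIDTH], size_t rows)
{
    static uint16_t buf[2][ROW_WIDTH];    /* the two scratchpad buffers */
    assert(frame != NULL);
    if (rows == 0) {
        return;
    }
    copy_row(buf[0], frame[0]);           /* prime the first buffer */
    for (size_t y = 0; y < rows; y++) {
        size_t cur = y & 1u;              /* buffer holding row y */
        if (y + 1 < rows) {
            /* On hardware this transfer would run while row y is processed. */
            copy_row(buf[cur ^ 1u], frame[y + 1]);
        }
        process_row(buf[cur]);
        /* On hardware, this write-back must finish before buf[cur] is
         * refilled two iterations later. */
        copy_row(frame[y], buf[cur]);
    }
}
```

The point of the pattern is that the fetch of row y+1 is issued before row y is processed, so with an asynchronous DMA engine the wait would move from just after the fetch to just before the buffer is next used.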

More advanced Features
- Data-dependent conditional execution
  - Vector flag registers
- Vector addressing modes
  - Unit stride
  - Type conversion
  - Constant stride
[Figure: Vector Merge Operation; labels: source registers, flag register, destination register]
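The slide only names these features, so the following is a sketch of what they commonly mean on vector machines rather than a description of VEGAS. All names are hypothetical, and the choice that a set flag selects the first source is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sets flag[i] when a[i] < b[i]; models a compare writing a flag register. */
static void v_cmp_lt(uint8_t *flag, const int32_t *a, const int32_t *b,
                     size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        flag[i] = (uint8_t)(a[i] < b[i]);
    }
}

/* Vector merge: each element is picked from a or b according to its flag. */
static void v_merge(int32_t *dst, const int32_t *a, const int32_t *b,
                    const uint8_t *flag, size_t vl)
{
    assert(dst != NULL && flag != NULL);
    for (size_t i = 0; i < vl; i++) {
        dst[i] = flag[i] ? a[i] : b[i];
    }
}

/* Constant-stride load: element i comes from base[i * stride].
 * A stride of 1 is the unit-stride case. */
static void v_load_strided(int32_t *dst, const int32_t *base, size_t stride,
                           size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = base[i * stride];
    }
}
```

Compare followed by merge replaces a per-element `if`: with flags from `a < b`, merging a over b yields the element-wise minimum without any branch.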

Example: Simple 5x5 Median Filtering
Pseudocode (Bubble sort)
    Load the 25 pixel vectors P[0..24]
    For i=0 to 12 {
        minimum = P[i]
        For j=i+1 to 24 {
            if (P[j] < minimum) {
                swap (minimum, P[j])
            }
        }
    }
- Slide "window" after 1 median
- Each window 5x5 values
- Each value = CPU register
- Repeated over entire image
[Figure: 1st window = Vector[0], 2nd window = Vector[1], VL = # of windows, output pixel]

Example: Simple 5x5 Median Filtering
- Bubble sort on vector registers
- Vmin, Vmax to do swap
- "VL" results at once!
  - 25 rows -> 25 vector registers
  - "VL" pixels each
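The min/max swap can be checked on a host with plain arrays. In this sketch `p[j][k]` is pixel j of window k, so one call to the hypothetical `vmin_vmax_swap` plays the role of a Vmin/Vmax pair over all VL windows; the names and the `uint8_t` pixel type are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW 25   /* 5x5 pixels per window */

/* Element-wise compare-and-swap: afterwards lo[k] <= hi[k] for every k.
 * No branch depends on the data, which is what makes it vectorisable. */
static void vmin_vmax_swap(uint8_t *lo, uint8_t *hi, size_t vl)
{
    for (size_t k = 0; k < vl; k++) {
        uint8_t a = lo[k];
        uint8_t b = hi[k];
        lo[k] = a < b ? a : b;
        hi[k] = a < b ? b : a;
    }
}

/* Partially sorts the 25 pixel vectors; p[12] ends up holding the median
 * of every window. Only passes 0..12 are needed, as in the pseudocode. */
static void median25(uint8_t *p[WINDOW], size_t vl)
{
    assert(p != NULL);
    for (size_t i = 0; i <= WINDOW / 2; i++) {
        for (size_t j = i + 1; j < WINDOW; j++) {
            vmin_vmax_swap(p[i], p[j], vl);
        }
    }
}
```

After pass i, `p[i]` holds the i-th smallest value of each window, so stopping after pass 12 leaves the 13th smallest of 25, the median, in `p[12]` for all windows at once.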