SPARSE MATRIX VECTOR MULTIPLICATION
Mihir Awatramani, Lakshmi Kiran Tondehal, Xinying Wang, Y. Ravi Chandra

SPARSE MATRICES
WHAT ARE THEY? Simply, matrices with a large number of zero elements.
WHERE ARE THEY USED? Sparse matrices arise when systems are modelled as large differential equations. Typical domains are image processing, industrial process simulations, and data retrieval.
WHY ARE CONVENTIONAL ALGORITHMS NOT EFFICIENT FOR SPARSE MATRICES? Processing sparse matrices takes large processing time, and storing the redundant zero elements adds a huge overhead.

BASICS OF SPARSE MATRICES
Storage formats range from Compressed Sparse Row / Column (CSR / CSC) to Matrix Market.

FORMAT INDEPENDENCE

MOTIVATION FOR ALTERNATE STRATEGIES
- Low memory bandwidth
- Irregular memory access patterns
- High latency of load/store instructions
- High ratio of load/store instructions

CONVEY - A QUICK LOOK INSIDE
- The AEH runs scalar instructions and routes memory requests from the AEs.
- 8 memory controllers enable parallel and pipelined access to memory.
- A 256 MB coherent cache serves memory requests from the coprocessor to host memory.
- It also has 4 FPGAs for user-defined Application Personalities!

Details of the C Code
[Diagram: sequential host processor, AEH, application engines AE1-AE4, memory banks MB1-MB3]
- The host processor populates the input matrices.
- The COP_CALL routine passes the base addresses to the coprocessor.
- Memory is allocated for array 1 from mem_base 1, for array 2 from mem_base 2, and for the result from mem_base 3.

Details of the Assembly Code
[Diagram: AEH, AE1, application engine registers AEG1-AEG31, memory banks MB1-MB3, main memory]
- Assembly is used to move the base addresses to the Application Engine (AEG) registers.
- Logical operations: AND, OR, XOR.
- Arithmetic operations: multiplication, addition.
- Complex calculations involving vectors can be done without writing VHDL code.

MEMORY INTERFACING
[Diagram: our module issues address and data requests (A0/A1, D0/D1) to main memory through memory controller MC0; a 256-entry reorder queue (ROQ0, IDs 0-255) tags each request with an ID and matches returning data to its request via the pop and data-valid signals]

IMPLEMENTATION
- The master control gives the base address and generates 21 load signals; the address decoder routes them (e.g. 0X454C…..400, 0X454C…..040) to the MCs, ROQs and memory.
- On start read, the data read engine waits for data valid; in this way, we do 21 reads from the data bus into the input buffers, then raises read complete / load complete.
- We now have the required inputs for SMVM: on start SMVM, the engine processes and gives a done signal.
- On start write, in this way, we write all 11 outputs from the output buffer to memory.
- One cycle of computation is complete!

Simulation Results - Co-Processor Instruction Execution
- Six move instructions are decoded; the base address and size values are moved to internal registers.
- The CAEP instruction is decoded, which starts the custom personality.

Simulation Results – Load Request to MC
- The read procedure starts, and 21 data load requests are sent.
- For each request, the address is checked and decoded, an ID from the ROQ is appended, and the load request is sent to the respective MC.
- The read process starts after the requests have been sent to the MCs.

Simulation Results – Receive Data from MC through ROQ
- The load process starts when valid data is available at the MCs.
- Data is read sequentially from MC0 – MC1 – MC2.
- The load process is done after receiving 21 data inputs; the next read starts (if there is nothing to write).

Simulation Results – Write Back Results from SpMV Engine using MC
- The write starts when valid data is received from the SpMV engine.
- The address is decoded and store requests are sent to the respective MCs.
- The write process is done after 11 store operations, and the next read cycle starts.

FUTURE SCOPE
- Increasing memory bandwidth
- Partitioning SMVM calculations across the four Application Engines