SPARSE MATRIX VECTOR MULTIPLICATION
Mihir Awatramani, Lakshmi Kiran Tondehal, Xinying Wang, Y. Ravi Chandra

SPARSE MATRICES
WHAT ARE THEY? Simply, matrices with a large number of zero elements.
WHERE ARE THEY USED? Sparse matrices arise when systems are modelled as large differential equations. Typical domains are image processing, industrial process simulations, and data retrieval.
WHY ARE CONVENTIONAL ALGORITHMS NOT EFFICIENT FOR SPARSE MATRICES? Processing sparse matrices takes large processing time, and storing the redundant zero elements adds a huge overhead.

BASICS OF SPARSE MATRICES
Storage formats range from Compressed Sparse Row / Column (CSR / CSC) to Matrix Market.

FORMAT INDEPENDENCE

MOTIVATION FOR ALTERNATE STRATEGIES
- Low memory bandwidth
- Irregular memory access patterns
- High latency of load/store instructions
- High ratio of load/store instructions

CONVEY - A QUICK LOOK INSIDE
- The AEH runs scalar instructions and routes memory requests from the AEs.
- 8 memory controllers enable parallel and pipelined access to memory.
- A 256 MB coherent cache serves memory requests from the coprocessor to host memory.
- It also has 4 FPGAs for user-defined Application Personalities!

Details of the C Code
[Diagram: sequential host processor, AEH, application engines AE1-AE4, memory banks MB1-MB3]
- The host processor populates the input matrices.
- The COP_CALL routine passes the base addresses to the coprocessor.
- Memory is allocated for array 1 from mem_base 1, for array 2 from mem_base 2, and for the result from mem_base 3.

Details of the Assembly Code
[Diagram: AEH, AE1, application engine registers AEG1-AEG31, memory banks MB1-MB3, main memory]
- Assembly is used to move the base addresses to the Application Engine (AEG) registers.
- Logical operations: AND, OR, XOR.
- Arithmetic operations: multiplication, addition.
- Complex calculations involving vectors can be done without writing VHDL code.

MEMORY INTERFACING
[Diagram: our module issues address and data requests (A0/A1, D0/D1) to main memory through memory controller MC0; a 256-entry reorder queue (ROQ0, IDs 0-255) tags each request with an ID and matches returning data to its request via the pop and data-valid signals]

IMPLEMENTATION
- The master control gives the base address and generates 21 load signals; the address decoder routes them (e.g. 0X454C…..400, 0X454C…..040) to the MCs, ROQs and memory.
- On start read, the data read engine waits for data valid; in this way, we do 21 reads from the data bus into the input buffers, then raises read complete / load complete.
- We now have the required inputs for SMVM: on start SMVM, the engine processes and gives a done signal.
- On start write, in this way, we write all 11 outputs from the output buffer to memory.
- One cycle of computation is complete!

Simulation Results - Co-Processor Instruction Execution
- Six move instructions are decoded; the base address and size values are moved to internal registers.
- The CAEP instruction is decoded, which starts the custom personality.

Simulation Results – Load Request to MC
- The read procedure starts, and 21 data load requests are sent.
- For each request, the address is checked and decoded, an ID from the ROQ is appended, and the load request is sent to the respective MC.
- The read process starts after the requests have been sent to the MCs.

Simulation Results – Receive Data from MC through ROQ
- The load process starts when valid data is available at the MCs.
- Data is read sequentially from MC0 – MC1 – MC2.
- The load process is done after receiving 21 data inputs; the next read starts (if there is nothing to write).

Simulation Results – Write Back Results from SpMV Engine using MC
- The write starts when valid data is received from the SpMV engine.
- The address is decoded and store requests are sent to the respective MCs.
- The write process is done after 11 store operations, and the next read cycle starts.

FUTURE SCOPE
- Increasing memory bandwidth
- Partitioning SMVM calculations across the four Application Engines