ESC499 – A TMD-MPI/MPE-Based Heterogeneous Video System. Tony Zhou, Prof. Paul Chow. April 6th, 2010.


ESC499 – EngSci Thesis

Background
- Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another.
- An API is an abstraction that defines and describes an interface for interacting with a set of functions.
- MPI has become a de facto standard for communication among processes that model a parallel program running on a distributed-memory system.

Prof. Paul Chow's Research
- Hardware systems are better suited to parallel processing, and the reconfigurable nature of FPGAs makes it easy to design hardware computing engines (CEs).
- Similar to what MPI provides to software developers, TMD-MPI provides software and hardware middleware layers of abstraction for communication, enabling portable interaction between embedded processors, CEs, and x86 processors.

Filling the Gap and Defining the Scope
- TMD-MPI research is still in its infancy compared with the MPI standard; implementations and characterizations of designs are lacking.
- This project attempts to fill this gap by investigating alternative approaches to presenting hardware and software elements.
- If a simple, feasible heterogeneous system is successfully demonstrated, the thesis will then focus on expanding the network of software elements to exploit more parallelism.
[Figure: alternating software and hardware elements]

Objectives
- The goal is to create a heterogeneous video processing system that demonstrates TMD-MPI's capabilities as the interface between CEs and software processes.
- The system is called heterogeneous because it combines hardware engines and software processes.
- Implement and characterize different configurations of the system.

Research and Groundwork
- Manuel Saldaña's paper "A Parallel Programming Model for a Multi-FPGA Multiprocessor Machine":
  - TMD-MPI Library v1.0: a software MPI interface designed for the Xilinx Microblaze.
  - TMD-MPE v1.0: a hardware implementation of the send and receive commands of the TMD-MPI library.
- Jeff Goeder's project:
  - Video system groundwork: streams video from the VGA port to external memory and then to DVI-out, through MPE-to-MPE message passing.

System Block Diagram

High-Level Implementation
The primary goal is functionality rather than performance. Setting speed and performance considerations aside, two high-level approaches can be adopted:
- Distributed memory: each node has its own local memory, and the entire video is passed as a continuous stream of messages. Network traffic: (640 × 480 px) × (32 bit/px) = 1200 KB per frame.
- Shared memory: all nodes share one memory, and only a pointer to the video in memory is passed. Network traffic: one 32-bit (4 B) memory address.
[Figure: computing engines, each with local memory, versus computing engines 1 and 2 attached to a shared memory for all devices]

Distributed Memory
Distributed-memory, video-streaming approach.
[Figure: single-frame and multi-frame examples with video input (1–10 MHz), Xilinx FSL (FIFO) buffers of entries 1…n, a Xilinx Microblaze (V-Dec), and DVI out]
- Speed issue: the Microblaze cannot pull data off the FIFO fast enough, due to several factors.

Microblaze and PLB Bus Traffic
[Figure: video input (1–10 MHz) and FIFO buffers of entries 1…n around the Microblaze and its PLB bus traffic]
- First, the Xilinx FSL (FIFO) interface access time.
- Second, memory access time and bus arbitration.
- Third, the implicitly sequential execution of instructions in a conventional processor.
The Microblaze runs at 100 MHz; however, its speed is limited by these other factors.

Shared Memory
Shared-memory, address-mapped tasks. Single-frame example:
- Only 32-bit memory addresses are passed as messages between ranks, a significant reduction in network traffic (before: 640 × 480 × 32 bits per frame).
- Multiple Microblazes run in parallel.

Inside the memory:
- Each Microblaze is assigned a different region of the common memory space.
- Each Microblaze can run its own codec (e.g., the one on the left of the figure) or the same one.
- Each Microblaze then writes its own section of the frame into the corresponding place in the DVI-out memory space.

Why Software and Why Hardware; Results and Conclusion
- The TMD-MPI approach to heterogeneous systems proved to be easy and efficient for development.
- The shared-memory approach significantly improves speed and is linearly scalable.

                 Software    Hardware
Functionality    Very good   Bad
Performance      Slow        Very fast
Development      Fast        Slow
Cost             -           -

- Suggestion: a software-to-hardware design flow, since TMD-MPI/MPE abstracts interface complexities away from the developer.

THANKS AND Q&A Acknowledgements: Professor Paul Chow, Sami Sadaka, Kevin Lam, Kam Pui Tang, Manuel Saldana