Optimization of Parallel Task Execution on the Adaptive Reconfigurable Group Organized Computing System
Presenter: Lev Kirischian
Department of Electrical and Computer Engineering, RYERSON Polytechnic University, Toronto, Ontario, CANADA

Application of parallel computing systems for data-flow tasks:
- Digital signal processing (DSP);
- High-performance control and data acquisition;
- Digital communication and broadcasting;
- Cryptography and data security;
- Process modeling and simulation.

Presentation of a data-flow task in the form of a data-flow graph: data flows from Data In to Data Out through a chain of macro-operators MO1 ... MOn, e.g. digital filtering, FFT, matrix scaling, etc.
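As an illustration only (not part of the original slides), such a data-flow task can be sketched in software as an ordered pipeline of macro-operators; the operator names and their NumPy-based bodies below are assumptions chosen for the example.

```python
# A minimal sketch of a data-flow task as a chain of macro-operators MO_1..MO_n.
# Operator choices (FIR filter, FFT, scaling) are illustrative assumptions.
import numpy as np

def fir_filter(x, taps=np.array([0.25, 0.5, 0.25])):
    """Macro-operator MO_1: simple FIR digital filter."""
    return np.convolve(x, taps, mode="same")

def fft(x):
    """Macro-operator MO_2: fast Fourier transform."""
    return np.fft.fft(x)

def scale(x, k=0.5):
    """Macro-operator MO_3: matrix/vector scaling."""
    return k * x

# Here the data-flow graph is a simple linear pipeline; a real task graph may branch and merge.
data_flow_graph = [fir_filter, fft, scale]

def run_task(data_in, graph):
    data = data_in
    for macro_op in graph:
        data = macro_op(data)
    return data

data_out = run_task(np.arange(8, dtype=float), data_flow_graph)
```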

Correspondence between the task and the computing system architecture:
- If the data-flow task is processed on a conventional SISD architecture, the processing time often cannot satisfy the specification requirements.
- If the task is processed on SIMD or MIMD architectures, the cost-effectiveness of these parallel computers strongly depends on the task algorithm or data structure.
- One possible solution to reach the required cost-performance is to develop a custom computing system whose architecture covers the data-flow graph of the task.

Limitations of custom computing systems with fixed architecture:
1. Decrease of performance if the task algorithm or data structure changes;
2. No possibility for further modernization;
3. High cost for multi-task or multi-mode custom computing systems.

One possible solution: reconfigurable parallel computing systems.
1. Ability for custom configuration of each processing (functional) unit for a specific macro-operator;
2. Ability for custom configuration of the information links between functional units.
These features allow hardware customization for any data-flow graph and reconfiguration when task processing is completed.

Example of an FPGA-based system with the architecture configured for the data-flow task.

Concept of the Group Processor in the reconfigurable computing system: a Group Processor (GP) is a group of computing resources dedicated to the task and configured to reflect the task requirements.

Group Processor life-cycle:
1. In the GP, the links and functional units are configured before task processing;
2. The GP performs the task as long as necessary, without interruption or time-sharing with any other task;
3. After task completion, all resources included in the GP can be reconfigured for any other task.

The concept of the Reconfigurable Group Organized computing system. Block diagram components: Host PC, Virtual Bus, Functional Units (FU), Reconfigurable Interface Modules (RIM), Data Stream I/O, input/output data bus, Configuration Bus.

Parallel processing of different tasks on separate Group Processors. Block diagram components: Group Processors GP1 (for Task 1), GP2 and GP3; functional units FU1–FU4; Virtual Bus; I/O; data-in/data-out streams #1–#3.

Concept of adaptation of the Group Processor architecture to the task. (Timing diagram: Multiplier, Adder, Filter, Memory and Data in over time steps T0, T1, T2.)
Architecture-to-task adaptation for the GP = selection of a resource configuration which:
- satisfies all requirements for task processing (e.g. performance, data throughput, reliability, etc.);
- requires minimal hardware (i.e. logic gates).

Virtual Hardware Objects – the resource base of the reconfigurable computing system.
For FPGA-based systems, all architecture components (resources) can be presented as Virtual Hardware Objects (VHOs) described in one of the hardware description languages (for example, VHDL or AHDL).
Each resource can be presented in different variants Ri,j, where i indicates the type of resource (adder, multiplier, interface module, etc.) and j indicates the variant of the resource presentation in the architecture (for example: 8-bit adder, 16-bit adder, etc.).
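To make the VHO idea concrete, here is a minimal software sketch of such a resource library; the field names, timings and gate counts are illustrative assumptions and are not taken from the ARGO system itself.

```python
# A sketch of a VHO library: each resource type i holds its variants R_{i,1}, R_{i,2}, ...
# (e.g. 8-bit vs. 16-bit adder), each with an estimated processing time and gate cost.
from dataclasses import dataclass

@dataclass
class VHOVariant:
    name: str          # e.g. "adder_8bit" (hypothetical identifier)
    time_ns: float     # estimated processing time of the variant
    gates: int         # hardware cost in logic gates

# vho_library[resource_type] = list of variants of that resource type
vho_library = {
    "adder":      [VHOVariant("adder_8bit", 40.0, 200),
                   VHOVariant("adder_16bit", 20.0, 450)],
    "multiplier": [VHOVariant("mult_serial", 80.0, 300),
                   VHOVariant("mult_array", 40.0, 900),
                   VHOVariant("mult_pipelined", 20.0, 1600)],
}
```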

Concept of the Architecture Configuration Graph (ACG). (Diagram: graph nodes for the resource types and their variants, e.g. Adder, Multiplier, Adder, Bus.)

Architecture Configuration Graph arrangement. Partial arrangement of the architecture graph requires two procedures:
1. Local arrangement of the variants for each type of system resource (e.g. ordering the adder variants by processing time, 40 ns and 20 ns);
2. Hierarchical arrangement of the resource types.

Hierarchical arrangement of system resources. (Diagram: locally arranged variants of the Multiplier and Adder with processing times of 20 ns, 40 ns and 80 ns.)
Arrangement criterion: K(Ri) = [Tmax(Ri) - Tmin(Ri)] / (mi - 1), where mi is the number of variants of resource type Ri.
Example: K(Multiplier) = 30 > K(Adder), so the multiplier type is placed first in the hierarchical arrangement.
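A small sketch of this criterion follows; the variant timings are illustrative assumptions (the adder values are not explicitly readable from the slide), chosen so that K(Multiplier) = 30.

```python
# Arrangement criterion K(Ri) = [Tmax(Ri) - Tmin(Ri)] / (mi - 1), applied to
# assumed variant timings for two resource types.
variant_times_ns = {
    "multiplier": [80.0, 40.0, 20.0],   # m_i = 3 variants
    "adder":      [40.0, 20.0],         # m_i = 2 variants (assumed)
}

def arrangement_criterion(times):
    """K(R_i): spread of processing time per variant step."""
    return (max(times) - min(times)) / (len(times) - 1)

# Resource types are ordered by decreasing K, so the type whose variant choice
# affects processing time the most is decided first in the hierarchy.
ranked = sorted(variant_times_ns,
                key=lambda r: arrangement_criterion(variant_times_ns[r]),
                reverse=True)
print(ranked)   # ['multiplier', 'adder']: K(multiplier) = 30 > K(adder) = 20 with these numbers
```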

Selection of the Group Processor architecture based on the arranged ACG.
Required processing time for the task Y = A*X + B: T < 80 ns.
Selected GP architecture = Multiplier (#2) + Adder (#1).
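For comparison only, a brute-force sketch of what this selection has to achieve is given below; it is not the authors' procedure (which uses the arranged ACG to avoid full enumeration), the timings and gate counts are illustrative assumptions, the macro-operators are assumed to execute sequentially, and the 80 ns budget is treated as inclusive.

```python
# Exhaustive selection: among all variant combinations whose summed processing
# time fits the budget, keep the one with minimal hardware (gates).
from itertools import product

variants = {   # resource type -> list of (time_ns, gates) per variant (assumed values)
    "multiplier": [(80.0, 300), (40.0, 900), (20.0, 1600)],
    "adder":      [(40.0, 200), (20.0, 450)],
}
T_BUDGET_NS = 80.0

best_combo, best_gates = None, None
for combo in product(*variants.values()):
    total_time = sum(t for t, _ in combo)
    total_gates = sum(g for _, g in combo)
    if total_time <= T_BUDGET_NS and (best_gates is None or total_gates < best_gates):
        best_combo, best_gates = combo, total_gates

# With these assumed numbers the cheapest feasible choice is the multiplier
# variant #2 (40 ns) plus the adder variant #1 (40 ns), agreeing with the slide.
print(best_combo, best_gates)
```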

Number of experiments for GP-architecture selection:
N(GPopt) = (n + 1) + log2(m1 * m2 * ... * mn),
where n is the number of resources (VHOs) included in the architecture of the Group Processor and mi is the number of variants of each type of resource.
Example: if n = 16 and m1 = m2 = ... = mn = 32, the total number of experiments (task runs on the estimated GP architecture) is N(GPopt) = 17 + 16*5 = 97.
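As a quick check of the formula against the slide's example, a short illustrative computation:

```python
# N(GP_opt) = (n + 1) + log2(m1 * m2 * ... * mn) for the slide's example of
# n = 16 resources with 32 variants each.
import math

def experiments_needed(variant_counts):
    n = len(variant_counts)
    return (n + 1) + math.log2(math.prod(variant_counts))

print(experiments_needed([32] * 16))   # (16 + 1) + 16*5 = 97.0
```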

Self-adaptation mechanism for FPGA-based reconfigurable data-flow computing systems. Block diagram components: Data Source, Reconfigurable platform, Performance Analyzer, Host PC (with Architecture Generator, Architecture Selector and Library of Virtual Hardware Objects), Configuration Bus.
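A schematic sketch of such a loop is given below; it is an assumption-laden stand-in for the blocks in the diagram, not the authors' implementation, and the candidate configurations, throughput figures and requirement value are purely illustrative.

```python
# Self-adaptation loop sketch: the Architecture Generator proposes candidate
# configurations, the platform is configured over the Configuration Bus, the
# Performance Analyzer measures the run, and the Architecture Selector accepts
# the first candidate that meets the requirement.
REQUIRED_THROUGHPUT_MBPS = 50.0   # hypothetical requirement

candidates = [   # arranged by growing hardware cost (and speed); illustrative values
    {"name": "GP-small",  "gates": 1100, "throughput_mbps": 30.0},
    {"name": "GP-medium", "gates": 1800, "throughput_mbps": 55.0},
    {"name": "GP-large",  "gates": 2600, "throughput_mbps": 90.0},
]

def measure_performance(candidate):
    # Stand-in for the hardware Performance Analyzer: here we simply read the
    # modelled throughput of the candidate configuration.
    return candidate["throughput_mbps"]

selected = None
for candidate in candidates:          # generate -> configure -> measure -> select
    if measure_performance(candidate) >= REQUIRED_THROUGHPUT_MBPS:
        selected = candidate          # cheapest configuration that meets the spec
        break

print(selected)   # GP-medium with these illustrative numbers
```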

First prototype of the Adaptive Reconfigurable Group Organized (ARGO) computing platform.

Data-flow graph for DVB MPEG-2 processing: the input MPEG-2 data stream passes through synchro-signal detection, PCR detection, null-packet analysis and removal, output frequency adjustment, and PCR re-stamping (using a reference frequency) to produce the output MPEG-2 data stream.

Architecture selection time for the 6-mode DVB MPEG-2 stream processor:
1. Average time for each architecture configuration: ... ms;
2. Average time for GP-architecture selection (for a specific mode): ... ms;
3. Total time for architecture selections for all modes: ... s.

Hardware implementation of the DVB MPEG-2 stream processor for modes 1 and 4. Diagram components: FU #1 (8-bit in-port) and FU #2 (out-port) connected by a 16-line virtual bus; the MPEG-2 data-flow stages (synchro-signal detection, PCR detection, null-packet analysis and removal, output frequency adjustment, PCR re-stamping) are mapped onto these two functional units.

Hardware implementation of the DVB MPEG-2 stream processor for modes 2, 3, 5 and 6. Diagram components: FU #1 (8-bit in-port), FU #2, and FU #3 (out-port) connected by a 16-line virtual bus; the same MPEG-2 data-flow stages are mapped onto these three functional units.

Summary
1. The Adaptive Reconfigurable Group Organized (ARGO) parallel computing system is an FPGA-based configurable system with the ability to adapt to the task algorithm / data structure.
2. The ARGO system allows parallel processing of different data-flow tasks on dynamically configured Group Processors (GPs), where each GP architecture configuration corresponds to the algorithm / data specifics of the task assigned to that processor.
3. These principles allow the development of cost-effective parallel computing systems with programmable performance and reliability, at minimum cost of hardware components and development time.