Development in hardware – Why? Option: array of custom processing nodes Step 1: analyze the application and extract the component tasks Step 2: design.

Presentation transcript:

Development in hardware – Why?
Option: array of custom processing nodes
- Step 1: analyze the application and extract the component tasks
- Step 2: design the custom processors
- Step 3: program the FPGA
- Step 4: assign the tasks to the processors and set up the connection network
Biological counterparts annotated on the slide: ← Multi-cellular organization ← ??? ← Growth (cellular division)

Development in hardware – Why?
Step 2: as a function of the tasks, design one (or more) custom processors.
[Figure: processors built from ×, +, ÷, ≠, FFT and DCT blocks, with IN and OUT ports]

Cellular differentiation
- Cells adapt their physical structure to fit the “application”
- Can circuits/processors do the same?
  - Physically? No
  - Logically? Yes, but…
- Can they do it easily (dare we say, automatically)?

Cellular differentiation
Needed: adaptable cellular architecture
That is, a processor architecture that is
- Customizable
- Compact
- Powerful
- Easy to design and modify
- Amenable to evolution and learning
Possible solution: MOVE architectures

The MOVE paradigm
- One single instruction: move
- Data displacements trigger operations
- Architecture centered on data rather than on operations
- Regular structure: functional units + data network
- Scalable and modular architecture
Example: sum of two values
Conventional architecture:
    add R1, R2, R3
MOVE architecture:
    move O(Fxxx), I1(Fsum)
    move O(Fyyy), I2(Fsum)
    move O(Fsum), I(Fzzz)
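The sum example can be mimicked with a minimal Python sketch in which the only "instruction" is a `move`, and writing to a functional unit's input port is what fires the operation. The class names, the `read`/`write` methods, and the choice of `I2` as the port that triggers the addition are assumptions made for illustration; the slide does not say which port triggers.

```python
class Register:
    """Plain storage unit with one output (O) and one input (I)."""

    def __init__(self, value=0):
        self.value = value

    def read(self):
        return self.value

    def write(self, port, value):
        # A register has a single input port, so the port name is ignored.
        self.value = value


class Adder:
    """Functional unit Fsum: I1 latches an operand, I2 triggers the add.

    Which port triggers the operation is an assumption of this sketch.
    """

    def __init__(self):
        self._operand = 0
        self._result = 0

    def read(self):
        return self._result

    def write(self, port, value):
        if port == "I1":
            self._operand = value                    # latch only
        elif port == "I2":
            self._result = self._operand + value     # the move fires the add
        else:
            raise ValueError(f"unknown port {port!r}")


def move(source, destination, port="I"):
    """The single instruction: copy a unit's output to a unit's input port."""
    destination.write(port, source.read())


f_xxx, f_yyy, f_zzz = Register(3), Register(4), Register()
f_sum = Adder()

move(f_xxx, f_sum, "I1")   # move O(Fxxx), I1(Fsum)
move(f_yyy, f_sum, "I2")   # move O(Fyyy), I2(Fsum)
move(f_sum, f_zzz)         # move O(Fsum), I(Fzzz)
print(f_zzz.read())        # 7
```

Adding a new operation in this model means adding a new unit class; `move` itself stays unchanged, which is the property the customization argument relies on.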

Cellular differentiation
Main features:
- Conventional fetch/decode mechanism, compatible with bio-inspired mechanisms
- No pipeline: computation carried out in specialized functional units (FU)
- Communication carried out in specialized communication units (CU)
- Only one instruction that MOVEs data to and from the CUs and FUs (dataflow architecture)

Cellular differentiation
Main advantages:
- Can be easily customized by introducing application-specific functional and communication units.
- Perfectly fits the requirements of systolic arrays (arbitrarily complex communication patterns).
- The introduction of custom components does not affect the assembler language, the code structure, the fetch and decode units, or the transport bus.

Example – Automatic Synthesis
[Figure: phenotype layer, mapping layer and genotype layer, with the labels “application-specific (parallel) functions”, “developmental algorithm” and “genetic code”]

Example – Automatic Synthesis
[Figure: phenotype, mapping and genotype layers; totipotent cell]

Example – Automatic Synthesis
[Figure: totipotent cell; programmable logic]

Example – Automatic Synthesis
[Figure: programmable logic; cellular array]

Implementation - The BioWall

Development in hardware – Why?
Option: array of custom processing nodes
- Step 1: analyze the application and extract the component tasks
- Step 2: design the custom processors
- Step 3: program the FPGA
- Step 4: assign the tasks to the processors and set up the connection network
Biological counterparts annotated on the slide: ← Multi-cellular organization ← ??? ← Cell specialization ← Growth (cellular division)

Phenotype Layer
Cell design and specialization
Application code (parallel)
Within a MOVE framework, the specialization (differentiation) of a cell corresponds to the selection of the functional and communication units that can most efficiently implement the desired application.

FU extraction Extracting the optimal FUs from the code is a complex problem!

FU extraction
- How about having a quick peek at biology?
- Idea: let us use evolution!
- In fact, this approach is much closer to biology than simply evolving code: in nature, the hardware (the cell) and the software (the genome) have evolved together!


FU extraction First step: profiling the code (standard compilation technique)

FU extraction
- Second step: transform into a tree (standard compilation technique)
- Third step: represent as a 1-D genome
- Fourth step: run the GA (with some fancy optimizations)
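The slides do not say how the tree is turned into a 1-D genome, so the following Python sketch shows only one plausible encoding. It assumes that the code tree is a nested tuple `(operator, child, child, ...)`, that operator nodes are listed in pre-order, and that each gene is an integer where 0 means "stay in software" and k > 0 means "belongs to candidate hardware block k". All names (`flatten`, `random_genome`, `mutate`, `max_blocks`) are hypothetical.

```python
import random


def flatten(tree):
    """Pre-order list of the operator nodes of a nested-tuple expression tree."""
    if not isinstance(tree, tuple):
        return []                      # leaves (variables, constants) carry no gene
    op, *children = tree
    nodes = [op]
    for child in children:
        nodes.extend(flatten(child))
    return nodes


def random_genome(nodes, rng, max_blocks=2):
    """One gene per operator node: 0 = software, k > 0 = hardware block k."""
    return [rng.randint(0, max_blocks) for _ in nodes]


def mutate(genome, rng, max_blocks=2, rate=0.2):
    """Re-draw each gene with probability `rate` (plain GA point mutation)."""
    return [rng.randint(0, max_blocks) if rng.random() < rate else gene
            for gene in genome]


# (a * b) + (c * d)
tree = ("+", ("*", "a", "b"), ("*", "c", "d"))
nodes = flatten(tree)                  # ['+', '*', '*']
rng = random.Random(0)                 # seeded only to make runs repeatable
genome = random_genome(nodes, rng)
child = mutate(genome, rng)
print(nodes, genome, child)
```

A flat list like this lets standard GA operators (mutation, crossover) be applied directly, while `flatten` keeps a fixed correspondence between gene positions and tree nodes.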

Fitness evaluation
- s = size of the new processor
- t = execution time of the program on the new processor
- α = execution time of the program on a minimal processor
- β = hardware area to implement the minimal processor (which has, by definition, a fitness of 1)
- hwLimit = maximum hardware allowed to implement the new processor
Note:
- Relative fitness function
- When out of the allowed hardware range, logarithmic decrease
- The hardware investment has to be small enough to be retained
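The transcript lists the parameters of the fitness but does not preserve the formula itself. The Python sketch below is therefore NOT the authors' function; it is a hypothetical form chosen only to satisfy the stated properties: it is relative to the minimal processor (which scores exactly 1), and it decreases logarithmically once the size exceeds `hw_limit`. The division of speedup by relative area is an assumption.

```python
import math


def relative_fitness(s, t, alpha, beta, hw_limit):
    """Assumed form, not the slide's formula.

    s, t        : size and execution time on the new processor
    alpha, beta : execution time and area of the minimal processor
    hw_limit    : maximum hardware allowed for the new processor
    """
    gain = alpha / t              # speedup over the minimal processor
    cost = s / beta               # area relative to the minimal processor
    fitness = gain / cost
    if s > hw_limit:
        # Outside the allowed range: extra logarithmic penalty (assumed shape).
        fitness /= 1.0 + math.log(s / hw_limit)
    return fitness


minimal = relative_fitness(s=100, t=50, alpha=50, beta=100, hw_limit=120)
inside = relative_fitness(s=110, t=25, alpha=50, beta=100, hw_limit=120)
outside = relative_fitness(s=240, t=25, alpha=50, beta=100, hw_limit=120)
print(minimal, inside, outside)
```

With these made-up numbers, the minimal processor scores 1, a processor that is twice as fast for 10% more area scores higher, and the same speedup bought with an area beyond `hw_limit` scores lower.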

Determining hardware size
How can the size of the new FU be estimated (the β parameter of the fitness)?
The idea:
- Determine the size of each basic building block (+, -, AND, …)
- Compute how many of them are used for a new FU
- What to do with assignments or loops?
The characterization has to be done for every target platform.
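The counting idea can be sketched in a few lines of Python: look up a per-platform size for each basic block and sum over the operators used by the candidate FU. The sizes in `BLOCK_SIZE` are invented placeholder values (a real table would come from characterizing the target platform), and the nested-tuple tree format is an assumption. Assignments and loops, which the slide leaves open, are not handled.

```python
from collections import Counter

# Hypothetical per-platform characterization, in arbitrary area units.
BLOCK_SIZE = {"+": 32, "-": 32, "AND": 8, "*": 600}


def count_blocks(tree):
    """Count how many times each operator appears in a nested-tuple tree."""
    counts = Counter()
    if isinstance(tree, tuple):
        op, *children = tree
        counts[op] += 1
        for child in children:
            counts += count_blocks(child)
    return counts


def estimate_size(tree, block_size=BLOCK_SIZE):
    """Estimated FU size = sum over operators of (count x block size)."""
    return sum(block_size[op] * n for op, n in count_blocks(tree).items())


# Candidate FU computing (a * b) + (c AND d)
fu = ("+", ("*", "a", "b"), ("AND", "c", "d"))
print(estimate_size(fu))   # 32 + 600 + 8 = 640
```

Swapping in a different `block_size` table is all that is needed to retarget the estimate to another platform.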

Determining hardware execution time
Use the same idea as for size:
- Compute the time needed for each elementary function
- Take the targeted clock period as a basis
- When the estimated time > clock period, add 1 to the total time
→ small jumps in the fitness landscape
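One reading of this rule is that the delays of the elementary functions on a path are summed, and every time the sum crosses a multiple of the clock period one more cycle is charged. The Python sketch below implements that reading; the delay table `DELAY_NS`, the function name `estimate_cycles`, and the use of a simple operator chain are assumptions, not details from the slides.

```python
import math

# Hypothetical delays of elementary functions, in nanoseconds.
DELAY_NS = {"+": 4, "-": 4, "AND": 1, "*": 9}


def estimate_cycles(ops, clock_period_ns, delay_ns=DELAY_NS):
    """Cycles needed by a chain of operators executed back to back.

    The result is a step function of the total delay: it only changes
    when the total crosses a multiple of the clock period.
    """
    total = sum(delay_ns[op] for op in ops)
    return max(1, math.ceil(total / clock_period_ns))


print(estimate_cycles(["+", "AND", "+"], clock_period_ns=10))        # 9 ns  -> 1
print(estimate_cycles(["+", "AND", "+", "AND", "AND"], 10))          # 11 ns -> 2
```

Because the estimate moves in whole cycles, adding one small operator either changes nothing or costs a full cycle, which is where the small jumps in the fitness landscape come from.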

Pattern-matching optimization
How to find reusable FUs?
- The GA behaves a bit like random mutations → difficult to find reusability this way
- Help the GA a bit: each time a new HW block is defined, search the whole tree to replace similar pieces of code
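The whole-tree search can be illustrated with a short Python sketch. It assumes that "similar" means "same operator structure, whatever the operand names", which is only one possible definition; the nested-tuple tree format and the names `shape` and `find_matches` are likewise hypothetical.

```python
def shape(tree):
    """Operator structure of a tree, with every leaf abstracted to '_'."""
    if not isinstance(tree, tuple):
        return "_"
    op, *children = tree
    return (op, *[shape(child) for child in children])


def find_matches(tree, pattern, path=()):
    """Paths (child indices from the root) of all subtrees shaped like `pattern`."""
    matches = []
    if shape(tree) == shape(pattern):
        matches.append(path)
    if isinstance(tree, tuple):
        for index, child in enumerate(tree[1:]):
            matches.extend(find_matches(child, pattern, path + (index,)))
    return matches


# Program tree: (a * b) + ((c * d) - e)
program = ("+", ("*", "a", "b"), ("-", ("*", "c", "d"), "e"))
# Newly defined HW block: a two-input multiplier
new_block = ("*", "x", "y")

print(find_matches(program, new_block))   # [(0,), (1, 0)]
```

Each returned path identifies a piece of code that the new hardware block could replace, so a block discovered once by the GA is reused everywhere it fits.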

Non-optimal block pruning
- “Cleaning” phase performed at each step
- Removes HW blocks that are non-optimal from the fitness point of view
- To see if a block is useful, compute the fitness with and without this block implemented in HW. If the software solution has a better fitness, the block is non-optimal and can be removed.
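The with/without comparison can be written as a short loop. In the Python sketch below, `prune_blocks` takes any fitness function over a set of hardware blocks; the toy fitness, the block names, and the integer gain and cost figures are invented for illustration only.

```python
def prune_blocks(blocks, fitness):
    """Drop every HW block whose removal gives a strictly better fitness."""
    kept = set(blocks)
    for block in sorted(blocks):            # fixed order for repeatability
        without = kept - {block}
        if fitness(without) > fitness(kept):
            kept = without                  # the software version wins
    return kept


# Hypothetical figures: time gained and area paid per block (arbitrary units).
GAIN = {"mac": 80, "shift": 5, "cmp": 30}
COST = {"mac": 20, "shift": 10, "cmp": 10}


def toy_fitness(blocks):
    """Made-up fitness: baseline of 100 plus net benefit of each HW block."""
    return 100 + sum(GAIN[b] - COST[b] for b in blocks)


print(sorted(prune_blocks({"mac", "shift", "cmp"}, toy_fitness)))   # ['cmp', 'mac']
```

Here the `shift` block costs more than it gains, so the fitness is higher without it and the cleaning phase removes it.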

FU extraction - Interface STANDARDDOMAIN

FU extraction - Results
Example (functions from the FACT factorization algorithm):
- Hardware increase (estimated): 10% (fixed)
- Speedup (estimated): 2.27 (227%)
Other results:
All were obtained in a few seconds