Scalable Multi-core Sonar Beamforming with Computational Process Networks


John F. Bridgman, III, Gregory E. Allen and Brian L. Evans
Applied Research Laboratories and Dept. of Electrical and Computer Engineering
The University of Texas at Austin, Austin, Texas

Motivation
Sonar beamforming requires significant computation and input/output
Beamforming is traditionally done with custom hardware
We would like to use inexpensive commodity computer hardware
To achieve real-time performance, a parallel implementation is required
OpenMP and other fork-and-join models do not scale as well as we would like
We use Computational Process Networks for more scalability
This allows more efficient use of current multi-core computer hardware

Computational Process Networks (CPN)
Kahn Process Networks are a formal model of concurrency
This model provides provable deterministic behavior, but its queues are unbounded
Processes and queues are represented by a directed graph
The directed graph is similar to the block diagram of the system
CPN is a model and framework for high-throughput signal processing
CPN uses Parks’ bounded scheduling of process networks
CPN has enhancements for high performance: multi-token transactions, multi-channel queues and firing thresholds
The CPN framework exploits both SMP and cluster parallelism
CPN available at

Sonar Beamforming
A beamformer is a spatial filter that steers an array in a desired direction
Beamforming is often implemented as a weighted delay-and-sum of sensors
Delays correspond to each sensor's distance to a plane perpendicular to the steering direction
This array is cylindrical with 12 vertical elements at each horizontal position
There are 256 horizontal positions regularly spaced around a circle
The horizontal gaps provide space for mechanical structures
[Figure: Top view of half the array, with projections onto a plane for steering]
[Figure: Simulated beam pattern]

Algorithm
Inputs to the beamformer are complex basebanded 16-bit elements
The beamformer is separated into vertical and horizontal components
The vertical beamformer produces three sets of vertical output beams
The vertical beamformer is implemented as a four-tap FIR filter
Three horizontal beamformers concurrently produce the final beam output
The horizontal beamformer uses circular convolution with an FFT
Geometric symmetry is exploited to reduce the number of calculations
[Figure: Beamformer block diagram]
[Figure: Calculation for vertical beamformer]
[Figure: Calculation for horizontal beamformer]
[Figure: Steps of the horizontal beamformer]

Implementation
The horizontal kernel uses FFTW; both the horizontal and vertical kernels use SSE3
Each kernel uses OpenMP internally for data parallelism
We run tests on 2.4 GHz dual quad-core Intel Nehalem processors with Hyper-Threading
We use Red Hat Enterprise Linux Server 5.5 and GCC
We enable an increasing number of cores to evaluate scalability for several cases
OpenMP provides “active” (busy wait, the default) and “passive” (OS assisted) waiting
We compare the system composed with OpenMP to the system composed with CPN
We measure throughput in samples per second of the entire system
[Figure: Beamformer realization in CPN]

Results
Default OpenMP settings (“active”) hinder performance in both cases
The plateau is caused by the transition to Hyper-Threaded cores
The CPN version is 13.2% faster than the OpenMP-only version at 8 cores
At the peak, the CPN version operates at 27.3 GFLOPS
The CPN framework increases beamformer scalability and performance
The CPN framework can trivially provide a distributed implementation
[Figure: Average throughput versus number of cores]

This work was supported by the Independent Research and Development Program at Applied Research Laboratories: The University of Texas at Austin.
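
Example: circular convolution with an FFT
The horizontal beamformer is described as a circular convolution computed with an FFT over the 256 regularly spaced horizontal positions. The sketch below illustrates only that general technique. It is not the authors' kernel: the poster uses FFTW and SSE3, whereas this uses a small pure-Python radix-2 FFT, and the sensor samples and weights are random placeholders rather than real steering weights. All function and variable names are hypothetical. It checks the FFT result against a direct O(N^2) circular convolution.

```python
import cmath
import random


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_convolve_fft(a, b):
    """Circular convolution via pointwise multiplication of spectra."""
    n = len(a)
    spectrum = [p * q for p, q in zip(fft(a), fft(b))]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v / n for v in fft(spectrum, inverse=True)]


def circular_convolve_direct(a, b):
    """Reference O(N^2) circular convolution: y[k] = sum_m a[m] * b[(k - m) mod N]."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


# Placeholder data: one complex sample and one complex weight per horizontal
# position. The count of 256 comes from the poster; the values are made up.
N_POSITIONS = 256
rng = random.Random(0)
samples = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(N_POSITIONS)]
weights = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(N_POSITIONS)]

fast = circular_convolve_fft(samples, weights)
slow = circular_convolve_direct(samples, weights)
max_error = max(abs(p - q) for p, q in zip(fast, slow))
print("largest difference between FFT and direct results:", max_error)
```

The FFT route needs on the order of N log N operations instead of N^2, which is the usual reason to prefer it when one output is wanted per position around the circle.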
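
Example: a process network with bounded queues
The CPN section describes processes connected by queues in a directed graph, with deterministic behavior and bounded scheduling. The sketch below shows the basic idea with three processes in a pipeline, using Python threads and the standard `queue.Queue`. It does not use the CPN framework or its API. The process names, the `DONE` end-of-stream marker, and the queue capacity are assumptions made for illustration. It also omits the part of Parks' bounded scheduling that enlarges a queue when a full queue causes an artificial deadlock, which a simple pipeline like this one cannot encounter.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, not part of any framework


def source(out_q, samples):
    """Process with no inputs: writes each sample, then signals end of stream."""
    for s in samples:
        out_q.put(s)  # blocks while the bounded queue is full
    out_q.put(DONE)


def scale(in_q, out_q, gain):
    """Process that multiplies every token by a constant gain."""
    while True:
        token = in_q.get()  # blocking read is the only way to consume input
        if token is DONE:
            out_q.put(DONE)
            return
        out_q.put(gain * token)


def sink(in_q, results):
    """Process with no outputs: collects tokens in arrival order."""
    while True:
        token = in_q.get()
        if token is DONE:
            return
        results.append(token)


def run_network(samples, gain=2, capacity=4):
    """Build the graph source -> scale -> sink and run it to completion."""
    q1 = queue.Queue(maxsize=capacity)
    q2 = queue.Queue(maxsize=capacity)
    results = []
    threads = [
        threading.Thread(target=source, args=(q1, samples)),
        threading.Thread(target=scale, args=(q1, q2, gain)),
        threading.Thread(target=sink, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


output = run_network(list(range(10)), gain=3, capacity=1)
print(output)
```

Because each process only reads from and writes to FIFO queues with blocking operations, the output sequence does not depend on how the threads are scheduled or on the queue capacity. Only throughput and memory use change.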