Backprojection and Synthetic Aperture Radar Processing on a HHPC Albert Conti, Ben Cordes, Prof. Miriam Leeser, Prof. Eric Miller

Presentation transcript:

Abstract
Synthetic Aperture Radar (SAR) is a process by which high-resolution images can be formed by processing a series of radar reflections taken by a single transceiver. Backprojection is one method for post-processing these reflections; it is a highly parallel algorithm, which makes it well suited to translation into hardware. This poster explores the difficulties involved in achieving maximum speedup from a hardware implementation on a parallel computing system, including memory bandwidth, communication bottlenecks, and others.

What is SAR?
SAR: Synthetic Aperture Radar
- The aperture (the width of the radar dish) directly affects the resolution of the image
- Many radar pulses are taken and processed
- The aperture is synthetically increased by accumulating the results
- 'Stripmap' and 'Spotlight' modes
For more detail: Soumekh, M., "Synthetic Aperture Radar Signal Processing with MATLAB Algorithms", ISBN.

Stripmap Mode SAR
- The plane flies past the target in a straight line
- Multiple radar pulses are taken at a right angle to the flight path
- Each pulse covers some portion of the target area

What is Backprojection?
- SAR output: an array of radar response 'projections'
- Filter out the physical effects of the radar
- Correlate pixels to time, and index into the projection data
- Interpolate between indices to increase accuracy

Previous Work
- Medical imaging
- Spotlight mode with backprojection
- Used the Annapolis FireBird board, a precursor to the WildStar II board
- 65 MHz clock, 16-way pipeline

Previous Work: Results

Platform                         Runtime   Speedup
1 GHz Pentium, floating-point    94 s      1.0x
1 GHz Pentium, fixed-point       28 s      3.4x
50 MHz WildStar I (1-way)        5.37 s    17x
65 MHz FireBird (1-way)          4.13 s    23x
50 MHz WildStar I (4-way)        1.34 s    70x
65 MHz FireBird (16-way)         0.26 s    360x

For more detail: Haiqian Yu, "Memory Architecture for Data Intensive Image Processing Algorithms in Reconfigurable Hardware", Master's Thesis, Northeastern University, Boston, MA.

HHPC Architecture
- 48-node Beowulf cluster
- Dual 2.2 GHz Xeons running Linux
- Annapolis MicroSystems WildStar II FPGA boards
- Champ LVDS systolic interconnect
- Gigabit Ethernet cards
- Myrinet MPI cards
Funded by the DOD High Performance Computing Modernization Program, Grant #PET SIP-K.

Exploiting Parallelism
- Parallel operations can provide performance gains
- Data dependencies reduce parallelism
- Few dependencies exist in SAR/backprojection

Coarse-grained Parallelism
- Several projections are processed on each system
- Size and available space determine the ratio of projections per board

Fine-grained Parallelism
- The work for each set of projections can be pipelined and parallelized
- Memory bandwidth determines the number of parallel pipelines

Performance
The hybrid implementation (described under Hybrid Implementation below) achieved a 40x speedup over a software solution using a single node of the HHPC, i.e., with no coarse-grained parallelism. Preliminary results from the parallel version of the hybrid implementation show a drastic speedup in the processing stage of the algorithm, but a slowdown in the reconstruction of the final image due to inter-process communication. Work is currently being done to analyze the optimal number of processing nodes for reconstructing images most efficiently on the HHPC.

Future Optimizations
- Overlap processing and communication to make use of the inherent communication latency (see the sketch after this list)
- Overlap file I/O and communication to minimize end-to-end run time
- Utilize processing nodes for intermediate merging
- Stagger processing stages to avoid communication collisions
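To make the first optimization concrete, the following is a minimal sketch of overlapping processing with communication by double-buffering chunks of projection data with non-blocking MPI receives. The chunk size, the number of chunks, the message tag, and the process_chunk() stand-in are illustrative assumptions, not the HHPC application's actual code.

```c
/* Hedged sketch: overlap backprojection of the current chunk with receipt of
 * the next chunk using double buffering and non-blocking MPI receives.
 * Sizes, tags, and process_chunk() are illustrative placeholders. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_FLOATS (64 * 4096)   /* assumed: 64 projections of 4096 samples each */

/* stand-in for backprojecting one chunk (offloaded to the FPGA on the HHPC) */
static void process_chunk(const float *chunk, int nfloats) { (void)chunk; (void)nfloats; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 2) {                      /* keep the demo to one sender, one worker */
        if (rank == 0) fprintf(stderr, "run with exactly 2 MPI ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int NCHUNKS = 16;                 /* assumed chunks of projection data */
    float *buf[2] = { calloc(CHUNK_FLOATS, sizeof(float)),
                      calloc(CHUNK_FLOATS, sizeof(float)) };
    MPI_Request req;

    if (rank == 1) {                        /* worker: overlap receive and compute */
        MPI_Irecv(buf[0], CHUNK_FLOATS, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
        for (int c = 0; c < NCHUNKS; c++) {
            MPI_Wait(&req, MPI_STATUS_IGNORE);          /* chunk c has arrived */
            if (c + 1 < NCHUNKS)                        /* start receiving chunk c+1 ... */
                MPI_Irecv(buf[(c + 1) % 2], CHUNK_FLOATS, MPI_FLOAT, 0, 0,
                          MPI_COMM_WORLD, &req);
            process_chunk(buf[c % 2], CHUNK_FLOATS);    /* ... while chunk c is processed */
        }
    } else {                                /* root: stream the chunks out */
        for (int c = 0; c < NCHUNKS; c++)
            MPI_Send(buf[c % 2], CHUNK_FLOATS, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    }

    free(buf[0]);
    free(buf[1]);
    MPI_Finalize();
    return 0;
}
```

While the worker processes chunk c, the pre-posted receive for chunk c+1 proceeds in the background, hiding part of the communication latency behind computation.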
Hybrid Implementation
[Figure: five-stage parallel flow. Processor nodes are PC/FPGA pairs; 1. input data loaded from disc, 2. data broadcast to the backprojection processor nodes, 3. parallel processing, 4. target images merged, 5. aggregate image stored to disc.]
In stage 1, data from the separate projections are fetched from storage and made ready for distribution among the processing nodes. In stage 2, the projection data is broadcast to all of the processor nodes via Myrinet; processor nodes listen for and accept the data that contributes to their respective sections of the target area. In stage 3, distinct regions of the target area are reconstructed in parallel. In stage 4, the smaller regions generated in stage 3 are merged to form the final target image. In stage 5, the final image is stored on disc. (A minimal MPI sketch of this five-stage flow is given at the end of this transcript.)
[Figure: FPGA datapath, comprising a swath LUT, a PCI staging BRAM, an input BRAM, and two target memories (Target Memory 1 and Target Memory 2).]
This work was supported in part by CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the National Science Foundation (Award Number EEC ).
[Figure: CenSSIS research levels and thrusts diagram, with Bio-Med and Enviro-Civil validating test beds.]
Serial computing methods are slow and cannot take advantage of the inherent parallelism of the algorithm for processing SAR data. This work is focused on developing a high-speed computation engine that will enable image reconstruction in a small fraction of the time possible with serial computing.
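As referenced in the stage descriptions above, the following is a minimal MPI sketch of the five-stage flow. The file names, array sizes, and equal-sized region decomposition are illustrative assumptions; the real system filters the broadcast data on each node and offloads the stage-3 backprojection to the node's WildStar II FPGA, neither of which is shown here.

```c
/* Hedged sketch of the five-stage flow on the host side, using MPI collectives.
 * File names, sizes, and the region decomposition are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int NPROJ = 1024, NSAMP = 4096;   /* assumed pulses and samples per pulse */
    const int REGION = 512 * 512;           /* assumed pixels per node's sub-image */
    float *proj  = calloc((size_t)NPROJ * NSAMP, sizeof(float));
    float *local = calloc(REGION, sizeof(float));

    /* Stage 1: the root node loads the projection data from disc */
    if (rank == 0) {
        FILE *f = fopen("projections.dat", "rb");   /* hypothetical input file */
        if (f) { fread(proj, sizeof(float), (size_t)NPROJ * NSAMP, f); fclose(f); }
    }

    /* Stage 2: broadcast the projection data to all processor nodes */
    MPI_Bcast(proj, NPROJ * NSAMP, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Stage 3: each node reconstructs its own region of the target area.
     * On the HHPC this is the FPGA-accelerated backprojection; it is left
     * as a placeholder here. */

    /* Stage 4: merge the per-node sub-images into the final target image */
    float *full = (rank == 0) ? calloc((size_t)REGION * nprocs, sizeof(float)) : NULL;
    MPI_Gather(local, REGION, MPI_FLOAT, full, REGION, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Stage 5: the root node stores the aggregate image to disc */
    if (rank == 0) {
        FILE *f = fopen("target.img", "wb");        /* hypothetical output file */
        if (f) { fwrite(full, sizeof(float), (size_t)REGION * nprocs, f); fclose(f); }
        free(full);
    }

    free(proj);
    free(local);
    MPI_Finalize();
    return 0;
}
```

Run with, for example, mpirun -np 4 ./sar_stages (the binary name is arbitrary): each rank reconstructs one region and rank 0 gathers and writes the merged image.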