Image Stitching for Optical Microscopy

Presentation transcript:

Image Stitching for Optical Microscopy
Timothy Blattner, Bertrand Stivalet, Walid Keyrouz
Shujia Zhou
IAB 2013-6-13

Image Stitching for Optical Microscopy: Objectives
- Objectives
  - Stitching of optical microscopy images at interactive rates
  - General-purpose library, ImageJ/Fiji plug-in, etc.
- Success criterion
  - Transformative impact
  - Run sample problem in < 1 min
  - > 10x speed improvement

Credits
- Joe Chalfoun, Mary Brady (NIST)

Image Stitching Problem
- Optical microscopes scan a plate and take overlapping partial images (tiles)
- The image tiles must be assembled into one large image
- Modern microscopy is automated: scientists acquire and process large sets of images

Image Stitching Problem (continued)
- Two phases:
  - Phase I: compute the X & Y translations for all tiles
  - Phase II: apply the translations and compose the stitched image
- Main focus is on phase I
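The slides concentrate on phase I, but phase II is easy to picture. The following is a minimal sketch, assuming the translations have already been accumulated into absolute (y, x) positions for each tile; the function name `compose` and the overwrite-on-overlap policy are illustrative assumptions, not the authors' method.

```python
def compose(tiles, positions, tile_h, tile_w):
    """Paste equally sized tiles onto one canvas.

    tiles:     list of 2D lists (tile_h rows of tile_w pixels each)
    positions: list of absolute (y, x) top-left corners, one per tile
    Overlapping pixels are simply overwritten by the later tile
    (an assumption made for brevity; real stitchers usually blend).
    """
    height = max(y for y, _ in positions) + tile_h
    width = max(x for _, x in positions) + tile_w
    canvas = [[0] * width for _ in range(height)]
    for tile, (y0, x0) in zip(tiles, positions):
        for r in range(tile_h):
            # Same-length slice assignment keeps the row width unchanged.
            canvas[y0 + r][x0:x0 + tile_w] = tile[r]
    return canvas


# Two 2x3 tiles that overlap by one column.
left = [[1, 1, 1], [1, 1, 1]]
right = [[2, 2, 2], [2, 2, 2]]
stitched = compose([left, right], [(0, 0), (0, 2)], tile_h=2, tile_w=3)
```

Here the stitched canvas is 2 rows by 5 columns, and the shared column takes its values from the right-hand tile.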

Image Stitching Algorithm
- Loop over all images:
  - Read an image tile
  - Compute its FFT-2D
  - Compute correlation coefficients with west and north neighbors
    - Depends on FFT-2D for each tile
- Major compute portions:
  - FFT-2D of tiles
  - Compute and normalize phase correlation
  - Inverse FFT-2D
  - Reduce max normalize
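The major compute portions listed above (forward FFT-2D, normalized phase correlation, inverse FFT-2D, max reduction) can be sketched for a single pair of tiles. This is a simplified illustration using NumPy, assuming the two tiles differ by a pure circular shift; the function name `phase_correlation_shift` is hypothetical, and the sketch omits the later step of checking candidate translations against the actual overlap region.

```python
import numpy as np


def phase_correlation_shift(a, b):
    """Estimate the circular (dy, dx) shift that maps tile a onto tile b."""
    fa = np.fft.fft2(a)                    # forward FFT-2D of each tile
    fb = np.fft.fft2(b)
    cross = np.conj(fa) * fb               # cross-power spectrum
    magnitude = np.abs(cross)
    magnitude[magnitude == 0] = 1.0        # guard against division by zero
    normalized = cross / magnitude         # keep only the phase
    surface = np.fft.ifft2(normalized).real  # inverse FFT-2D
    peak = np.unravel_index(np.argmax(surface), surface.shape)  # max reduction
    return int(peak[0]), int(peak[1])


# Synthetic check: shift a random tile by 3 rows and 5 columns.
rng = np.random.default_rng(0)
tile = rng.random((16, 16))
shifted = np.roll(tile, shift=(3, 5), axis=(0, 1))
estimate = phase_correlation_shift(tile, shifted)
```

Because the FFT is periodic, a peak index larger than half the tile size corresponds to a negative shift; real tiles overlap only partially, so the peak is a candidate that still has to be verified.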

Algorithm's Parallel Characteristics
- Almost embarrassingly parallel
  - Large number of independent computations
- For an n x m grid:
  - FFT for all images: n x m
  - NCC for all image pairs: 2(n x m) - n - m
  - Inverse FFT for the NCCs of all image pairs: 2(n x m) - n - m
  - …
- Caveats
  - Data dependencies on the FFTs
  - Limited memory
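The pair count follows from counting neighbors: in a grid with n rows and m columns, n(m - 1) tiles have a west neighbor and (n - 1)m tiles have a north neighbor, which sums to 2nm - n - m. A small sketch of that arithmetic (the function name `stitching_work` is made up for illustration):

```python
def stitching_work(n, m):
    """Count the independent operations for an n x m grid of tiles."""
    west_pairs = n * (m - 1)      # every tile except those in the first column
    north_pairs = (n - 1) * m     # every tile except those in the first row
    pairs = west_pairs + north_pairs
    assert pairs == 2 * n * m - n - m
    return {"forward_ffts": n * m, "ncc_pairs": pairs, "inverse_ffts": pairs}


work = stitching_work(59, 42)
```

For a 59x42 grid this gives 2478 forward FFTs and 4855 pairwise correlations, each with its own inverse FFT.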

Data Set
- Grid of 59x42 images (2478 tiles)
- 1392x1040 16-bit grayscale images (2.8 MB per image)
- ~7 GB total
- Source: Kiran Bhadriraju (NIST)
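The sizes quoted above can be checked with a few lines of arithmetic. The last two lines go one step further and estimate the footprint if every tile's transform were kept in memory at once; that estimate assumes a full-size double-precision complex transform (16 bytes per element), which the slides do not state.

```python
rows, cols = 59, 42
width, height = 1392, 1040
bytes_per_pixel = 2                      # 16-bit grayscale

tile_count = rows * cols
tile_bytes = width * height * bytes_per_pixel
dataset_bytes = tile_count * tile_bytes

tile_mib = tile_bytes / 2**20            # about 2.76 MiB per tile
dataset_gb = dataset_bytes / 1e9         # about 7.2 GB in total

# Assumption: one complex128 value (16 bytes) per pixel for each tile's FFT.
fft_bytes_per_tile = width * height * 16
all_ffts_gb = tile_count * fft_bytes_per_tile / 1e9   # about 57 GB
```

Under that assumption, holding all transforms at once would need far more than the 24 GB of RAM on the evaluation machine, which is one way to read the "limited memory" caveat.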

Evaluation Platform
- Hardware
  - Dual Intel® Xeon® E5620 CPUs (quad-core, 2.4 GHz, hyper-threading)
  - 24 GB RAM
  - Dual NVIDIA® Tesla™ C2070 cards
- Reference implementations
  - Fiji™ Stitching plugin: > 3.6 hours
  - MATLAB® prototype: ~17.5 minutes on a similar machine
- Software
  - Ubuntu Linux 12.04/x86_64, kernel 3.2.0
  - libc6 2.15, libstdc++6 4.6
  - Boost 1.48, FFTW 3.3, libTIFF4
  - NVIDIA CUDA & CUFFT 5.0

Implementations & Results (FFTW Exhaustive, CUDA 5.0)

Implementation            Time            Speedup  CPU Threads  GPUs
C++ Sequential            10 min 37 sec            1
Simple Multi-Threaded     1 min 48 sec    5.8x     8
Pipelined Multi-Threaded  1 min 22 sec    7.7x     19
Simple GPU                9 min 47 sec    1.08x
Pipelined-Hybrid          25 sec          25.5x    13           2
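The slides do not describe how the pipelined versions are organized internally. As a generic illustration of the idea only, the sketch below chains stages (for example read, FFT, correlate) with bounded queues so that each stage runs in its own thread; all names (`stage`, `run_pipeline`, `depth`) are hypothetical, and the toy workers stand in for real image operations.

```python
import queue
import threading

SENTINEL = None  # marks end of stream; real items must not be None


def stage(worker, inbox, outbox):
    """Apply worker to every item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)   # propagate shutdown to the next stage
            break
        outbox.put(worker(item))


def run_pipeline(items, workers, depth=4):
    """Push items through the workers, one thread per stage."""
    # Bounded queues cap how many in-flight items each stage may hold.
    queues = [queue.Queue(maxsize=depth) for _ in range(len(workers) + 1)]
    threads = [
        threading.Thread(target=stage, args=(w, queues[i], queues[i + 1]))
        for i, w in enumerate(workers)
    ]

    def feed():
        for item in items:
            queues[0].put(item)
        queues[0].put(SENTINEL)

    # Feeding from its own thread lets the caller drain results concurrently.
    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Toy stand-ins for "read tile" and "compute FFT".
outputs = run_pipeline(range(10), [lambda i: i * 2, lambda v: v + 1])
```

With one thread per stage the output order matches the input order. The bounded queues are the part that relates to the limited-memory caveat: they prevent a fast reader from piling up tiles faster than later stages consume them.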

Java Implementation
- Allows easy integration into Fiji
  - Tool used by many biologists for image stitching
- Pure Java code is extremely slow
  - FFT computations
  - Cross correlation
- Use JNI with FFTW and C code
  - Java Native Interface
  - Allows calling native functions that run outside the virtual machine
  - Requires compilation (gcc)

Java Implementation Runtimes (42x59 tiles)

Implementation       Runtime        Threads
Sequential           > 4 hours      1
Sequential with JNI  ~30 minutes
Pipelined with JNI   3 min 42 sec   16

Closure: General
- 25x speedup compared to sequential C++ code
- 518x speedup compared to Fiji stitching plugin
- Representative data set (42x59 grid): ~25 sec
- Can budget compute time to:
  - Generate stitched image
  - Carry out additional analysis
- Enables computationally steerable experiments
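Both headline speedups follow directly from the timings reported earlier in the talk (10 min 37 sec for sequential C++, more than 3.6 hours for the Fiji plugin, 25 sec for the pipelined hybrid). A quick check, with `speedup` as an illustrative helper name:

```python
def speedup(baseline_seconds, new_seconds):
    """Ratio of baseline runtime to improved runtime."""
    return baseline_seconds / new_seconds


sequential_cpp = 10 * 60 + 37    # 637 s
fiji_plugin = 3.6 * 3600         # reported as "> 3.6 hours", so a lower bound
pipelined_hybrid = 25

vs_cpp = round(speedup(sequential_cpp, pipelined_hybrid), 1)
vs_fiji = round(speedup(fiji_plugin, pipelined_hybrid))
```

Since the Fiji time is a lower bound, 518x is a lower bound on that speedup as well.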

Closure: Java Implementation
- Single-threaded version executes the full grid in ~45 minutes using the FFTW native interface
- Multi-threaded version executes in ~4 minutes
- Optimized version uses native intrinsics for computing cross correlation
- Provides simple integration into the Fiji application
- Need to provide JNI for GPU functions
  - JCUDA

Thank You
Questions?