ARM-DSP Multicore Considerations: CT Scan Example

CT Scan Machine
An X-Ray source rotates around the body, and a set of detectors (1D or 2D) collects absorption values.
Requirements:
– Minimize exposure to X-Rays
– Obtain a good-quality view of the internal anatomy
A technician monitors the quality of the imaging in real time.

Computerized Tomography – Trauma Case
Hundreds of slices are scanned to find out whether there is internal damage; each slice takes about a second.
At the same time, a physician sits in front of the display and can manipulate the images:
– Rotate the images
– Color certain values
– Apply edge detection and other image processing algorithms

CT Algorithm (1)
A typical system has a parallel source of X-Rays and a set of detectors. Each detector measures the line integral of the absorption along the path between the source and the detector.
The source rotates around the body and the detectors collect N sets of data.
The next slide demonstrates the geometry (taken from Digital Picture Processing, Rosenfeld and Kak).

CT Algorithm (2)

CT Algorithm (3)
CT processing has four parts:
– Preprocessing
– Back projector
– Post-processing
– 3D processing

Pre-processing
Performed on each individual set of data collected from all detectors at a single angle:
– Conversion from absorption values to real values
– Compensation for variance in the geometry and the detectors
– Dead detectors are interpolated
– Interpolation using FFT-based convolution (x -> 2x)
– Other vector filtering operations (the vendor's "secret sauce")
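
As a rough illustration of this flow, here is a minimal C sketch of per-vector preprocessing. The -log() conversion with an assumed air-calibration constant and the neighbor averaging for dead detectors are assumptions made for this sketch; the FFT-based interpolation and the proprietary filters are only marked as placeholders.

```c
/* Hedged sketch of per-vector preprocessing (one projection at one angle).
 * Assumptions not taken from the slides: the AIR_REF calibration value,
 * the -log() conversion, and neighbor averaging for dead detectors. */
#include <math.h>
#include <stdio.h>

#define NUM_DET 800            /* detectors per vector (from "My System") */

void preprocess_vector(const unsigned short raw[NUM_DET],
                       const unsigned char dead[NUM_DET],
                       float out[NUM_DET])
{
    const float AIR_REF = 65535.0f;              /* assumed calibration value */

    /* Convert absorption counts to line-integral (real) values. */
    for (int i = 0; i < NUM_DET; i++)
        out[i] = -logf(((float)raw[i] + 1.0f) / AIR_REF);

    /* Patch dead detectors by interpolating between live neighbors. */
    for (int i = 1; i < NUM_DET - 1; i++)
        if (dead[i])
            out[i] = 0.5f * (out[i - 1] + out[i + 1]);

    /* FFT-based x->2x interpolation and the remaining vector filters
     * ("secret sauce") would follow here; they are omitted in this sketch. */
}

int main(void)
{
    static unsigned short raw[NUM_DET];
    static unsigned char dead[NUM_DET];
    static float out[NUM_DET];

    for (int i = 0; i < NUM_DET; i++) raw[i] = 30000 + (i % 100);
    dead[400] = 1;                               /* mark one detector dead */

    preprocess_vector(raw, dead, out);
    printf("sample[400] = %f\n", out[400]);
    return 0;
}
```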

Back Projector
For each pixel, the back projector accumulates the contributions of all the rays that pass through that pixel.
It involves interpolating between two adjacent rays and adding the result to the pixel value.
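
A minimal C sketch of the accumulation loop described here, assuming a parallel-beam geometry and linear interpolation between the two nearest detector samples; the geometry constants and function names are illustrative, not taken from the slides.

```c
/* Hedged sketch of parallel-beam backprojection of one filtered vector. */
#include <math.h>

#define IMG_N    512          /* image is 512x512 (from "My System") */
#define NUM_DET  800          /* detectors per vector */

/* Accumulate the contribution of one projection (at angle 'theta')
 * into the image; linear interpolation between the two nearest rays. */
void backproject_vector(float image[IMG_N][IMG_N],
                        const float vec[NUM_DET],
                        float theta)
{
    const float c = cosf(theta), s = sinf(theta);
    const float center     = 0.5f * (IMG_N - 1);
    const float det_center = 0.5f * (NUM_DET - 1);

    for (int y = 0; y < IMG_N; y++) {
        for (int x = 0; x < IMG_N; x++) {
            /* Distance of this pixel from the rotation center,
             * projected onto the detector axis. */
            float t  = (x - center) * c + (y - center) * s + det_center;
            int   i0 = (int)floorf(t);
            float w  = t - (float)i0;

            if (i0 >= 0 && i0 + 1 < NUM_DET)
                image[y][x] += (1.0f - w) * vec[i0] + w * vec[i0 + 1];
        }
    }
}

int main(void)
{
    static float image[IMG_N][IMG_N];
    static float vec[NUM_DET];
    backproject_vector(image, vec, 0.0f);
    return 0;
}
```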

Post-processing
Image processing: 2D filtering on the image
– Scatter (or anti-scatter) filter
– Smoothing filter (LPF) for a smoother image
– Edge detection filter (the opposite) for identifying edges
Setting the range for display (floating-point to fixed-point conversion)
Other image-based operations
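
A hedged sketch of two of the steps above: a simple 3x3 smoothing filter and the floating-point to fixed-point conversion for display. The 12-bit output range and the box kernel are assumptions made purely for illustration.

```c
/* Sketch of two post-processing steps: 3x3 smoothing (low-pass) and
 * float-to-fixed windowing for display. */
#include <stdint.h>

#define IMG_N 512

/* Simple 3x3 box smoothing; border pixels are left unchanged. */
void smooth3x3(const float in[IMG_N][IMG_N], float out[IMG_N][IMG_N])
{
    for (int y = 1; y < IMG_N - 1; y++)
        for (int x = 1; x < IMG_N - 1; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += in[y + dy][x + dx];
            out[y][x] = acc / 9.0f;
        }
}

/* Map a floating-point display window [lo, hi] onto 12-bit fixed point. */
void float_to_fixed(const float in[IMG_N][IMG_N], uint16_t out[IMG_N][IMG_N],
                    float lo, float hi)
{
    const float scale = 4095.0f / (hi - lo);
    for (int y = 0; y < IMG_N; y++)
        for (int x = 0; x < IMG_N; x++) {
            float v = (in[y][x] - lo) * scale;
            out[y][x] = (uint16_t)(v < 0.0f ? 0.0f : (v > 4095.0f ? 4095.0f : v));
        }
}
```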

3-D Processing
Multiple slices are used to build a 3D model.

“My System”
– 800 detectors in a vector
– Scan time: 1 second per slice
– 360 vectors per slice
– Image size: 512x512 pixels
– 400 slices per scan

Memory and IO Considerations
Input data per second: 800 * 360 * 2 = 562.5K bytes
Total input data: 562.5K x 400 ≈ 220MB
Image memory holds 400 slices, 512x512 floating point each: 512x512x4x400 = 400MB
Pre-processing of a single vector:
– Input size (single vector): 800 samples x 2 bytes – less than 2KB
– Largest intermediate size (interpolated to 2048, complex floating point): 2048*4*2 = 16KB
– Output size (real): 2048 x 4 = 8KB
– Filter coefficients, convolution coefficients, etc. – 16KB
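
The byte counts above can be checked with a few lines of C; the macro names below are only for this sketch.

```c
/* Recompute the data-size budget from the "My System" parameters. */
#include <stdio.h>

#define DETECTORS         800          /* samples per vector    */
#define BYTES_PER_SAMPLE  2            /* raw detector sample   */
#define VECTORS_PER_SLICE 360
#define SLICES            400
#define IMG_N             512
#define BYTES_PER_PIXEL   4            /* single-precision float */

int main(void)
{
    double slice_in  = (double)DETECTORS * BYTES_PER_SAMPLE * VECTORS_PER_SLICE;
    double total_in  = slice_in * SLICES;
    double image_mem = (double)IMG_N * IMG_N * BYTES_PER_PIXEL * SLICES;

    printf("input per slice : %.1f KB\n", slice_in / 1024.0);             /* ~562.5 KB */
    printf("total input     : %.1f MB\n", total_in / (1024.0 * 1024.0));  /* ~220 MB   */
    printf("image memory    : %.0f MB\n", image_mem / (1024.0 * 1024.0)); /* 400 MB    */
    return 0;
}
```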

Memory and IO Considerations: “My System”
Back projector processing:
– Image size: 512x512x4 = 1MB
Post-processing and 3D processing:
– 400 images, 400MB

System Considerations
When scanning starts, images are processed at a rate of one image per second. The operator needs to verify that all the settings and the configuration are correct.
The operator looks at a single image at a time and changes the image settings. The operator views images more slowly than one per second and needs flexibility in setting image parameters and configurations. The same is true for 3D manipulations.
The operator does not have to look at every image, so the image reconstruction rate is one per second, while the image (or 3D) display rate is much slower.

ARM-DSP Considerations
The ARM NEON unit and the FPU are optimized for vector/matrix byte or half-word operations, and Linux has many image processing and 3D emulation libraries available. Intuitively, it looks like the ARM core will do all the image processing and the 3D processing.
DSP cores are very good at filtering, FFT, convolution and the like, and the back projector algorithm can easily be implemented in DSP code. Intuitively, it looks like the 8 DSP cores will do the preprocessing and the back projection.

My System Architecture

Building and Moving the Image
In my system, the DSPs build the images in the shared memory:
– Unless an image is shared by multiple DSPs (details later), L2 is too small for a complete image (1MB), while MSMC memory is large enough (up to 6MB) and closer to the DSP cores than the DDR.
Moving a complete back-projected image to DDR can be done in multiple ways (see the sketch below):
– One or more DSP cores
– EDMA (in a ping-pong buffer setting)
– The ARM core can read directly from the MSMC memory and write the processed image to DDR (again, in a ping-pong setting)
Regardless of the method, communication between the DSPs and the ARM is essential.
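
A runnable sketch of the ping-pong idea: one buffer is being filled while the previous one is being moved out. The transfer here is a plain memcpy standing in for EDMA (or a core-driven copy), and the chunk size and buffer names are assumptions for this sketch only.

```c
/* Hedged sketch of ping-pong (double-buffered) movement of a finished image
 * from MSMC to DDR. The arrays below stand in for MSMC and DDR regions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK   (64 * 1024)             /* bytes moved per transfer (assumed) */
#define IMAGE   (512 * 512 * 4)         /* one 512x512 float image            */

/* Stand-ins for an asynchronous DMA; a real EDMA setup would overlap
 * the copy with the next produce_chunk() call. */
static void dma_submit(uint8_t *dst, const uint8_t *src, size_t n) { memcpy(dst, src, n); }
static void dma_wait(void) { /* nothing to wait for in this stand-in */ }

/* Produce the next chunk of reconstructed pixels into 'dst' (dummy data here). */
static void produce_chunk(uint8_t *dst, size_t n, size_t offset)
{
    memset(dst, (int)(offset / CHUNK), n);
}

int main(void)
{
    static uint8_t msmc_ping[CHUNK], msmc_pong[CHUNK];   /* would live in MSMC */
    static uint8_t ddr_image[IMAGE];                     /* would live in DDR  */
    uint8_t *buf[2] = { msmc_ping, msmc_pong };
    int cur = 0;

    for (size_t off = 0; off < IMAGE; off += CHUNK) {
        size_t n = (IMAGE - off < CHUNK) ? IMAGE - off : CHUNK;
        produce_chunk(buf[cur], n, off);       /* build chunk in "MSMC"     */
        dma_wait();                            /* previous transfer done    */
        dma_submit(ddr_image + off, buf[cur], n);
        cur ^= 1;                              /* swap ping and pong        */
    }
    dma_wait();
    printf("moved %d bytes\n", IMAGE);
    return 0;
}
```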

Partition Considerations: Image Processing
The ARM core L2 cache is 4MB and a single image is 1MB.
Image processing can be done in multiple ways:
– Each A15 processes an image
– Each A15 processes part of an image
– Each A15 processes a different algorithm
The 3D processing partition depends on the algorithm.

Image Processing: Each A15 Processes a Different Image
The A15 supports write-through for the L1 and L2 caches. If the algorithm can work in place (or does not need intermediate results), each A15 can process a different image; each image is 1MB.
Advantage: same code, simple control
Disadvantage: longer delay; not efficient if intermediate results are needed (the not-in-place case)

Image Processing: Each A15 Processes Part of the Image
The cache size supports double buffering and a working area. Intermediate results can be maintained in the cache, with the output written to a non-cacheable area.
Advantage: all data and scratch buffers fit in the cache; efficient processing; small delay
Disadvantage: need to pay attention to the boundaries between the processing zones of each A15 (see the sketch below)
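
A sketch of this option: the 512 image rows are split into four horizontal bands, one per core, and a 3x3 filter on a band boundary simply reads rows owned by the neighbouring band (the input is read-only, so no synchronisation is needed for that). OpenMP stands in here for whatever ARM-side threading model is actually used.

```c
/* Hedged sketch of band-partitioned image filtering across four A15 cores. */
#include <omp.h>

#define IMG_N  512
#define CORES  4

void smooth3x3_banded(const float in[IMG_N][IMG_N], float out[IMG_N][IMG_N])
{
    #pragma omp parallel num_threads(CORES)
    {
        int band = omp_get_thread_num();
        int y0   = band * (IMG_N / CORES);
        int y1   = y0 + (IMG_N / CORES);

        /* Skip the outermost image rows; band-internal boundaries are fine
         * because the reads simply cross into the neighbour's rows of 'in'. */
        if (y0 == 0)     y0 = 1;
        if (y1 == IMG_N) y1 = IMG_N - 1;

        for (int y = y0; y < y1; y++)
            for (int x = 1; x < IMG_N - 1; x++) {
                float acc = 0.0f;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        acc += in[y + dy][x + dx];
                out[y][x] = acc / 9.0f;
            }
    }
}
```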

Image Processing: Each A15 Processes Part of the Algorithm
This pipelines the processing, with four images in the pipeline, and requires buffers.
Advantage: each A15 runs a smaller program that can fit in the L1 program cache
Disadvantage: complex algorithm; difficult to balance the load between the A15 cores; not efficient if the data does not fit inside the cache

So What Method Would You Choose?
It looks like the second case (each A15 processes a quarter of the image) is the preferred model.

DSP Cores Partition
There are 360 vectors per slice, and the complete preprocessing of each vector is done by a single DSP core; there is no reason to divide the preprocessing of a single vector between multiple DSPs.
Partitioning preprocessing and back projection:
– Some DSPs do preprocessing and some do back projection (functional partition), OR
– All DSPs do both preprocessing and back projection (data partition)

Partition Back Projector and Preprocessing
N DSPs do preprocessing and 8-N do back projection, where N is chosen to balance the load (as much as possible).
Vectors are moved from the preprocessing DSPs to the back projector DSPs using shared memory or the Navigator (see the queue sketch below).
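
A sketch of one possible hand-off between the preprocessing DSPs and the back-projection DSPs: a single-producer, single-consumer ring of vector descriptors. C11 atomics are used so the sketch is self-contained; a real KeyStone design would more likely use the Multicore Navigator or an IPC library, so treat this purely as an illustration.

```c
/* Hedged sketch of an SPSC ring of preprocessed vectors in shared memory. */
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SLOTS 16
#define NUM_DET    800

typedef struct {
    float data[NUM_DET];      /* one preprocessed vector */
    float theta;              /* its projection angle    */
} vec_msg_t;

typedef struct {
    vec_msg_t slot[RING_SLOTS];
    _Atomic unsigned head;    /* next slot to write (producer) */
    _Atomic unsigned tail;    /* next slot to read  (consumer) */
} vec_ring_t;

/* Producer side (preprocessing DSP). Returns false if the ring is full. */
bool ring_push(vec_ring_t *r, const vec_msg_t *m)
{
    unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h - t == RING_SLOTS) return false;       /* full */
    r->slot[h % RING_SLOTS] = *m;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

/* Consumer side (back-projection DSP). Returns false if the ring is empty. */
bool ring_pop(vec_ring_t *r, vec_msg_t *m)
{
    unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (h == t) return false;                    /* empty */
    *m = r->slot[t % RING_SLOTS];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}
```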

Input Vector Partition
Each DSP sums a complete image:
– Vectors are divided between the DSPs
– Because of race conditions, each DSP has a local (private) image; at the end, one DSP (or the ARM) combines all the local images. The private images reside outside the core.
– Does not require broadcasting of vector values
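
A sketch of this option: each DSP backprojects its share of the vectors into a private image, and one core merges the private images at the end. The round-robin vector split and the buffer names are assumptions; backproject_vector() is the routine sketched earlier, declared extern here.

```c
/* Hedged sketch of per-DSP private images followed by a final merge. */
#define IMG_N    512
#define NUM_DSP  8
#define VECTORS  360
#define NUM_DET  800

extern void backproject_vector(float img[IMG_N][IMG_N], const float *vec, float theta);

static float private_img[NUM_DSP][IMG_N][IMG_N];   /* one per DSP, outside L2 */

/* Run on DSP 'core': process only the vectors assigned to that core. */
void build_private_image(int core, const float vecs[VECTORS][NUM_DET],
                         const float thetas[VECTORS])
{
    for (int v = core; v < VECTORS; v += NUM_DSP)   /* round-robin split */
        backproject_vector(private_img[core], vecs[v], thetas[v]);
}

/* Run once at the end (on one DSP or on the ARM): merge the private images. */
void merge_images(float final_img[IMG_N][IMG_N])
{
    for (int y = 0; y < IMG_N; y++)
        for (int x = 0; x < IMG_N; x++) {
            float acc = 0.0f;
            for (int c = 0; c < NUM_DSP; c++)
                acc += private_img[c][y][x];
            final_img[y][x] = acc;
        }
}
```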

Input Vector Partition
Each DSP sums part of the image:
– Vectors are broadcast to all DSPs
– Depending on the number of cores, the partial image can fit inside L2
– The partial images are merged into shared memory (DDR) at the end of the partial build

All DSPs Do Preprocessing and Back Projection
Just like the previous case, there are two options: either each DSP builds a complete private image and all the private images are then combined into the final image, or each DSP builds part of the image.

Input Vector Partition
Each DSP sums a complete image:
– Raw vectors are divided between the DSPs
– Because of race conditions, each DSP has a local (private) image; at the end, one DSP (or the ARM) combines all the local images. The private image resides outside the core.
– Does not require broadcasting of raw vector values

Input Vector Partition
Each DSP sums part of the image:
– Raw vectors are broadcast to all DSPs
– Depending on the number of cores, the partial image can fit inside L2 in addition to the filter coefficients
– The partial images are merged into shared memory (DDR) at the end of the partial build

So What Method Would You Choose?
The optimal solution depends on the exact configuration of the system (number of detectors, number of rays, number of slices) and the algorithms that are used.
The suggestion is to benchmark all (or a subset) of the possibilities and choose the best one for the specific problem.
Any questions?