Parallel Image Processing: Programming and Architecture
IST PhD Lunch Seminar, October 26, 2006
Wouter Caarls, Quantitative Imaging Group
Why Parallel?
- Processing time: smaller timesteps, more scales, faster response times
- Memory: larger images, more dimensions
- Energy consumption: more applications, smaller devices
Data parallelism
- Many image processing operations have locality of reference (segmentation, filtering, distance transforms, etc.), so different parts of the image can be processed at the same time; see the sketch below.
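As an illustration (not from the talk), a minimal data-parallel sketch in C with OpenMP: each output pixel depends only on local input, so rows can be processed independently. The function name and row-major 8-bit image layout are assumptions.

    /* Data parallelism: per-pixel threshold; every iteration is
     * independent, so the rows are split over the available cores.
     * Compile with -fopenmp. */
    #include <omp.h>

    void threshold(const unsigned char *in, unsigned char *out,
                   int width, int height, unsigned char t)
    {
        #pragma omp parallel for
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                out[y * width + x] = (in[y * width + x] > t) ? 255 : 0;
    }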
Task farm parallelism
- An application consists of many different operations
- Some of these operations are independent (scale spaces, parameter sweeps, noise realizations, etc.) and can be farmed out to whichever worker is free; see the sketch below.
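A minimal task-farm sketch, assuming a hypothetical parameter sweep: each parameter value is an independent job, and the dynamic schedule plays the role of the farmer handing jobs to idle workers. All names are illustrative.

    /* Task farm: one independent job per parameter value. */
    void parameter_sweep(const unsigned char *in, unsigned char **out,
                         int npixels, int n_params,
                         const unsigned char *thresholds)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int p = 0; p < n_params; p++)
            for (int i = 0; i < npixels; i++)
                out[p][i] = (in[i] > thresholds[p]) ? 255 : 0;
    }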
Pipeline parallelism
- An image processing algorithm consists of consecutive stages
- If multiple objects are to be processed, they may be in different stages at the same time; see the sketch below.
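A minimal two-stage pipeline sketch in C with pthreads and a one-slot buffer (entirely illustrative): while stage 2 processes object k, stage 1 can already produce object k+1. It also shows the synchronization bookkeeping that a later slide lists as an obstacle.

    #include <pthread.h>
    #include <stdio.h>

    static int slot, full = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    static void put(int v)          /* called by stage 1 */
    {
        pthread_mutex_lock(&m);
        while (full) pthread_cond_wait(&c, &m);
        slot = v; full = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }

    static int get(void)            /* called by stage 2 */
    {
        pthread_mutex_lock(&m);
        while (!full) pthread_cond_wait(&c, &m);
        int v = slot; full = 0;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
        return v;
    }

    static void *stage2(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 10; i++)
            printf("stage 2 processes object %d\n", get());
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, stage2, NULL);
        for (int i = 0; i < 10; i++)
            put(i);                 /* stage 1 produces object i */
        pthread_join(t, NULL);
        return 0;
    }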
Parallel hardware architectures: fine grained
- Irregular
  - Superscalar (most modern microprocessors)
  - VLIW (DSPs)
- Regular
  - Vector (supercomputers, MMX)
  - SIMD (graphics processors)
- Custom
  - FPGA
Parallel hardware architectures: coarse grained
- Homogeneous
  - Multi-core, SMP
  - Cluster
- Heterogeneous
  - Embedded systems
  - Grid
Obstacles
- Programming
  - Synchronization, bookkeeping
  - Different systems, languages, optimization strategies
- Choosing an architecture
  - The program would have to be analyzed before it is written
  - Additional requirements or unexpected performance may require a rewrite
Architecture-independent parallel programming
- Data parallelism
  - Differentiate between the synchronization pattern and the computation
  - The library provides the pattern, the user provides the computation (see the sketch below)
- Task farm & pipeline parallelism
  - Operations do not work on images, but on streams
  - A sequence of operation calls does not imply an order, but defines a stream graph
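A sketch of the pattern/computation split for the data-parallel case, with a hypothetical API (the actual library interface may differ): the skeleton owns the parallel loop, the user passes only a per-pixel function.

    #include <omp.h>

    typedef unsigned char (*pixel_op)(unsigned char);

    /* The skeleton owns the (parallel) iteration pattern... */
    void pixel_skeleton(const unsigned char *in, unsigned char *out,
                        int n, pixel_op op)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            out[i] = op(in[i]);   /* ...the user supplies only the computation */
    }

    static unsigned char invert(unsigned char p) { return (unsigned char)(255 - p); }
    /* usage: pixel_skeleton(in, out, width * height, invert); */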
Algorithmic Skeletons
[diagram]
Example skeletons
- Pixel
- Neighbourhood (sketched below)
- Recursive neighbourhood
- Stack filter
- Associative reduction
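For example, a neighbourhood skeleton might look as follows: a sketch with a hypothetical fixed 3x3 window API and clamped borders, not the real library's interface.

    /* Neighbourhood skeleton: the library walks the image and hands
     * the user's operator a window around each pixel. */
    typedef unsigned char (*nbh_op)(unsigned char win[3][3]);

    void neighbourhood_skeleton(const unsigned char *in, unsigned char *out,
                                int w, int h, nbh_op op)
    {
        #pragma omp parallel for
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                unsigned char win[3][3];
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy < 0) yy = 0; if (yy >= h) yy = h - 1;
                        if (xx < 0) xx = 0; if (xx >= w) xx = w - 1;
                        win[dy + 1][dx + 1] = in[yy * w + xx];
                    }
                out[y * w + x] = op(win);
            }
    }

    static unsigned char local_max(unsigned char win[3][3])
    {
        unsigned char m = 0;
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                if (win[i][j] > m) m = win[i][j];
        return m;
    }
    /* usage: neighbourhood_skeleton(in, out, w, h, local_max); */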
Constructing stream graphs
- By program (dynamic), see the sketch below:

    capture(orig);
    normalize(orig, norm);
    dx(orig, x_der, 1.0);
    dy(orig, y_der, 1.0);
    direction(x_der, y_der, dir);
    display(dir);

- Visually (static): [graph with nodes capture, normalize, dx, dy, direction, display]
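The point of the dynamic form is that these calls do not execute immediately; they record the nodes and edges of a graph that is scheduled later. A self-contained sketch of such deferred calls (the data structures are entirely hypothetical):

    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 16
    #define MAX_ARGS  4

    struct node { const char *op; int in[MAX_ARGS]; int n_in; };
    static struct node graph[MAX_NODES];
    static int n_nodes = 0;

    /* A deferred "call": record the operation and its input streams
     * instead of executing anything; the return value is the id of
     * the produced output stream. */
    static int emit(const char *op, int n_in, const int in[])
    {
        graph[n_nodes].op = op;
        graph[n_nodes].n_in = n_in;
        if (n_in)
            memcpy(graph[n_nodes].in, in, n_in * sizeof(int));
        return n_nodes++;
    }

    int main(void)
    {
        int orig  = emit("capture", 0, NULL);
        emit("normalize", 1, (int[]){orig});
        int x_der = emit("dx", 1, (int[]){orig});
        int y_der = emit("dy", 1, (int[]){orig});
        int dir   = emit("direction", 2, (int[]){x_der, y_der});
        emit("display", 1, (int[]){dir});

        /* A scheduler now sees that dx and dy are independent and
         * can map them to different processors. */
        for (int i = 0; i < n_nodes; i++)
            printf("node %d: %s\n", i, graph[i].op);
        return 0;
    }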
Mapping stream graphs to processors
[figure: a stream graph partitioned over Processor 1 and Processor 2]
Dealing with heterogeneous tasks
[figure: tasks mapped across Processor 1 and Processor 2]
Dealing with interconnect
[figure: Processor 1 and Processor 2 connected by an interconnect]
Dealing with dependencies
[figure: dependent tasks scheduled across Processor 1, Processor 2, and the interconnect, with per-task timing annotations]
Choosing an architecture automatically
- An architecture-independent program allows automatic analysis after it is written, but before an architecture is chosen
- Based on certain constraints, an architecture can then be chosen automatically to optimize some cost function
- The tradeoff between cost, power, and performance must still be made by the designer
Design Space Exploration
[diagram: a loop in which the program and a candidate architecture are analyzed to produce metrics, which drive exploration of further architectures]
Search strategy: constrained single objective
[plot: cost versus performance, minimizing cost subject to a minimum-performance constraint]
Search strategy: multiobjective tradeoff
[plot: cost versus performance, with the tradeoff front refined per iteration]
Search strategy: strength Pareto
[plot: cost versus performance, showing the Pareto front of non-dominated designs]
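The core test behind the strength-Pareto strategy is Pareto dominance over the two objectives in these plots; a minimal sketch (struct and field names illustrative, with performance expressed as execution time so both objectives are minimized):

    #include <stdbool.h>

    struct design { double cost; double time; };

    /* a dominates b if it is no worse in both objectives and strictly
     * better in at least one; the search keeps the non-dominated set. */
    static bool dominates(struct design a, struct design b)
    {
        return a.cost <= b.cost && a.time <= b.time
            && (a.cost < b.cost || a.time < b.time);
    }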
Conclusions
- Architecture-independent programming allows
  - Parallel programming without bookkeeping
  - Targeting heterogeneous systems
  - Choosing the most appropriate architecture automatically
Overview
- Parallelism in image processing
- Parallel hardware architectures
- Architecture-independent parallel programming
  - Algorithmic skeletons
  - Stream programming
- Choosing an appropriate architecture
  - Design Space Exploration
Exploiting parallelism: fine grained, irregular
- Superscalar
  - Dataflow dispatch & reorder
  - Most modern microprocessors
  - Automatic, by the processor
- Very Long Instruction Word (VLIW)
  - Multiple instructions per word
  - DSPs, Itanium
  - "Automatic", by the compiler
[diagram: instructions dispatched to multiple execution units]
Exploiting parallelism: fine grained, regular
- Vector instructions
  - Supercomputers, MMX/SSEx
  - Special instructions/datatypes (see the sketch below)
- Single Instruction Multiple Data (SIMD)
  - Graphics processors
  - Special languages
[diagram: one instruction operating on multiple data elements]
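As an example of such special instructions, a minimal SSE2 sketch in C: a single intrinsic adds 16 unsigned 8-bit pixels at once, with saturation. To keep it short, it assumes n is a multiple of 16 and the buffers are 16-byte aligned.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    void add_images(const unsigned char *a, const unsigned char *b,
                    unsigned char *out, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            /* one instruction: 16 saturating byte additions */
            _mm_store_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
        }
    }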
Exploiting parallelism: coarse grained
- Multiprocessing
  - Multiple processors/cores sharing a memory
  - Shared-memory threading libraries (pthread, OpenMP)
- Clusters
  - Relatively loosely coupled systems connected by a network
  - Message-passing libraries (MPI); see the sketch below
- Heterogeneous systems
  - Exploit differences in algorithmic requirements
  - Multiple paradigms in a single application
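A minimal message-passing sketch in C with MPI: each process thresholds its own block of rows, and the root gathers the results. It assumes MPI_Init has already been called, the full image lives on rank 0, and the height divides evenly over the processes; the function name is hypothetical.

    #include <stdlib.h>
    #include <mpi.h>

    void cluster_threshold(unsigned char *img, unsigned char *out,
                           int width, int height)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = width * (height / size);
        unsigned char *my_in  = malloc(chunk);
        unsigned char *my_out = malloc(chunk);

        /* distribute row blocks, compute locally, collect results */
        MPI_Scatter(img,   chunk, MPI_UNSIGNED_CHAR,
                    my_in, chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
        for (int i = 0; i < chunk; i++)
            my_out[i] = (my_in[i] > 128) ? 255 : 0;
        MPI_Gather(my_out, chunk, MPI_UNSIGNED_CHAR,
                   out,    chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

        free(my_in);
        free(my_out);
    }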