
1 Summary
Problem: Exponential performance gap. Computer architectures have transitioned from exponential frequency scaling to parallelism, ending decades of free exponential performance gains. The natural "MapReduce" belief propagation (BP) algorithm is embarrassingly parallel but highly inefficient: it is asymptotically slower than efficient sequential algorithms.
Solution: Explore the limiting sequential structure using chain graphical models. Introduce an approximation which improves parallel performance. Propose ResidualSplash, a new parallel dynamic BP algorithm, and show that it performs optimally on chain graphical models in the approximate inference setting.
Results: We demonstrate that the new algorithm outperforms existing techniques on two real-world tasks.

2 Many Core Revolution
Transition from exponential frequency scaling to exponential parallelism.
[Figure: number of cores per processor versus year (1970 to 2005 and beyond), from single-core chips (4004, 8008, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon) to multi-core designs (Power4, Opteron, Power6, Niagara, Yonah, Tanglewood, Cell, Xbox360, Raw, Intel Tflops, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Ambric AM2045), spanning 1 to 512 cores. Graph courtesy of Saman Amarasinghe. Parallel performance keeps growing while single-processor performance flattens, producing an exponentially growing gap.]

3 Inference in Markov Random Fields
[Figure: a 3x3 grid MRF over variables X1-X9 relating the pixels of a noisy image to the predicted image, with unary and binary potentials.]
Pairwise Markov Random Field (MRF): a graph encoding conditional independence assumptions, with factors encoding functional dependencies.
Inference objective: compute the marginal distribution of every variable.

4 Loopy Belief Propagation
Loopy belief propagation: an approximate inference method that is exact on trees.
At convergence: [belief equations shown on the slide].
[Figure: message passing among variables X1-X5.]
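A minimal sketch of the sum-product message update that loopy BP repeats on every edge. The data layout (dicts of NumPy arrays keyed by vertex and edge) is an illustrative assumption, not the implementation from the talk.

```python
import numpy as np

def send_message(i, j, unary, pairwise, messages, neighbors):
    """Message from variable i to neighbor j:
    m_ij(x_j) ∝ sum_{x_i} φ_i(x_i) ψ_ij(x_i, x_j) Π_{k in N(i)\\{j}} m_ki(x_i)."""
    belief = unary[i].copy()                      # φ_i(x_i)
    for k in neighbors[i]:
        if k != j and (k, i) in messages:
            belief = belief * messages[(k, i)]    # product of other inbound messages
    msg = pairwise[(i, j)].T @ belief             # marginalize out x_i
    return msg / msg.sum()                        # normalize for numerical stability
```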

5 Levels of Parallelism
Message level parallelism: make a single message update calculation run in parallel; limited by the complexity of individual variables.
Graph level parallelism: simultaneously update multiple messages; more "parallelism" for larger models.
Running time definition: message calculations are treated as unit-time operations, and running time is measured in message calculations.

6 "MapReduce" Belief Propagation
Update all messages simultaneously using p ≤ 2(n-1) processors.
[Figure: each CPU reads the old messages (round t-1) from read-only shared memory and writes the new messages (round t); the process iterates.]
Chain graphs provide a challenging performance benchmark: a chain of n vertices needs n rounds (t = 1, ..., n), each performing roughly 2n message calculations.
Running time: n rounds of 2(n-1) message calculations each, i.e. O(n²/p) on p processors.
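A hedged sketch of one round of synchronous ("MapReduce") BP: every new message is computed only from the previous round's messages, so all edge updates can be mapped over the processors, but a chain still needs n rounds before information crosses it end to end. It reuses the hypothetical `send_message` sketch above; names are assumptions.

```python
def synchronous_bp_round(directed_edges, unary, pairwise, old_messages, neighbors):
    new_messages = {}
    for (i, j) in directed_edges:                 # each update is independent
        new_messages[(i, j)] = send_message(i, j, unary, pairwise,
                                            old_messages, neighbors)
    return new_messages
```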

7 Efficient Chain Scheduling
Optimal sequential scheduling (one processor): send messages left to right and then right to left, taking 2n rounds (t = 1, ..., n+1, ..., 2n).
Optimal parallel scheduling (two processors): send messages left to right and right to left at the same time, taking n rounds (t = 1, ..., n).
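A sketch of the efficient single-processor chain schedule: one left-to-right sweep followed by one right-to-left sweep sends each directed message exactly once, for 2(n-1) message calculations in total. It reuses the hypothetical `send_message` sketch above; vertices are numbered 0..n-1.

```python
def chain_forward_backward(n, unary, pairwise, neighbors):
    messages = {}
    for i in range(n - 1):                        # left-to-right sweep
        messages[(i, i + 1)] = send_message(i, i + 1, unary, pairwise,
                                            messages, neighbors)
    for i in range(n - 1, 0, -1):                 # right-to-left sweep
        messages[(i, i - 1)] = send_message(i, i - 1, unary, pairwise,
                                            messages, neighbors)
    return messages
```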

8 Efficiency Gap!
For p < n the "MapReduce" algorithm is slower than the efficient single-processor algorithm, and it cannot efficiently use more than two processors.
"MapReduce" parallel: n rounds of 2n messages. Optimal single processor: 2n rounds. Optimal parallel (p = 2): n rounds. A factor-n gap!

9 Breaking Sequentiality with τε-Approximation
Message errors decay over paths.
The value of τε: the maximum length of dependencies for a given accuracy ε. It is not known in practice and not known to the algorithm.
[Figure: on a chain of vertices 1-10, the τε-approximation of a message depends only on the τε nearest vertices rather than on the full chain of true messages.]
Based on work by [Ihler et al., 2005].

10 Synchronous BP and τε-Approximation
For an approximate marginal, we only need to consider a small τε subgraph.
Theorem: Given an acyclic MRF with n vertices, a τε-approximation is obtained by running parallel synchronous BP on p processors (p ≤ 2n) in the running time shown on the slide: τε synchronous rounds of roughly 2n message calculations each, spread over the p processors.

11 Optimal Approximate Inference
Step 1: Evenly partition the vertices across the processors.
Step 2: Run sequential exact inference on each "tree" in parallel, one partition per processor.
We obtain the running time on chain graphs shown on the slide, with a fixed amount of work per iteration on each processor.
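An illustrative sketch of the partitioned schedule: split the n chain vertices into p contiguous blocks and run exact sequential inference inside each block concurrently. The function `run_block` is a hypothetical callable (for example, the forward-backward sweep above restricted to one block).

```python
from concurrent.futures import ThreadPoolExecutor

def partitioned_inference(n, p, run_block):
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]   # even partition
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [pool.submit(run_block, lo, hi) for lo, hi in bounds]
        return [f.result() for f in futures]                      # one result per block
```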

12 Theorem: For an arbitrary chain graphical model with n vertices and p processors, a τε-approximation cannot in general be computed with fewer message updates than the bound shown on the slide.
Proof sketch: after the k-th iteration of parallel message computations in one direction, compare the total required work in one direction with the maximum possible work done by a single processor, and solve for k.

13 Splash Operation
Generalizes optimal tree inference.
SendMessages routine: using all current inbound messages of a vertex, compute all of its outbound messages.
Splash: construct a BFS tree of a fixed size; starting at the leaves, invoke SendMessages on each vertex (e.g. in the order 13, 12, 11, ..., 1); then, starting at the root, invoke SendMessages on each vertex (1, 2, 3, ..., 13).
[Figure: Splash(1) on a 13-vertex BFS tree, with SendMessages(8) illustrated on vertices 1, 2, 3, 7, 8, 9.]
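A sketch of the Splash operation described on this slide: grow a bounded-size BFS tree around a root, call SendMessages on every vertex from the leaves toward the root, then again from the root toward the leaves. The graph representation and the `send_messages` callable are illustrative assumptions.

```python
from collections import deque

def splash(root, neighbors, send_messages, max_size):
    order, visited, queue = [], {root}, deque([root])
    while queue and len(order) < max_size:        # bounded breadth-first search
        v = queue.popleft()
        order.append(v)
        for u in neighbors[v]:
            if u not in visited:
                visited.add(u)
                queue.append(u)
    for v in reversed(order):                     # leaves -> root pass
        send_messages(v)
    for v in order:                               # root -> leaves pass
        send_messages(v)
```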

14 Scheduling Splashes
Not all vertices are equal: some regions of the graph are difficult and others easy, so some vertices need to be updated more often than others.
[Figure: two Splashes, A and B, at times t and t+1; re-updating an easy, already-converged region is wasted work, while updating a difficult region is useful work.]

15 Residual Scheduling
Intuition: prioritize updating the messages which change the most.
Message residual: the difference between the current message value and the next incoming message value.
Vertex residual: the maximum of all incoming message residuals.
[Figure: a vertex with incoming message residuals 0.1 and 0.4; updating a vertex changes its outgoing messages, which in turn update the residuals of its neighbors.]
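A hedged sketch of the residual bookkeeping: a message residual measures how much a newly computed message differs from the stored one, and a vertex's scheduling priority is the largest residual among its incoming messages. Using the L-infinity distance is an illustrative choice, not necessarily the metric used in the talk.

```python
import numpy as np

def message_residual(new_msg, old_msg):
    return float(np.abs(new_msg - old_msg).max())   # L-infinity distance

def vertex_residual(v, neighbors, message_residuals):
    return max((message_residuals.get((u, v), 0.0) for u in neighbors[v]),
               default=0.0)                          # max over incoming messages
```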

16 Parallel Residual Splash
All processors share a priority queue of vertices ordered by residual (e.g. vertices 5, 91, 62, 22, 28) and a shared memory holding the messages. Each CPU repeatedly:
1. Pops the top vertex from the queue.
2. Builds a BFS tree of size s around it.
3. Updates the vertices in the tree in reverse BFS order, updating the priority queue as needed.
4. Updates the vertices in the tree in forward BFS order, again updating the priority queue as needed.
5. Returns the root vertex to the queue.
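A sketch of the loop each processor runs in parallel Residual Splash: pop the highest-residual vertex from the shared queue, Splash around it, and push it back with its recomputed priority. The queue API, the Splash size s, and the termination test are simplifying assumptions for illustration.

```python
def residual_splash_worker(queue, neighbors, send_messages, vertex_residual,
                           splash_size, tolerance):
    while True:
        priority, v = queue.pop_max()             # highest-residual vertex
        if priority < tolerance:                  # every residual is below threshold
            queue.push(priority, v)               # put it back and stop
            return
        splash(v, neighbors, send_messages, splash_size)
        queue.push(vertex_residual(v), v)         # reschedule with new priority
```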

17 Residual Splash Running Time
Theorem: For an arbitrary chain graphical model with n vertices and p processors (p ≤ n), and a particular initial residual schedule, the Residual Splash algorithm computes a τε-approximation in the running time shown on the slide. Using random initial priorities, the Residual Splash algorithm computes a τε-approximation in a running time with an additional log(p) factor. We suspect that the log(p) factor is not tight.

18 Overall Performance: Non-uniform Complexity
[Figure: true and predicted images with difficult and easy regions, panels (1)-(6) showing successive execution phases, and a log-scale plot of total updates by region difficulty and execution phase.]

19 Experimental Setup
Protein side chain prediction: predict protein side chain positions [Chen Yanover and Yair Weiss, Approximate Inference and Protein Folding, NIPS 2002]; 276 proteins; hundreds of variables per protein with arity up to 79; average degree of 20.
Video popup: an extension of Make3D [ref] to videos, with edges connecting pixels across frames; depths discretized to 40 levels; 500K vertices; a 3D grid MRF of size 107x86x60.
[Figure: an example movie frame, stereo images, the predicted depth map, and the resulting 3D movie (anaglyph).]
Software implementation: optimized GNU C++ using POSIX threads with a MATLAB wrapper. www.select.cs.cmu.edu/code

20 Protein Results
Experiments performed on an 8-core AMD Opteron 2384 processor at 2.7 GHz with 32 GB of RAM.

21 3D-Video Results
Experiments performed on an 8-core AMD Opteron 2384 processor at 2.7 GHz with 32 GB of RAM.

22 Conclusions and Future Work
The trivially parallel MapReduce algorithm is inefficient, and approximation can lead to increased parallelism. We provided a new parallel inference algorithm which performs optimally on chain graphs and generalizes to loopy graphs, and demonstrated superior performance on several real-world tasks.
Future work: a cluster-scale factor graph extension is under review; extend the running time bounds to arbitrary cyclic graphical models; efficient parallel parameter learning.

23 Acknowledgements
David O'Hallaron and Jason Campbell from Intel Research Pittsburgh, who provided guidance in algorithm and task development and access to the BigData multi-core cluster.
Funding provided by: ONR Young Investigator Program Grant N00014-08-1-0752; ARO under MURI W911NF0810242; NSF Grants NeTS-NOSS and CNS-0625518; AT&T Labs Fellowship Program.


