1 Task and Data Parallelism: Real-World Examples Sasha Goldshtein | SELA Group

2 Why This Talk?
Multicore machines have been a cheap commodity for >10 years
Adoption of concurrent programming is still slow
Patterns and best practices are scarce
We discuss the APIs first… …and then turn to examples, best practices, and tips

3 Task Parallel Library Evolution
The Future: DataFlow in .NET 4.5 (NuGet)
2012: Augmented with language support (await, async methods)
2010: Released in full glory with .NET 4.0
2008: Incubated for 3 years as “Parallel Extensions for .NET”

4 Tasks
A task is a unit of work
May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
Much more than threads, and yet much cheaper

    var t = Task.Factory.StartNew(() => DnaSimulation(…));
    t.ContinueWith(r => Show(r.Exception),
        TaskContinuationOptions.OnlyOnFaulted);
    t.ContinueWith(r => Show(r.Result),
        TaskContinuationOptions.OnlyOnRanToCompletion);
    DisplayProgress();

    try //The C# 5.0 version
    {
        var task = Task.Run(DnaSimulation);
        DisplayProgress();
        Show(await task);
    }
    catch (Exception ex)
    {
        Show(ex);
    }

5 Parallel Loops
Ideal for parallelizing work over a collection of data
Easy porting of for and foreach loops
Beware of inter-iteration dependencies! (See the sketch below.)

    Parallel.For(0, 100, i => {
        ...
    });

    Parallel.ForEach(urls, url => {
        webClient.Post(url, options, data);
    });
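As a minimal sketch of such a dependency (not from the slides; the input array is hypothetical), a prefix sum cannot be ported to Parallel.For as-is, because iteration i reads the result of iteration i - 1:

    using System.Threading.Tasks;

    int[] values = { 3, 1, 4, 1, 5, 9, 2, 6 };        // hypothetical input
    int[] prefixSums = new int[values.Length];

    // Sequential prefix sum — correct, but inherently ordered:
    prefixSums[0] = values[0];
    for (int i = 1; i < values.Length; ++i)
        prefixSums[i] = prefixSums[i - 1] + values[i];

    // Naive parallel port — WRONG: iterations race, prefixSums[i - 1]
    // may not have been written yet when iteration i reads it.
    //Parallel.For(1, values.Length, i =>
    //    prefixSums[i] = prefixSums[i - 1] + values[i]);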

6 Parallel LINQ
Mind-bogglingly easy parallelization of LINQ queries
Can introduce ordering into the pipeline, or preserve the order of the original elements (see the sketch below)

    var query = from monster in monsters.AsParallel()
                where monster.IsAttacking
                let newMonster = SimulateMovement(monster)
                orderby newMonster.XP
                select newMonster;
    query.ForAll(monster => Move(monster));
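A minimal sketch (not from the slides) of preserving source order with AsOrdered; the input sequence is hypothetical:

    using System.Linq;

    int[] numbers = Enumerable.Range(1, 20).ToArray();   // hypothetical input

    // AsOrdered() tells PLINQ to preserve the source order in the results,
    // at some cost in buffering; without it the output order is arbitrary.
    var squares = numbers.AsParallel()
                         .AsOrdered()
                         .Select(n => n * n)
                         .ToList();                       // comes back as 1, 4, 9, …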

7 Measuring Concurrency
Parallelizing code is worthless if you can’t measure the results
Visual Studio Concurrency Visualizer to the rescue
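Before reaching for the profiler, a crude baseline helps. A minimal sketch (not from the slides; 'items' and 'ProcessItem' are hypothetical stand-ins for the real workload) of timing a sequential run against a parallel one with Stopwatch:

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical workload:
    int[] items = Enumerable.Range(0, 1000).ToArray();
    Action<int> ProcessItem = item => Thread.SpinWait(200000);

    var sw = Stopwatch.StartNew();
    foreach (var item in items) ProcessItem(item);          // sequential baseline
    Console.WriteLine("Sequential: " + sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    Parallel.ForEach(items, item => ProcessItem(item));     // parallel version
    Console.WriteLine("Parallel:   " + sw.ElapsedMilliseconds + " ms");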

8 Recursive Parallelism Extraction
Divide-and-conquer algorithms are often parallelized through the recursive call
Need to be careful with the parallelization threshold and watch out for dependencies

    void FFT(float[] src, float[] dst, int n, int r, int s) {
        if (n == 1) {
            dst[r] = src[r];
        } else {
            FFT(src, dst, n/2, r, s*2);
            FFT(src, dst, n/2, r+s, s*2);
            //Combine the two halves in O(n) time
        }
    }

    //Parallelizing the two recursive calls:
    Parallel.Invoke(
        () => FFT(src, dst, n/2, r, s*2),
        () => FFT(src, dst, n/2, r+s, s*2)
    );

DEMO
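A minimal sketch (not from the slides) of applying a parallelization threshold to a divide-and-conquer routine, so that small sub-problems stay sequential; the cutoff value is an assumption to be tuned by measurement:

    using System.Threading.Tasks;

    static class RecursiveSum
    {
        const int Threshold = 4096;   // assumed cutoff; tune by measurement

        public static long Sum(int[] a, int lo, int hi)
        {
            if (hi - lo <= Threshold)            // below the threshold: stay sequential
            {
                long s = 0;
                for (int i = lo; i < hi; ++i) s += a[i];
                return s;
            }
            int mid = lo + (hi - lo) / 2;
            long left = 0, right = 0;
            Parallel.Invoke(                     // fork: run the two halves in parallel
                () => left = Sum(a, lo, mid),
                () => right = Sum(a, mid, hi));
            return left + right;                 // join: combine the partial results
        }
    }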

9 Symmetric Data Processing
For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution

    Parallel.For(0, image.Rows, i => {
        for (int j = 0; j < image.Cols; ++j) {
            destImage.SetPixel(i, j, PixelBlur(image, i, j));
        }
    });

Inter-iteration dependencies complicate things (think in-place blur — see the sketch below)
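A minimal sketch (not from the slides, using a float[,] instead of the slide's Image type) of why an in-place blur creates a dependency and how double buffering removes it:

    using System.Threading.Tasks;

    float[,] image = new float[480, 640];                  // hypothetical grayscale image

    // In-place blur — WRONG in parallel: blurring pixel (i, j) reads neighbors in
    // rows i-1 and i+1, which other iterations may already have overwritten.
    //Parallel.For(1, image.GetLength(0) - 1, i => {
    //    for (int j = 0; j < image.GetLength(1); ++j)
    //        image[i, j] = (image[i - 1, j] + image[i, j] + image[i + 1, j]) / 3f;
    //});

    // Double buffering removes the dependency: read only the untouched source,
    // write only a separate destination.
    float[,] dest = new float[480, 640];
    Parallel.For(1, image.GetLength(0) - 1, i => {
        for (int j = 0; j < image.GetLength(1); ++j)
            dest[i, j] = (image[i - 1, j] + image[i, j] + image[i + 1, j]) / 3f;
    });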

10 Uneven Work Distribution
With non-uniform data items, use custom partitioning or manual distribution
Checking primality: 7 is easier to check than 10,320,647

Not ideal (and complicated!):
    var work = Enumerable.Range(0, Environment.ProcessorCount)
        .Select(n => Task.Run(() =>
            CountPrimes(start + chunk*n, start + chunk*(n+1))));
    Task.WaitAll(work.ToArray());

Faster (and simpler!):
    Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
        chunk => CountPrimes(chunk.Item1, chunk.Item2));

DEMO
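The slide code assumes a CountPrimes helper (and the aggregation slide later assumes IsPrime); a minimal sketch of what they might look like, using simple trial division:

    static class Primes
    {
        // Trial-division primality test (hypothetical — assumed but not shown on the slides).
        public static bool IsPrime(int n)
        {
            if (n < 2) return false;
            for (int d = 2; (long)d * d <= n; ++d)
                if (n % d == 0) return false;
            return true;
        }

        // Number of primes in the half-open range [from, to), as used by the slide code.
        public static int CountPrimes(int from, int to)
        {
            int count = 0;
            for (int i = from; i < to; ++i)
                if (IsPrime(i)) ++count;
            return count;
        }
    }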

11 Complex Dependency Management
With dependent work units, parallelization is no longer automatic and simple
Must extract all dependencies and incorporate them into the algorithm
Typical scenarios: 1D loops, dynamic programming algorithms
Levenshtein string edit distance: each task depends on two predecessors, and the computation proceeds as a wavefront

    C = x[i-1] == y[j-1] ? 0 : 1;
    D[i, j] = min(
        D[i-1, j] + 1,
        D[i, j-1] + 1,
        D[i-1, j-1] + C);

(Wavefront diagram: the computation sweeps diagonally from cell (0,0) toward cell (m,n).)

DEMO
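A minimal sketch of the wavefront idea (not the demo code): cells on the same anti-diagonal have no dependencies on each other, so each diagonal can be filled by a parallel loop once the previous diagonals are done:

    using System;
    using System.Threading.Tasks;

    static class EditDistance
    {
        // Wavefront parallelization: cells with i + j == d depend only on
        // earlier diagonals, so each diagonal is itself a parallel loop.
        public static int Compute(string x, string y)
        {
            int m = x.Length, n = y.Length;
            var D = new int[m + 1, n + 1];
            for (int i = 0; i <= m; ++i) D[i, 0] = i;
            for (int j = 0; j <= n; ++j) D[0, j] = j;

            for (int d = 2; d <= m + n; ++d)
            {
                int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
                Parallel.For(iMin, iMax + 1, i =>
                {
                    int j = d - i;
                    int c = x[i - 1] == y[j - 1] ? 0 : 1;
                    D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1, D[i, j - 1] + 1),
                                       D[i - 1, j - 1] + c);
                });
            }
            return D[m, n];
        }
    }

A realistic version would process blocks of cells per task rather than single cells to reduce scheduling overhead, but the dependency structure is the same.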

12 Synchronization => Aggregation
Excessive synchronization brings parallel code to its knees
Try to avoid shared state, or minimize access to it
Aggregate thread- or task-local state and merge it later

    Parallel.ForEach(
        Partitioner.Create(Start, End, ChunkSize),
        () => new List<int>(),                  //initial local state
        (range, pls, localPrimes) => {          //aggregator
            for (int i = range.Item1; i < range.Item2; ++i)
                if (IsPrime(i)) localPrimes.Add(i);
            return localPrimes;
        },
        localPrimes => {                        //combiner
            lock (primes)
                primes.AddRange(localPrimes);
        });

DEMO

13 Creative Synchronization
We implement a collection of stock prices, initialized with 10^5 name/price pairs
Access rate: 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
Many reader threads, many writer threads
Store the data in two dictionaries: safe and unsafe

    GET(key):
        if safe contains key then return safe[key]
        lock { return unsafe[key] }

    PUT(key, value):
        if safe contains key then safe[key] = value
        lock { unsafe[key] = value }
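A minimal C# sketch of this pattern (my reading of the pseudocode above, not the demo code): the "safe" dictionary is built once and its key set never changes, so reads and overwrites of existing keys take no lock; only the rare "add" path goes through the locked "unsafe" dictionary. Note that treating an existing-key overwrite as lock-free relies on it not changing the dictionary's structure, which is an implementation detail rather than a documented guarantee:

    using System.Collections.Generic;

    class StockPrices
    {
        // "safe": built once with every known symbol; its key set never changes.
        private readonly Dictionary<string, float> _safe;
        // "unsafe": rare newly-added symbols; always accessed under a lock.
        private readonly Dictionary<string, float> _unsafe = new Dictionary<string, float>();
        private readonly object _lock = new object();

        public StockPrices(IDictionary<string, float> initialPrices)
        {
            _safe = new Dictionary<string, float>(initialPrices);
        }

        public float Get(string key)
        {
            float value;
            if (_safe.TryGetValue(key, out value)) return value;   // lock-free fast path
            lock (_lock) return _unsafe[key];
        }

        public void Put(string key, float value)
        {
            if (_safe.ContainsKey(key)) { _safe[key] = value; return; }  // atomic 32-bit write
            lock (_lock) _unsafe[key] = value;
        }
    }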

14 Lock-Free Patterns (1)
Try to avoid Windows synchronization and use hardware synchronization
Primitive operations such as Interlocked.Increment, Interlocked.Add, Interlocked.CompareExchange
The retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms, e.g. ConcurrentQueue and ConcurrentStack

    int InterlockedMultiply(ref int x, int y)
    {
        int t, r;
        do
        {
            t = x;          //snapshot the current value
            r = t * y;      //compute the new value
        }
        //CompareExchange(location, new value, comparand):
        //store r into x only if x still equals the snapshot t
        while (Interlocked.CompareExchange(ref x, r, t) != t);
        return r;
    }
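Following the same retry pattern, a minimal sketch (not from the slides) of a lock-free "store the maximum" operation:

    using System.Threading;

    static class LockFree
    {
        // Retry until our candidate is no longer larger, or CompareExchange
        // swaps it in against an unchanged snapshot of the target.
        public static void InterlockedMax(ref int target, int candidate)
        {
            int snapshot;
            do
            {
                snapshot = target;
                if (candidate <= snapshot) return;   // current value already >= candidate
            }
            while (Interlocked.CompareExchange(ref target, candidate, snapshot) != snapshot);
        }
    }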

15 Lock-Free Patterns (2)
User-mode spinlocks (the SpinLock class) can replace locks that you acquire very often and that protect tiny computations

    //Sample implementation of SpinLock
    class __DontUseMe__SpinLock
    {
        private volatile int _lck;
        public void Enter()
        {
            while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0) ;
        }
        public void Exit()
        {
            Thread.MemoryBarrier();
            _lck = 0;
        }
    }
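For reference, a minimal sketch (not from the slides) of using the real System.Threading.SpinLock instead of a home-grown one:

    using System.Threading;

    class Counter
    {
        private SpinLock _lock = new SpinLock(enableThreadOwnerTracking: false);
        private int _value;

        public void Increment()
        {
            bool taken = false;
            try
            {
                _lock.Enter(ref taken);   // spin briefly instead of blocking
                _value++;                 // the protected region must stay tiny
            }
            finally
            {
                if (taken) _lock.Exit();
            }
        }
    }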

16 Miscellaneous Tips (1)
Don’t mix several concurrency frameworks in the same process
Some parallel work is best organized in pipelines – TPL DataFlow (see the sketch below)
Some parallel work can be offloaded to the GPU – C++ AMP

    void vadd_exp(float* x, float* y, float* z, int n) {
        array_view<const float> avX(n, x), avY(n, y);
        array_view<float> avZ(n, z);
        avZ.discard_data();
        parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
            avZ[i] = avX[i] + fast_math::exp(avY[i]);
        });
        avZ.synchronize();
    }

(Dataflow pipeline diagram: BroadcastBlock → TransformBlock → ActionBlock)
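A minimal TPL Dataflow sketch (not from the slides) of the pipeline shown in the diagram; the payloads of the blocks are hypothetical stand-ins for real work:

    using System;
    using System.Text;
    using System.Threading.Tasks.Dataflow;   // System.Threading.Tasks.Dataflow NuGet package

    // BroadcastBlock offers each posted item to every linked target.
    var broadcast = new BroadcastBlock<string>(url => url);
    // TransformBlock turns inputs into outputs (here: a stand-in for a real download step).
    var download = new TransformBlock<string, byte[]>(url => Encoding.UTF8.GetBytes(url));
    // ActionBlock consumes the results at the end of the pipeline.
    var store = new ActionBlock<byte[]>(data => Console.WriteLine("Got {0} bytes", data.Length));

    var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    broadcast.LinkTo(download, linkOptions);
    download.LinkTo(store, linkOptions);

    broadcast.Post("http://example.org/");
    broadcast.Complete();
    store.Completion.Wait();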

17 Miscellaneous Tips (2)
Invest in SIMD parallelization of heavy math or data-parallel algorithms

    START: movups xmm0, [esi+4*ecx]
           addps  xmm0, [edi+4*ecx]
           movups [ebx+4*ecx], xmm0
           sub    ecx, 4
           jns    START

Make sure to take cache effects into account, especially on MP systems
(Diagram: two cores each hold the same cache line in Shared state; when one thread writes CL[0] = 15, its core issues a read-for-ownership (RFO) and the other core’s copy is set to Invalid.)
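A minimal sketch (not from the slides) of the cache effect being warned about — false sharing — where per-thread counters that share a cache line keep invalidating each other, and spacing them out removes the contention:

    using System;
    using System.Threading.Tasks;

    // Adjacent counters share a cache line, so parallel increments by different
    // threads repeatedly invalidate each other's copy of that line.
    long[] counters = new long[Environment.ProcessorCount];
    Parallel.For(0, counters.Length, c =>
    {
        for (int i = 0; i < 100000000; ++i)
            counters[c]++;                 // hot write to a shared cache line
    });

    // Spacing the counters one cache line (64 bytes = 8 longs) apart fixes it.
    long[] padded = new long[Environment.ProcessorCount * 8];
    Parallel.For(0, Environment.ProcessorCount, c =>
    {
        for (int i = 0; i < 100000000; ++i)
            padded[c * 8]++;               // each counter now lives on its own line
    });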

18 Conclusions
Avoid shared state and synchronization
Parallelize judiciously and apply thresholds
Measure and understand performance gains or losses
Concurrency and parallelism are still hard
A body of best practices, tips, patterns, and examples is being built:
    Microsoft Parallel Computing Center: http://msdn.microsoft.com/concurrency
    Concurrent Programming on Windows (Joe Duffy): http://s.sashag.net/SLwEhI
    Parallel Programming with .NET: http://parallelpatterns.codeplex.com/

19 Thank You!
Sasha Goldshtein | SELA Group
@goldshtn | blog.sashag.net | www.linkedin.com/in/goldshtn
Two free Pro .NET Performance eBooks! Send an email to sashag@sela.co.il with an answer to the following question: Which class can be used to generate markers in the Visual Studio Concurrency Visualizer reports?

