1 Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism Nikolas Ioannou, Marcelo Cintra School of Informatics University of Edinburgh

2 Intl. Symp. on Microarchitecture - December 2011
Introduction
Multi-cores and many-cores are here to stay (Source: Intel)

3 Introduction
Multi-cores and many-cores are here to stay
Parallel programming is essential to realize their potential
Focus on coarse-grain parallelism
Weak or no scaling of some parallel applications
Can we exploit under-utilized cores to complement coarse-grain parallelism?
–Nested parallelism in multi-threaded applications
–Exploit it using implicit speculative parallelism

4 Contributions
Evaluation of implicit speculative parallelism on top of explicit parallelism to improve scalability:
–Improve scalability by 40% on avg.
–Same energy consumption
Detailed analysis of multithreaded scalability:
–Performance bottlenecks
–Behavior on different input datasets
Auto-tuning to dynamically select the number of explicit and implicit threads

5 Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions

6 Bottlenecks: Large Critical Sections
[Figure: execution timeline of threads T0–T3]
Integer Sort (IS), NASPB

7 Bottlenecks: Load Imbalance
[Figure: execution timeline of threads T0–T3]
RADIOSITY, SPLASH-2
Can we use these cores to accelerate this app.?

8 Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
–Low power nested parallelism
Conclusions

9 Proposal
Programming:
–Users explicitly parallelize code
–Trade off development time for performance gains
Architecture and Compiler:
–Exploit fine-grain parallelism on top of user threads
–Thread-Level Speculation (TLS) within each user thread
Hardware:
–Support both explicit and implicit threads simultaneously in a nested fashion

10 Proposal
#pragma omp parallel for
for(j = 0; j < M; ++j) {
  …
  for(i = 0; i < N; ++i) {
    … = A[L[i]] + …
    …
    A[K[i]] = …
  }
  …
}
[Figure: explicit threads T0 … TK, TL, TM; within TK and TL, speculative threads TK,i … TK,i+3 and TL,i … TL,i+3 run the inner loop]
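The loop nest on this slide can be made concrete with a small compilable sketch. The array sizes, the index-array contents, and the update expression are hypothetical placeholders; the inner loop is the region the implicit speculative threads would target, because its dependences through L and K are unknown at compile time:

```c
#include <assert.h>

#define M 4   /* outer (explicitly parallel) iterations - hypothetical */
#define N 8   /* inner (speculation candidate) iterations - hypothetical */

static double A[M * N];          /* each j works on its own slice here */
static const int L[N] = {0, 1, 2, 3, 4, 5, 6, 7};
static const int K[N] = {1, 2, 3, 4, 5, 6, 7, 0};

void kernel(void) {
    /* explicit user-level threads: one chunk of j iterations each */
    #pragma omp parallel for
    for (int j = 0; j < M; ++j) {
        double *a = &A[j * N];
        /* candidate region for implicit speculative threads: the
           compiler cannot prove a[L[i]] and a[K[i]] never overlap,
           so the iterations must run speculatively, not in parallel */
        for (int i = 0; i < N; ++i) {
            double tmp = a[L[i]] + 1.0;
            a[K[i]] = tmp;
        }
    }
}
```

Because K[i] here equals L[i+1], iteration i+1 reads the value iteration i wrote: exactly the kind of cross-iteration RAW dependence that forces speculation instead of plain parallelization of the inner loop.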

11 Proposal: Many-core Architecture
Many-core partitioned in clusters (tiles)
Coherence (MESI):
–Snooping coherence within cluster
–Directory coherence across clusters
Support for TLS only within cluster:
–Snooping TLS protocol
–Speculative buffering in L1 data caches

12 Proposal: Many-core Architecture
[Figure: 32-core layout with cores T0–T31 grouped into clusters C0–C3; per-core IC/DC L1 caches, shared L2 cache, directory/router, memory controller]

13 Complementing Coarse-Grain Parallelism
[Figure: timelines comparing 4 explicit threads (T0–T3) against 2x explicit threads (T0–T7)]

14 Complementing Coarse-Grain Parallelism
[Figure: timelines comparing 4 explicit threads (T0–T3) against 4 explicit threads plus 4 implicit speculative threads (4ETs + 4ISTs)]

17 Expected Speedup Behavior
[Figure]

18 Proposal: Auto-Tuning the Thread Count
Find the scalability tipping point dynamically
Choose whether to employ implicit threads
Simple hill-climbing approach
Applicable to OpenMP applications amenable to Dynamic Concurrency Throttling (DCT) [Curtis-Maury PACT08]
Developed a prototype in the Omni OpenMP system

19 Auto-tuning example (learning step 1)
#pragma omp parallel for
for(j = 0; j < M; ++j) { … for(i = 0; i < N; ++i) { … = A[L[i]] + … … A[K[i]] = … } … }
omp parallel region i detected, first time:
Can the iteration count be computed statically, and is it less than the max core count? Yes -> set initial Tcount to 32 (M = 32)
Measure execution time t_i^1

20 Auto-tuning example (learning step 2)
omp parallel region i detected:
Set Tcount to next value (16)
Measure execution time t_i^2
t_i^2 < t_i^1 -> continue exploration

21 Auto-tuning example (learning step 3)
omp parallel region i detected:
Set Tcount to next value (8)
Measure execution time t_i^3
t_i^3 > t_i^2 -> stop exploration

22 Auto-tuning example (learning step 4)
omp parallel region i detected:
Use Tcount = 16, no further exploration
Set TLS to 4-way
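The four learning steps can be sketched as a simple hill-climbing loop. The timing model below is a hypothetical stand-in for the measured region times t_i^1..t_i^3 in the example:

```c
#include <assert.h>

/* Hypothetical measured execution time of parallel region i at a
   given explicit thread count (stands in for real measurements). */
static double region_time(int tcount) {
    switch (tcount) {
    case 32: return 1.30;   /* t_i^1: over-subscribed, poor scaling */
    case 16: return 1.00;   /* t_i^2: the scalability tipping point */
    case 8:  return 1.15;   /* t_i^3: too few explicit threads      */
    default: return 2.00;
    }
}

/* Hill-climb downward from the initial Tcount: keep halving while the
   measured time improves, stop as soon as it gets worse, and keep the
   last improving count. Cores freed this way can then host implicit
   speculative (TLS) threads under each explicit thread. */
int tune_tcount(int initial_tcount) {
    int best = initial_tcount;
    double best_time = region_time(best);
    for (int t = initial_tcount / 2; t >= 1; t /= 2) {
        double time = region_time(t);
        if (time >= best_time)      /* t_i^k > t_i^(k-1): stop */
            break;
        best = t;
        best_time = time;
    }
    return best;
}
```

With this model the tuner settles on Tcount = 16, matching the example; the remaining cores can then run 4-way TLS under each explicit thread.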

23 Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions

24 Evaluation Methodology
SESC simulator, extended to model our scheme
Architecture:
–Core: 4-issue OoO superscalar, 96-entry ROB, 3GHz; 32KB 4-way DL1 cache; 32KB 2-way IL1 cache; 16Kbit hybrid branch predictor
–Tile/System: 128 cores partitioned in 2-way or 4-way tiles (both evaluated); shared 8MB 8-way L2 cache, 64 MSHRs; directory: full-bit-vector sharer list; interconnect: grid, 64B links; 48GB/s to main memory

25 Evaluation Methodology
Benchmarks:
–12 workloads from PARSEC 2.1, SPLASH-2, NASPB
–Simulate parallel region to completion
Compilation:
–MIPS binaries generated using GCC
–Speculation added automatically through a source-to-source compiler
–Selection of speculation regions through manual profiling
Power:
–CACTI 4.2 and Wattch

26 Evaluation Methodology
Alternative schemes compared against:
–Core Fusion [Ipek ISCA07]: dynamic combination of cores to deal with lowly-threaded apps; approximated through wide 8-issue cores with all core resources doubled and no latency increase => upper bound
–Frequency Boost, inspired by Turbo Boost [Intel08]: for each idle core, one other core gains a frequency boost of 800MHz with a 200mV voltage increase (same power cap)
All these schemes shift resources to a subset of cores in order to improve performance

27 Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions

28 Bottom Line
Speedup over best scalability point
TLS-4: 41% avg; TLS-2: 27% avg

29 Energy
Showing best performing point for each scheme
Energy consumption slightly lower on avg

30 Energy
Spending less time in busy synchronization

31 Energy
High misspeculation: higher energy

32 Energy
Little synchronization: higher energy

33 Serial/Critical Sections: IS (NASPB)

34 Load Imbalance: radiosity (SPLASH-2)

35 Synchronization Heavy: ocean (SPLASH-2)

36 Coarse-Grain Partitioning: swaptions (PARSEC)

37 Poor Static Partitioning: sp (NASPB)

38 Effect of Dataset Size
Unchanged behavior: cholesky (also: canneal, ocean, ft, is, sp)

39 Effect of Dataset Size
Improved scalability, but TLS boost remains: swaptions (also: bodytrack, radiosity, ep)

40 Effect of Dataset Size
Improved scalability, lessened TLS boost: streamcluster

41 Effect of Dataset Size
Worse scalability, even better TLS boost: water

42 Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions

43 Conclusions
Multi-cores and many-cores are here to stay
–Parallel programming essential to exploit new hardware
–Some coarse-grain parallel programs do not scale
–Enough nested parallelism to improve scalability
Proposed speculative parallelization through implicit speculative threads on top of explicit threads:
–Significant scalability improvement of 40% on avg
–No increase in total energy consumption
–Presented an auto-tuning mechanism that dynamically chooses the number of threads and performs within 6% of the oracle

44 Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism Nikolas Ioannou, Marcelo Cintra School of Informatics University of Edinburgh

45 Related Work
[von Praun PPoPP07] Implicit ordered transactions
[Kim Micro10] Speculative parallel-stage decoupled software pipelining
[Ooi ICS01] Multiplex
[Madriles ISCA09] Anaphase
[Rajwar MICRO01], [Martinez ASPLOS02] Speculative lock elision, speculative synchronization
[Moravan ASPLOS06], etc. Nested transactional memory

46 Bibliography
[Intel08] Intel Corp. Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors. White Paper, 2008
[Ipek ISCA07] E. Ipek et al. Core fusion: Accommodating software diversity in chip multiprocessors. ISCA 2007
[von Praun PPoPP07] C. von Praun et al. Implicit parallelism with ordered transactions. PPoPP 2007
[Kim Micro10] Scalable speculative parallelization in commodity clusters. MICRO 2010
[Ooi ICS01] C.-L. Ooi et al. Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor. ICS 2001
[Madriles ISCA09] C. Madriles et al. Boosting single-thread performance in multi-core systems through fine-grain multi-threading. ISCA 2009

47 Bibliography
[Rajwar MICRO01] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. MICRO 2001
[Martinez ASPLOS02] J. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications. ASPLOS 2002
[Moravan ASPLOS06] Supporting nested transactional memory in LogTM. ASPLOS 2006
[Curtis-Maury PACT08] Prediction models for multi-dimensional power-performance optimization on many cores. PACT 2008

48 Benchmark details [table]

49 Fetched Instructions [figure]

50 Failed Speculation [figure]

51 Serial/Critical Sections: bodytrack (PARSEC)

52 Background: Speculative Parallelization
Assume no dependences and execute threads in parallel
Track data accesses
Detect violations
Squash offending threads and restart them
for(i = 0; i < N; ++i) {
  … = A[L[i]] + …
  …
  A[K[i]] = …
}

53 Background: Speculative Parallelization
[Figure: speculative threads TJ, TJ+1, TJ+2 executing iterations J, J+1, J+2 over time]
iteration J: … = A[4] + …; A[5] = …
iteration J+1: … = A[2] + …; A[2] = …
iteration J+2: … = A[5] + …; A[6] = …
The ld A[5] in iteration J+2 executes before the st A[5] from iteration J -> RAW violation, TJ+2 is squashed and restarted
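The violation in this example can be illustrated with a minimal read-set tracker. This is a deliberate simplification: real TLS hardware tracks accesses per cache line in the L1 data caches, as on the architecture slide, and here thread ids double as speculation order:

```c
#include <assert.h>

#define NTHREADS 4    /* speculative threads, ordered by thread id */
#define NADDRS   16   /* hypothetical tiny address space */

static unsigned char read_set[NTHREADS][NADDRS]; /* 1 = tid read addr */

/* A speculative load records the address in the thread's read set. */
void spec_load(int tid, int addr) { read_set[tid][addr] = 1; }

/* A store by thread `tid` checks all more-speculative threads: any
   thread that already read `addr` consumed a stale value (a RAW
   violation) and must be squashed. Returns the first such thread,
   or -1 when no violation occurred. */
int spec_store(int tid, int addr) {
    for (int t = tid + 1; t < NTHREADS; ++t)
        if (read_set[t][addr])
            return t;
    return -1;
}
```

Replaying the slide's accesses (thread 0 = iteration J loads A[4], thread 1 loads A[2], thread 2 loads A[5]) and then issuing thread 0's store to A[5] flags thread 2 for squash and restart, while thread 2's own store to A[6] conflicts with no one.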

54 Energy
Showing best performing point for each scheme

55 Bottom Line
Speedup over best scalability point

56 Auto-tuning OpenMP apps
Performs within 6% of static oracle

