1 PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The University of Texas at Austin Laboratory for Computer Architecture

2 Outline (4/16/2015, D. Kaseridis, Laboratory for Computer Architecture)
- Brief Description of UltraSPARC T1 Architecture
- Analysis Objectives / Methodology
- Analysis of Results
  - Interference on Shared Resources
  - Scaling of Multiprogrammed Workloads
  - Scaling of Multithreaded Workloads

3 UltraSPARC T1 (Niagara)
- A multithreaded processor that combines CMP and SMT into a CMT design
- 8 cores, each handling 4 hardware context threads → 32 active hardware context threads
- Simple in-order pipeline per core, with no branch prediction unit
- Optimized for multithreaded performance, i.e. throughput: it hides memory and pipeline stalls/latencies by scheduling other available threads, with a zero-cycle thread-switch penalty

4 UltraSPARC T1 Core Pipeline
- A thread group shares the L1 cache, TLBs, execution units, pipeline registers, and data path
- Blue areas in the slide figure are replicated copies, one per hardware context thread

5 Objectives
- Purpose
  - Analyze the interference of multiple executing threads on the shared resources of Niagara
  - Assess the scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads
- Methodology
  - Interference on shared resources (SPEC CPU2000)
  - Scaling of a multiprogrammed workload (SPEC CPU2000)
  - Scaling of a multithreaded workload (SPECjbb2005)

6 Analysis Objectives / Methodology

7 Methodology (1/2)
- On-chip performance counters were used for real, accurate measurements on Niagara
- Solaris 10 tools: cpustat and cputrack to read the counters, psrset to bind processes to hardware threads
- 2 counters per hardware thread, one of which is dedicated to the instruction count
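A rough sketch of how the Solaris 10 tool invocations above could be scripted; the counter event names (`DC_miss`, `Instr_cnt`) and all IDs are assumptions for illustration, not taken from the slides.

```python
# Sketch of driving the Solaris 10 tools named above (cputrack, psrset).
# Event names (DC_miss, Instr_cnt) and IDs are assumed for illustration;
# the available events would be listed by `cpustat -h` on the T1 machine.

def psrset_bind_cmd(psrset_id: int, pid: int) -> list[str]:
    """Command to bind a running process to a processor set
    (i.e. pin it to one hardware context thread)."""
    return ["psrset", "-b", str(psrset_id), str(pid)]

def cputrack_cmd(event: str, benchmark_argv: list[str]) -> list[str]:
    """Command to run a benchmark while counting `event` on pic0 and
    the dedicated instruction counter on pic1."""
    return ["cputrack", "-c", f"pic0={event},pic1=Instr_cnt"] + benchmark_argv

cmd = cputrack_cmd("DC_miss", ["./crafty"])
# On a Solaris host this would be launched with subprocess.run(cmd).
```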

8 Methodology (2/2)
- Niagara has only one FP unit → only the integer benchmarks were considered
- The Performance Counter Unit works at the granularity of a single hardware context thread
  - No way to break down the effects of multiple software threads per hardware thread; software profiling tools are too invasive
- Only pairs of benchmarks were considered, to allow correlation of benchmarks with events
- Many iterations were run and the average behavior was used
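The pair-based methodology above can be enumerated mechanically. The benchmark list below is the standard SPEC CINT2000 suite; whether self-pairs (a benchmark co-run with itself) were included is an assumption here, since the slides do not say.

```python
from itertools import combinations_with_replacement

# The 12 SPEC CINT2000 benchmarks (the slides restrict the study to the
# integer suite because Niagara has a single FP unit).
CINT2000 = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
            "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]

# All unordered pairs, including self-pairs (an assumption; the slides
# do not specify whether a benchmark was ever paired with itself).
pairs = list(combinations_with_replacement(CINT2000, 2))
print(len(pairs))  # 78 pairs from 12 benchmarks
```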

9 Analysis of Results
- Interference on shared resources
- Scaling of a multiprogrammed workload
- Scaling of a multithreaded workload

10 Interference on Shared Resources
Two modes were considered:
- "Same core" mode executes both benchmarks of a pair on the same core
  - Sharing of pipeline, TLBs, and L1 bandwidth → behaves more like an SMT
- "Two cores" mode executes each member of a pair on a different core
  - Sharing of L2 capacity/bandwidth and main memory → behaves more like a CMP

11 Interference, "same core" (1/2)
- On average, a 12% drop in IPC when running in a pair
- crafty, followed by twolf, showed the worst performance
- eon behaved best, keeping IPC close to the single-thread case

12 Interference, "same core" (2/2)
- D-cache misses increased 20% on average (15% excluding crafty)
- The worst D-cache misses were seen for vortex and perlbmk
- The pairs with the highest L2 miss ratios are not the ones with the largest IPC decrease
  - mcf and eon pairs showed more than 70% L2 misses
- Overall, a small performance penalty even when sharing the pipeline and L1/L2 bandwidth → the latency-hiding technique is promising

13 Interference, "two cores"
- A multiprogrammed workload with no data sharing stresses only the L2 and the shared communication buses
- On average, L2 misses are almost the same as in the "same core" case → the available resources are underutilized

14 Scaling of Multiprogrammed Workload
- A reduced benchmark pair set was used
- Scaling from 4 → 8 → 16 threads, with the configurations shown on the slide

15 Scaling of Multiprogrammed Workload
- "Same core" mode
- "Mixed" mode

16 Scaling of Multiprogrammed Workload, "same core"
- 4 → 8 case
  - IPC and data cache misses are not affected
  - L2 data misses increase but IPC does not → enough resources to run fully occupied; memory latency is hidden
- 8 → 16 case
  - More cores run the same benchmark → the same footprint and requests go to L2/main memory
  - Increased L2 requirements and shared-interconnect traffic decrease performance
(Slide charts: IPC ratio, DC misses ratio, L2 misses ratio)

17 Scaling of Multiprogrammed Workload, "mixed" mode
- Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
- Same behavior as the "same core" case for DC and L2 misses, with an average difference of 1%-2%
- Overall, for both modes:
  - Niagara showed that moving from 4 to 16 threads costs less than a 40% performance drop on average
  - Both modes showed that significantly increased L1 and L2 misses can be handled while favoring throughput
(Slide chart: IPC ratio)

18 Scaling of Multithreaded Workload
- SPECjbb2005 scaled from 1 up to 64 threads
- 1 → 8 threads: 1 thread mapped per core
- 8 → 16 threads: at most 2 threads per core
- 16 → 32 threads: up to 4 threads per core
- 32 → 64 threads: more threads than hardware contexts per core, so swapping is necessary
(Slide table: configuration used for SPECjbb2005)
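The mapping above can be sketched as a simple placement function for an 8-core, 4-contexts-per-core chip. The round-robin assignment is an assumption; the slides only give the resulting threads-per-core bounds.

```python
# Sketch of the thread-to-core mapping described above. Round-robin
# placement is an assumption; the slides state only the per-core limits.

CORES = 8
CONTEXTS_PER_CORE = 4  # hardware context threads per core on the T1

def threads_per_core(total_threads: int) -> list[int]:
    """How many software threads land on each core under round-robin."""
    base, extra = divmod(total_threads, CORES)
    return [base + (1 if c < extra else 0) for c in range(CORES)]

def needs_swapping(total_threads: int) -> bool:
    """Beyond 32 threads the OS must time-share hardware contexts."""
    return total_threads > CORES * CONTEXTS_PER_CORE

print(max(threads_per_core(16)))  # 2 threads per core
print(needs_swapping(64))         # True
```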

19 Scaling of Multithreaded Workload
(Slide chart: SPECjbb2005 score per warehouse, with the GC effect annotated)

20 Scaling of Multithreaded Workload
- Ratios are over the 8-thread case (1 thread per core)
- Instruction fetch and the DTLB are stressed the most
- The L1 data and L2 caches managed to scale even for more than 32 threads
(Slide chart annotation: GC effect)

21 Scaling of Multithreaded Workload
- Scaling of performance:
  - Near-linear scaling of almost 0.66 per thread, up to 32 threads → a 20x speedup at 32 threads
  - SMT with 2 threads/core gives on average a 1.8x speedup over the CMP configuration (region 1)
  - SMT with up to 4 threads/core gives a 1.3x and a 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
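The reported numbers can be sanity-checked with a simple linear model. The model form (a 1x baseline plus ~0.66 of ideal speedup per extra thread) is an illustration; only the 0.66 slope and the ~20x figure at 32 threads come from the measurements above.

```python
# Linear speedup model for the SPECjbb2005 scaling reported above:
# ~0.66 of ideal speedup gained per additional thread, up to the 32
# hardware contexts. The model form itself is an assumption.

PER_THREAD_SCALING = 0.66  # slope reported on the slide

def modeled_speedup(threads: int) -> float:
    """Speedup over 1 thread under the linear model (valid up to 32)."""
    assert 1 <= threads <= 32
    return 1.0 + PER_THREAD_SCALING * (threads - 1)

print(round(modeled_speedup(32), 1))  # 21.5, roughly the ~20x reported
```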

22 Conclusions
- Demonstrated interference on a real CMT system
- The long-latency-hiding technique is effective for L1 and L2 misses, and therefore could be a promising alternative to aggressive speculation
- Promising scaling, up to 20x for multithreaded workloads, with an average of 0.66x per thread
- The instruction fetch subsystem and the DTLBs are the most contended resources, followed by L2 cache misses

23 Q/A
Thank you... Questions?
The Laboratory for Computer Architecture web-site:

