
1 Decomposing Memory Performance: Data Structures and Phases
Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley
Department of Computer Sciences, The University of Texas at Austin

2 Memory hierarchy trends
Growing latency to main memory
Growing cache complexity
–More cache levels
–New mechanisms, optimizations
Growing application complexity
–Lots of abstraction
Application-system interactions are increasingly hard to predict

3 The solution: more fine-grained metrics
More insight within an application
More rigorous comparisons across applications
Potential applications:
–Hardware/software tuning
–Global hints for online phase detection
Our approach: data structure decomposition
–High-level, easy to understand
–Highlights important access patterns

4 ammp vs twolf: The tale of two applications
Conventional view: they're pretty similar
–IPC: 0.57 vs 0.51
–DL1 miss rate: 10% vs 9.5%
Access patterns
–Lots of pointer access in both...
–Mostly linked-list traversal

5 ammp vs twolf: Data structure decomposition
[Figure: DL1 misses (%) broken down by major data structure for each benchmark]

6 ammp vs twolf: Access patterns

twolf:
    i = rand()
    t1 = b[c[i]->cblock]
    t2 = t1->tile->term
    t3 = n[t2->net]
    ...

ammp:
    atom = atom->next
    atom[i]->neighbour[j]
    ++j
    ++i

twolf has more complex access patterns

7 ammp vs twolf: Phase behavior
[Figure: DL1 misses over time for ammp and twolf; x-axis spans 60 billion cycles]

8 ammp vs twolf: Phase behavior by data structure
[Figure: per-data-structure DL1 miss curves over time for ammp and twolf]
ammp has more interesting phase behavior

9 Outline
Motivation
Data structure decomposition
Phase analysis: selecting sampling period
Results:
–Aggregate
–Phase

10 Data structure decomposition
Application communicates with simulator
Leave the core application oblivious; automatically add simulator-aware instrumentation
[Diagram: application and simulator communicating through shared resources]

11 DTrack
[Diagram: application sources pass through a source translator to produce instrumented sources; the compiler builds the application executable, which the simulator runs to emit detailed statistics]
–DTrack's protocol for application-simulator communication

12 DTrack's protocol
1. Application stores mapping at a predetermined shared location
–(start address, end address) → variable name
2. ...and signals simulator by special opcode
–Other techniques possible
3. Simulator detects signal, reads shared location
Simulator now knows the variable names of address regions
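A minimal sketch of the application side of this protocol, in C. The record layout, the dtrack_register name, and the signalling placeholder are illustrative assumptions, not DTrack's actual interface:

    /* Hypothetical DTrack-style registration: struct layout, names,
       and the signal below are illustrative, not DTrack's own. */
    #include <stdint.h>
    #include <string.h>

    /* The predetermined shared location the simulator knows to read. */
    struct dtrack_record {
        uintptr_t start;     /* first address of the region   */
        uintptr_t end;       /* last address of the region    */
        char      name[64];  /* variable name for this region */
    };
    static volatile struct dtrack_record dtrack_shared;

    static void dtrack_register(void *start, void *end, const char *name)
    {
        dtrack_shared.start = (uintptr_t)start;
        dtrack_shared.end   = (uintptr_t)end;
        strncpy((char *)dtrack_shared.name, name,
                sizeof dtrack_shared.name - 1);
        /* Signal the simulator with a special opcode; shown here as an
           inline-asm placeholder that only the simulator interprets. */
        __asm__ volatile("" ::: "memory");  /* placeholder signal */
    }

For example, the instrumentation for a global array might call dtrack_register(atoms, atoms + n_atoms, "atoms") once during initialization.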

13 Instrumentation without perturbance
Global segment: write to file
–Expensive, but a one-time cost during initialization
–Amortized across all global variables
Heap: save in special variables after every malloc/free (see the sketch below)
–Overhead ∝ frequency of mallocs/frees
–Special variables always hit in cache
Stack: no instrumentation
–Function calls too frequent
–Causes negligible misses anyway
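A sketch of the heap case, assuming a hypothetical wrapper the source translator substitutes for malloc; names are illustrative:

    /* Hypothetical malloc wrapper: the simulator watches stores to
       these "special variables" instead of the application writing
       mappings to a file. Names are illustrative, not DTrack's. */
    #include <stdint.h>
    #include <stdlib.h>

    static volatile uintptr_t dtrack_heap_start, dtrack_heap_end;

    void *dtrack_malloc(size_t size)
    {
        void *p = malloc(size);
        if (p) {
            /* Two stores per allocation, so overhead is proportional
               to malloc frequency; the variables stay hot in cache. */
            dtrack_heap_start = (uintptr_t)p;
            dtrack_heap_end   = (uintptr_t)p + size;
        }
        return p;
    }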

14 Measuring perturbance
Communicate specific start and end points in the application to the simulator
Compare instruction counts between them with and without instrumentation
ΔInstruction count < 4%, even with frequent mallocs

15 Outline
Motivation
Data structure decomposition
Phase analysis: selecting sampling period
Results:
–Aggregate
–Phase

16 The importance of sampling period
Good sampling period → low noise
[Figure: the same miss stream sampled at two periods; DL1 misses per 10M cycles (thousands) vs DL1 misses per 230M cycles (thousands)]

17 Volatility: A noise metric for time-sequence graphs
[Diagram: raw data stream → aggregate for some sampling period → miss graph → point volatilities → sort, extract 90th percentile → volatility value → volatility graph]
Point volatility = abs(X_t − X_{t−1}) / max(X_t, X_{t−1})
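A minimal standalone sketch of this metric in C, implementing the formula above; it is not the paper's code, and the function names are mine:

    /* Point volatility: |x_t - x_(t-1)| / max(x_t, x_(t-1)).
       The series volatility is the 90th percentile of the sorted
       point volatilities. */
    #include <math.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    double series_volatility(const double *x, size_t n)
    {
        if (n < 2) return 0.0;
        double *v = malloc((n - 1) * sizeof *v);
        for (size_t t = 1; t < n; t++) {
            double hi = fmax(x[t - 1], x[t]);
            v[t - 1] = hi > 0.0 ? fabs(x[t] - x[t - 1]) / hi : 0.0;
        }
        qsort(v, n - 1, sizeof *v, cmp_double);
        double p90 = v[(size_t)(0.9 * (n - 1))];
        free(v);
        return p90;
    }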

18 Volatility depends on sampling period
[Diagram: the same pipeline as above, with the sampling period as the knob controlling the aggregation from raw data stream to point volatilities to volatility value]

19 Volatility profile: Volatility vs sampling period
[Figure: volatility profile for 164.gzip; x-axis: sampling period (millions of samples), y-axis: volatility]
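One plausible way to compute such a profile: re-aggregate the raw miss stream at successively longer periods and measure each series' volatility. This reuses series_volatility() from the earlier sketch; the doubling schedule is an assumption:

    /* Build a volatility profile by summing the base series into
       buckets of `factor` samples, doubling the period each step. */
    #include <stdio.h>
    #include <stdlib.h>

    double series_volatility(const double *x, size_t n);  /* above */

    void volatility_profile(const double *raw, size_t n)
    {
        double *buf = malloc(n * sizeof *buf);
        for (size_t factor = 1; factor <= n / 2; factor *= 2) {
            size_t m = n / factor;          /* samples at this period */
            for (size_t i = 0; i < m; i++) {
                buf[i] = 0.0;
                for (size_t j = 0; j < factor; j++)
                    buf[i] += raw[i * factor + j];
            }
            printf("period x%zu: volatility %.3f\n",
                   factor, series_volatility(buf, m));
        }
        free(buf);
    }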

20 Outline
Motivation
Data structure decomposition
Phase analysis: selecting sampling period
Results:
–Aggregate
–Phase

21 Methodology
A. Source translator: C-Breeze
B. Compiler: Alpha GEM cc
C. Simulator: sim-alpha
–Validated model of the 21264 pipeline
Simulated machine: Alpha 21264
–4-way issue, 64KB 3-cycle DL1
Benchmarks: 12 C applications from the SPEC CPU2000 suite

22 Major data structures by DL1 misses
[Figure: % DL1 misses for each benchmark's major data structures]

23 Most misses ≣ Most pipeline stalls?
Process (sketched below):
–Detect stall cycles when no instructions were committed
–Assign blame to the data structure of the oldest instruction in the pipeline
Results:
–Stall-cycle ranks track miss-count ranks
–Exceptions: tds in 179.art, search in 186.crafty
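A sketch of the blame-assignment step as it might look inside the simulator's per-cycle loop; every name below is a hypothetical stand-in for simulator state, not sim-alpha's real interface:

    /* Hypothetical per-cycle accounting: the externs stand in for
       simulator internals and are illustrative only. */
    #include <stdint.h>

    extern int       committed_this_cycle(void);
    extern uintptr_t oldest_inflight_addr(int *valid);
    extern int       region_of(uintptr_t addr);  /* addr -> data structure id */
    extern uint64_t  stall_cycles[];             /* per-structure counters */

    void account_stall_cycle(void)
    {
        /* A stall cycle commits no instructions; blame it on the data
           structure touched by the oldest instruction in the pipeline. */
        if (!committed_this_cycle()) {
            int valid;
            uintptr_t addr = oldest_inflight_addr(&valid);
            if (valid)
                stall_cycles[region_of(addr)]++;
        }
    }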

24 Types of phase behavior
[Figure: DL1 misses (millions) over time; Type I: mcf, Type II: art; x-axis spans 115 billion cycles]

25 Types of phase behavior (continued)
[Figure: DL1 misses (millions) over time; Type III: mesa]

26 Summary
More detailed metrics → richer application comparison
Low-overhead data structure decomposition
Determining the ideal sampling period
–A volatility metric inspired by spectral analysis
–The ideal sampling period is application-specific
Data structures in an application share common phase boundaries

