A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)

Presentation transcript:

1 A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)

2 Golden era of computer architecture. ~3 years behind. Era of DIY: Multicore, Reconfigurable, GPUs, Clusters. 10 Cores! 10-Core Intel Xeon: “Unparalleled Performance”. [Chart: SPEC CINT performance (log scale) by year, across CPU92, CPU95, CPU2000, and CPU2006.]

3 P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994) Automatic Speculation Automatic Pipelining Parallel Resources Automatic Allocation/Scheduling Commit

4 MULTICORE ARCHITECTURE (CIRCA 2010) Automatic Pipelining Parallel Resources Automatic Speculation Automatic Allocation/Scheduling Commit

5

6 Realizable parallelism Parallel Library Calls Time Threads Credit: Jack Dongarra

7 “Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
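To put the quoted doubling period in annual terms, the short derivation below converts a doubling time into a yearly growth rate. The 18-month hardware figure is an assumption brought in for contrast (it is the popular statement of Moore's Law, which Proebsting's Law plays on); the slide itself only gives the 18-year figure.

```latex
% Annual growth rate g for a quantity that doubles every T years:
%   (1 + g)^T = 2  =>  g = 2^{1/T} - 1
\begin{align*}
g_{\text{compiler}} &= 2^{1/18} - 1 \approx 0.039 && \text{(about 4\% per year, } T = 18 \text{ years)} \\
g_{\text{hardware}} &= 2^{1/1.5} - 1 \approx 0.587 && \text{(about 59\% per year, assumed } T = 1.5 \text{ years)}
\end{align*}
```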

8 Multicore Needs: 1. Automatic resource allocation/scheduling, speculation/commit, and pipelining. 2. Low overhead access to programmer insight. 3. Code reuse. Ideally, this includes support of legacy codes as well as new codes. 4. Intelligent automatic parallelization. Parallel Programming; Automatic Parallelization; Parallel Libraries; Computer Architecture. Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization. A Roadmap to restoring computing’s former glory.

9 Multicore Needs: 1. Automatic resource allocation/scheduling, speculation/commit, and pipelining. 2. Low overhead access to programmer insight. 3. Code reuse. Ideally, this includes support of legacy codes as well as new codes. 4. Intelligent automatic parallelization. [Diagram labels: New or Existing Sequential Code; DSWP Family Optis; Parallelized Code; Machine Specific Performance Primitives; Complainer/Fixer; Insight Annotation; One Implementation; New or Existing Libraries; Insight Annotation; Other Optis; Speculative Optis.]

10 Spec-PS-DSWP. P6 SUPERSCALAR ARCHITECTURE. [Schedule diagram: stages LD:1 through LD:5, W:1 through W:4, and C:1 through C:3 placed on Cores 1 through 4.]

11 Example: A: while (node) { B: node = node->next; C: res = work(node); D: write(res); } Program Dependence Graph (PDG) over A, B, C, D with Control Dependence and Data Dependence edges. [Schedule diagram: A1, B1, C1, D1, A2, B2, C2, D2 on Cores 1 through 3 over time.]

12 Example: A: while (node) { B: node = node->next; C: res = work(node); D: write(res); } Spec-DOALL. Program Dependence Graph over A, B, C, D with Control Dependence and Data Dependence edges. [Schedule diagram: A1, B1, C1, D1, A2, B2, C2, D2 on Cores 1 through 3 over time.]

13 Example: A: while (node) { B: node = node->next; C: res = work(node); D: write(res); } Spec-DOALL. Program Dependence Graph over A, B, C, D with Control Dependence and Data Dependence edges. [Schedule diagram: iterations A1 B1 C1 D1, A2 B2 C2 D2, and A3 B3 C3 D3 on Cores 1 through 3 over time.]

14 Example: A: while (node) { (shown replaced by: while (true) {) B: node = node->next; C: res = work(node); D: write(res); } Spec-DOALL. Program Dependence Graph over A, B, C, D with Control Dependence and Data Dependence edges. [Schedule diagram: A1, A2, A3 followed by B1 C1 D1, B2 C2 D2, B3 C3 D3, B4 C4 D4 on Cores 1 through 3 over time.] 197.parser: Slowdown.

15 Spec-DOACROSS and Spec-DSWP. Throughput: 1 iter/cycle. [Two schedule diagrams, one per technique: stages B1 through B7, C1 through C6, and D1 through D5 on Cores 1 through 3 over time.]
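As a rough illustration of the decoupling that the DSWP schedule relies on, the sketch below splits the example loop into three stages that communicate only through queues. It is a single-threaded emulation under stated assumptions, not the compiler's actual transformation: `struct node`, the squaring `work()`, the summing stand-in for `write()`, and every function name here are hypothetical, and each stage simply runs to completion before the next one starts. Unlike the slide's loop, it also processes the head node.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 64

struct node { int value; struct node *next; };

/* Hypothetical stand-in for the slide's work(): any pure function will do. */
static int work(const struct node *n) { return n->value * n->value; }

/* Stage B: owns the loop-carried pointer chase and emits nodes into a queue. */
size_t stage_traverse(struct node *head, struct node *queue[], size_t cap)
{
    size_t n = 0;
    for (struct node *p = head; p != NULL && n < cap; p = p->next)
        queue[n++] = p;
    return n;
}

/* Stage C: depends only on values produced by stage B. */
void stage_work(struct node *const in[], size_t n, int out[])
{
    for (size_t i = 0; i < n; i++)
        out[i] = work(in[i]);
}

/* Stage D: consumes results in order; "write" is modeled as a running sum. */
long stage_write(const int in[], size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += in[i];
    return total;
}

/* Builds a list holding 1..n and pushes it through the three stages. */
long run_pipeline(size_t n)
{
    struct node nodes[MAX_ITEMS];
    struct node *q1[MAX_ITEMS];
    int q2[MAX_ITEMS];

    assert(n <= MAX_ITEMS);
    for (size_t i = 0; i < n; i++) {
        nodes[i].value = (int)(i + 1);
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }
    size_t count = stage_traverse(n ? &nodes[0] : NULL, q1, MAX_ITEMS);
    stage_work(q1, count, q2);
    return stage_write(q2, count);
}
```

Because data flows in one direction only (B to C to D), each stage could in principle run on its own core and drain its input queue as items arrive; the emulation keeps them sequential so the dependence structure is easy to check.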

16 Comparison: Spec-DOACROSS and Spec-DSWP. Comm. Latency = 1: 1 iter/cycle. Comm. Latency = 2: Spec-DOACROSS drops to 0.5 iter/cycle; Spec-DSWP stays at 1 iter/cycle, paying only a longer pipeline fill time. [Schedule diagrams: stages B1 through B7, C1 through C6, and D1 through D5 on Cores 1 through 3 over time, one diagram per technique.]
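The throughput figures on this slide can be reproduced with a very small timing model. The sketch below is a simplification under stated assumptions, not a measurement: every stage takes one cycle, "communication latency" means the number of cycles between the start of a producer and the start of its consumer on another core, and the function names are hypothetical. Under DOACROSS the loop-carried value of B crosses cores on every iteration, so the latency sits on the critical path; under DSWP the recurrence stays on one core and latency only delays the downstream stages.

```c
#include <assert.h>

/* Cycle in which stage B of iteration i (0-based) can start under DOACROSS.
 * Assumption: B's loop-carried value hops to another core every iteration,
 * so consecutive B stages start comm_latency cycles apart. */
long doacross_b_start(long i, long comm_latency)
{
    assert(i >= 0 && comm_latency >= 1);
    return i * comm_latency;
}

/* Under DSWP, all B stages run back to back on one core: one per cycle,
 * regardless of the inter-core latency. */
long dswp_b_start(long i, long comm_latency)
{
    assert(i >= 0 && comm_latency >= 1);
    (void)comm_latency; /* latency is off the critical path */
    return i;
}

/* Start cycle of stage D under DSWP with B, C, D on three cores:
 * two one-way hops shift the pipeline (fill time) but not its rate. */
long dswp_d_start(long i, long comm_latency)
{
    assert(i >= 0 && comm_latency >= 1);
    return i + 2 * comm_latency;
}
```

In this model, doubling the latency from 1 to 2 halves DOACROSS throughput (B stages start every 2 cycles) while DSWP still starts one iteration per cycle and merely finishes its first iteration later.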

17 TLS vs. Spec-DSWP [MICRO 2010] Geomean of 11 benchmarks on the same cluster

18 Multicore Needs: 1. Automatic resource allocation/scheduling, speculation/commit, and pipelining. ✓ 2. Low overhead access to programmer insight. 3. Code reuse. Ideally, this includes support of legacy codes as well as new codes. 4. Intelligent automatic parallelization. [Diagram labels: New or Existing Sequential Code; DSWP Family Optis; Parallelized Code; Machine Specific Performance Primitives; Complainer/Fixer; Insight Annotation; One Implementation; New or Existing Libraries; Insight Annotation; Other Optis; Speculative Optis.]

19 char *memory; void * alloc(int size); void * alloc(int size) { void * ptr = memory; memory = memory + size; return ptr; } [Execution Plan diagram: alloc 1 through alloc 6 scheduled on Cores 1 through 3 over time.]

20 char *memory; void * alloc(int size); void * alloc(int size) { void * ptr = memory; memory = memory + size; return ptr; } [Execution Plan diagram: alloc 1 through alloc 6 scheduled on Cores 1 through 3 over time.]

21 char *memory; void * alloc(int size); void * alloc(int size) { void * ptr = memory; memory = memory + size; return ptr; } Easily Understood Non-Determinism! [Execution Plan diagram: alloc 1 through alloc 6 scheduled on Cores 1 through 3 over time.]
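One way to read the non-determinism on this slide: calls to `alloc` may complete in any order across cores, and the program stays correct as long as each call is atomic and hands back a region that no other call received. The sketch below illustrates that property with a C11 atomic bump pointer. It is an assumption-laden stand-in, not the mechanism from the talk: `pool`, `next_offset`, `alloc_atomic`, and `regions_disjoint` are hypothetical names, and the fixed 1024-byte pool replaces the slide's `memory` pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical fixed pool standing in for the slide's `memory` pointer. */
static char pool[1024];
static atomic_size_t next_offset; /* static storage: starts at zero */

/* Bump allocator whose read-modify-write of the offset is one atomic step,
 * so concurrent or reordered calls never hand out overlapping regions. */
void *alloc_atomic(size_t size)
{
    size_t old = atomic_fetch_add(&next_offset, size);
    assert(old + size <= sizeof pool); /* sketch only: no out-of-memory path */
    return pool + old;
}

/* True when [a, a+na) and [b, b+nb) do not overlap (both inside pool). */
int regions_disjoint(void *a, size_t na, void *b, size_t nb)
{
    char *pa = a;
    char *pb = b;
    return pa + na <= pb || pb + nb <= pa;
}
```

Which address a given call receives depends on the order in which calls happen to run, but any order yields disjoint regions, which is the sense in which the calls commute.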

22 [MICRO ’07, Top Picks ’08; Automatic: PLDI ’11] ~50 of ½ million LOC modified in SpecINT 2000. Mods also include Non-Deterministic Branch.

23 Multicore Needs: 1. Automatic resource allocation/scheduling, speculation/commit, and pipelining. ✓ 2. Low overhead access to programmer insight. ✓ 3. Code reuse. Ideally, this includes support of legacy codes as well as new codes. ✓ 4. Intelligent automatic parallelization. [Diagram labels: New or Existing Sequential Code; DSWP Family Optis; Parallelized Code; Machine Specific Performance Primitives; Complainer/Fixer; Insight Annotation; One Implementation; New or Existing Libraries; Insight Annotation; Other Optis; Speculative Optis.]

24 Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]. [Diagram: tree of optimization sequences built from Sum Reduction, Unroll, and Rotate, labeled with speedups 0.90X, 0.10X, 30.0X, 1.1X, 0.8X, and 1.5X.]

25 PS-DSWP Complainer

26 PS-DSWP Complainer. Red Edges: Deps between malloc() & free(). Blue Edges: Deps between rand() calls. Green Edges: Flow Deps inside Inner Loop. Orange Edges: Deps between function calls. Who can help me? Unroll; Sum Reduction; Rotate; Programmer Annotation.

27 PS-DSWP Complainer. Sum Reduction.

28 PS-DSWP Complainer. Sum Reduction. PROGRAMMER: Commutative.

29 PS-DSWP Complainer. Sum Reduction. PROGRAMMER: Commutative. LIBRARY: Commutative.

30 PS-DSWP Complainer. Sum Reduction. PROGRAMMER: Commutative. LIBRARY: Commutative.

31 Multicore Needs: 1. Automatic resource allocation/scheduling, speculation/commit, and pipelining. ✓ 2. Low overhead access to programmer insight. ✓ 3. Code reuse. Ideally, this includes support of legacy codes as well as new codes. ✓ 4. Intelligent automatic parallelization. ✓ [Diagram labels: New or Existing Sequential Code; DSWP Family Optis; Parallelized Code; Machine Specific Performance Primitives; Complainer/Fixer; Insight Annotation; One Implementation; New or Existing Libraries; Insight Annotation; Other Optis; Speculative Optis.]

32 Performance relative to Best Sequential 128 Cores in 32 Nodes with Intel Xeon Processors [MICRO 2010]

33 Restoration of Trend

34 “Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law. Compiler Technology; Architecture/Devices. Era of DIY: Multicore, Reconfigurable, GPUs, Clusters. Compiler technology inspired class of architectures?

35 The End

