A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)

Golden era of computer architecture 1992 2012 199419961998200020022004 200620082010 ~ 3 years behind CPU92 CPU95 CPU2000 CPU2006 Year SPEC CINT Performance (log. Scale) Era of DIY: Multicore Reconfigurable GPUs Clusters 10 Cores! 10-Core Intel Xeon “Unparalleled Performance”

P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994) Automatic Speculation Automatic Pipelining Parallel Resources Automatic Allocation/Scheduling Commit

M ULTICORE A RCHITECTURE (C IRCA 2010) Automatic Pipelining Parallel Resources Automatic Speculation Automatic Allocation/Scheduling Commit

Realizable parallelism Parallel Library Calls Time Threads Credit: Jack Dongarra

“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law

Multicore Needs: 1.Automatic resource allocation/scheduling, speculation/commit, and pipelining. 2.Low overhead access to programmer insight. 3.Code reuse. Ideally, this includes support of legacy codes as well as new codes. 4.Intelligent automatic parallelization. Parallel Programming Automatic Parallelization Parallel Libraries Computer Architecture Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization A Roadmap to restoring computing’s former glory.

Multicore Needs: 1.Automatic resource allocation/scheduling, speculation/commit, and pipelining. 2.Low overhead access to programmer insight. 3.Code reuse. Ideally, this includes support of legacy codes as well as new codes. 4.Intelligent automatic parallelization. New or Existing Sequential Code DSWP Family Optis Parallelized Code Machine Specific Performance Primitives Complainer/Fixer Insight Annotation One Implementation New or Existing Libraries Insight Annotation Other Optis Speculative Optis

0 1 2 3 4 5 LD:1 LD:2 W:1 W:3 LD:3 Core 1Core 2Core 3 W:2 W:4 LD:4 LD:5 C:1 C:2 C:3 Core 4 Spec-PS-DSWP P6 SUPERSCALAR ARCHITECTURE

Example A: while (node) { B: node = node->next; C: res = work(node); D: write(res); } B1 C1 A1 Core 1Core 2 Core 3 A2 B2 D1 C2 D2 Time Program Dependence Graph AB D C Control Dependence Data Dependence PDG

Example A: while (node) { B: node = node->next; C: res = work(node); D: write(res); } B1 C1 A1 Core 1Core 2 Core 3 A2 B2 D1 C2 D2 Time Spec-DOALL SpecDOALL Program Dependence Graph AB D C Control Dependence Data Dependence

Example A: while (node) { B: node = node->next; C: res = work(node); D: write(res); } Core 1Core 2 Core 3 Time Spec-DOALL A2 B2C2 D2 A1 B1C1 D1 A3 B3C3 D3 SpecDOALL Program Dependence Graph AB D C Control Dependence Data Dependence

Example B: node = node->next; C: res = work(node); D: write(res); } Core 1Core 2 Core 3 Time Program Dependence Graph AB D C Control Dependence Data Dependence Spec-DOALL A2A1A3 B2 C2 D2 B1 C1 D1 B3 C3 D3 A: while (node) { while (true) { B2 C2 D2 B3 C3 D3 B4 C4 D4 197.parser Slowdown SpecDOALLPerf

Core 1Core 2 Core 3 Time C1 D1 B1 B7 C3 D3 B3 C4 D4 B4 C5 D5 B5 C6 B6 Spec-DOACROSS Core 1Core 2 Core 3 Time Spec-DSWP C2 D2 B2 C1 D1 B1 B3 B4 B2 C2 C3 D2 B5 B6 B7 D3 C5 C6 C4 D5 D4 Throughput: 1 iter/cycle DOACROSSDSWP

Comparison: Spec-DOACROSS and Spec-DSWP Comm.Latency = 2: Comm.Latency = 1: 1 iter/cycle Core 1Core 2 Core 3 Time C1 D1 B1 C2 D2 B2 C3 D3 B3 Core 1Core 2 Core 3 B2 B3 B1 B5 B6 B4 C2 C3 C1 C5 C6 C4 B7 Pipeline Fill time 0.5 iter/cycle 1 iter/cycle D2 D3 D1 D5 D4 Time C4 D4 B4 C5 D5 B5 C6 B6 LatencyProblem B7

TLS vs. Spec-DSWP [MICRO 2010] Geomean of 11 benchmarks on the same cluster

Multicore Needs: 1.Automatic resource allocation/scheduling, speculation/commit, and pipelining.  2.Low overhead access to programmer insight. 3.Code reuse. Ideally, this includes support of legacy codes as well as new codes. 4.Intelligent automatic parallelization. New or Existing Sequential Code DSWP Family Optis Parallelized Code Machine Specific Performance Primitives Complainer/Fixer Insight Annotation One Implementation New or Existing Libraries Insight Annotation Other Optis Speculative Optis

19 char *memory; void * alloc(int size); void * alloc(int size) { void * ptr = memory; memory = memory + size; return ptr; } Core 1Core 2 Time Core 3 Execution Plan alloc 1 alloc 2 alloc 3 alloc 4 alloc 5 alloc 6

20 char *memory; void * alloc(int size); @Commutative void * alloc(int size) { void * ptr = memory; memory = memory + size; return ptr; } Core 1Core 2 Time Core 3 Execution Plan alloc 1 alloc 2 alloc 3 alloc 4 alloc 5 alloc 6

21 char *memory; void * alloc(int size); @Commutative Core 1Core 2 Time Core 3 Execution Plan alloc 1 alloc 2 alloc 3 alloc 4 alloc 5 alloc 6 void * alloc(int size) { void * ptr = memory; memory = memory + size; return ptr; } Easily Understood Non-Determinism!

[MICRO ‘07, Top Picks ’08; Automatic: PLDI ‘11] ~50 of ½ Million LOCs modified in SpecINT 2000 Mods also include Non-Deterministic Branch

Multicore Needs: 1.Automatic resource allocation/scheduling, speculation/commit, and pipelining.  2.Low overhead access to programmer insight.  3.Code reuse. Ideally, this includes support of legacy codes as well as new codes.  4.Intelligent automatic parallelization. New or Existing Sequential Code DSWP Family Optis Parallelized Code Machine Specific Performance Primitives Complainer/Fixer Insight Annotation One Implementation New or Existing Libraries Insight Annotation Other Optis Speculative Optis

24 Sum Reduction Sum Reduction Unroll Rotate 0.90X 0.10X 30.0X 1.1X 0.8X Sum Reduction Sum Reduction Unroll Sum Reduction Sum Reduction Rotate Unroll 1.5X Iterative Compilation [Cooper ‘05; Almagor ‘04; Triantafyllis ’05]

PS-DSWP Complainer PS-DSWP Complainer

Red Edges: Deps between malloc() & free() Blue Edges: Deps between rand() calls Green Edges: Flow Deps inside Inner Loop Orange Edges: Deps between function calls Unroll Sum Reduction Sum Reduction Rotate PS-DSWP Complainer PS-DSWP Complainer Who can help me? Programmer Annotation Programmer Annotation

PS-DSWP Complainer PS-DSWP Complainer Sum Reduction Sum Reduction

PS-DSWP Complainer PS-DSWP Complainer Sum Reduction Sum Reduction PROGRAMMER Commutative PROGRAMMER Commutative

PS-DSWP Complainer PS-DSWP Complainer Sum Reduction Sum Reduction PROGRAMMER Commutative PROGRAMMER Commutative LIBRARY Commutative LIBRARY Commutative

Multicore Needs: 1.Automatic resource allocation/scheduling, speculation/commit, and pipelining.  2.Low overhead access to programmer insight.  3.Code reuse. Ideally, this includes support of legacy codes as well as new codes.  4.Intelligent automatic parallelization.  New or Existing Sequential Code DSWP Family Optis Parallelized Code Machine Specific Performance Primitives Complainer/Fixer Insight Annotation One Implementation New or Existing Libraries Insight Annotation Other Optis Speculative Optis

Performance relative to Best Sequential 128 Cores in 32 Nodes with Intel Xeon Processors [MICRO 2010]

Restoration of Trend

“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law Compiler Technology Architecture/Devices Era of DIY: Multicore Reconfigurable GPUs Clusters Compiler technology inspired class of architectures?

The End

A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)

Similar presentations

Presentation on theme: "A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)

Similar presentations

Presentation on theme: "A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)"— Presentation transcript:

Similar presentations

About project

Feedback