WaveScalar. Ken Michelson, David Sunderland, Jared Wilkins, Chris Fisher, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, Susan Eggers.


1 WaveScalar
  Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, Susan Eggers, Tom Anderson, Carl Ebeling, Hank Levy, Steven Swanson, Ken Michelson, David Sunderland, Jared Wilkins, Chris Fisher
  Sponsored by NSF, Intel, The ARCS Foundation, Xilinx, and StoreTek

2 UPC November, 2004 University of Washington
  We should all be going to SIGCOMM
  Things to keep you up at night (~2016)
  Opportunities
    8 billion transistors; 28 GHz
    4 GB per DRAM chip
    120 P4s OR 200,000 RISC-1 per die
  Challenges
    Communication
    Defects
    Complexity
    Performance

3 Monolithic von Neumann Processors
  A phenomenal success today. But in 2016?
  Communication: broadcast networks
  Defect tolerance: 1 flaw -> paperweight
  Complexity: 40-60% of design is validation
  Performance: deeper pipes unlikely (ISCA02)

4 Decentralized Processors
  Communication
  Defect tolerance
  Complexity
  ? Performance
  But how do you execute?

5 Von Neumann is Centralized
  PC-driven fetch is the problem
  One program counter
  Dataflow is the solution

6 Dataflow has been done before...
  Operations fire when data is available
  No program counter
  Convert true control dependences to data dependences
  Exposes massive parallelism
  But...

7 ...it had issues
  Scalability
  Dataflow never executed mainstream code
  No total load-store ordering
  Special languages
  Different memory semantics
  No mutable data structures (mostly)
  Functional (mostly)

8 The WaveScalar ISA
  A dataflow ISA with imperative language support: the best of both worlds
  Von Neumann
    Normal memory semantics.
    Coarse-grain, von Neumann-style threads
  Dataflow
    “Unordered” memory.
    Fine-grain, dataflow-style threads
  Use the best tool for the job.

9 WaveScalar example
  A[j + i*i] = i;
  b = A[i*j];

10 WaveScalar example
  A[j + i*i] = i;
  b = A[i*j];
  [Figure: dataflow graph of the two statements. Inputs i, j, and A feed two multiplies and three adds, which feed a Store and a Load that produces b.]
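The firing rule behind this graph ("operations fire when data is available", with no program counter) can be sketched in a few lines of C. This is a minimal illustration, not the WaveScalar implementation: the `Node` layout, the port numbering, and the `run_example` helper are all assumptions made for the sketch. It also does nothing to order the Store against the Load; here they happen to fire in that order only because of the scan order, which is the gap wave-ordered memory is meant to close.

```c
#include <assert.h>

enum Op { OP_ADD, OP_MUL, OP_STORE, OP_LOAD };

/* Hypothetical node layout: each instruction waits for its operand tokens. */
typedef struct {
    enum Op op;
    int arity;     /* number of input operands (1 or 2) */
    int dst;       /* index of the consumer node, or -1 for none */
    int dst_port;  /* input port on the consumer */
    int in[2];     /* operand values */
    int have[2];   /* operand-present flags */
    int fired;
} Node;

enum { N_MUL_II, N_ADD_J, N_ADDR_ST, N_STORE,
       N_MUL_IJ, N_ADDR_LD, N_LOAD, N_COUNT };

/* Deliver one token to an input port of a node. */
static void send(Node *g, int node, int port, int value) {
    g[node].in[port] = value;
    g[node].have[port] = 1;
}

/* A node is ready when it has not fired and all operands are present. */
static int ready(const Node *n) {
    if (n->fired) return 0;
    for (int p = 0; p < n->arity; p++)
        if (!n->have[p]) return 0;
    return 1;
}

/* Evaluates  A[j + i*i] = i;  b = A[i*j];  where A starts at mem[base].
   Returns b. */
int run_example(int *mem, int mem_len, int base, int i, int j) {
    Node g[N_COUNT] = {
        [N_MUL_II]  = { OP_MUL,   2, N_ADD_J,   1 },  /* i*i           */
        [N_ADD_J]   = { OP_ADD,   2, N_ADDR_ST, 1 },  /* j + i*i       */
        [N_ADDR_ST] = { OP_ADD,   2, N_STORE,   0 },  /* A + (j + i*i) */
        [N_STORE]   = { OP_STORE, 2, -1,        0 },  /* port 0 = address, port 1 = value */
        [N_MUL_IJ]  = { OP_MUL,   2, N_ADDR_LD, 1 },  /* i*j           */
        [N_ADDR_LD] = { OP_ADD,   2, N_LOAD,    0 },  /* A + i*j       */
        [N_LOAD]    = { OP_LOAD,  1, -1,        0 },
    };
    int b = 0;

    /* Inject the input tokens i, j, and A. */
    send(g, N_MUL_II, 0, i);  send(g, N_MUL_II, 1, i);
    send(g, N_ADD_J, 0, j);
    send(g, N_ADDR_ST, 0, base);
    send(g, N_STORE, 1, i);
    send(g, N_MUL_IJ, 0, i);  send(g, N_MUL_IJ, 1, j);
    send(g, N_ADDR_LD, 0, base);

    /* Keep firing any ready node until nothing changes. */
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int k = 0; k < N_COUNT; k++) {
            Node *n = &g[k];
            int out = 0;
            if (!ready(n)) continue;
            switch (n->op) {
            case OP_ADD: out = n->in[0] + n->in[1]; break;
            case OP_MUL: out = n->in[0] * n->in[1]; break;
            case OP_STORE:
                assert(n->in[0] >= 0 && n->in[0] < mem_len);
                mem[n->in[0]] = n->in[1];
                break;
            case OP_LOAD:
                assert(n->in[0] >= 0 && n->in[0] < mem_len);
                out = mem[n->in[0]];
                b = out;
                break;
            }
            n->fired = 1;
            if (n->dst >= 0) send(g, n->dst, n->dst_port, out);
            progress = 1;
        }
    }
    return b;
}
```

Calling `run_example(mem, 16, 0, 2, 3)` stores 2 into `mem[7]` and returns whatever `mem[6]` held, since the two addresses do not alias for those inputs.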


17 Wave-ordered memory
  Compiler annotates memory operations
  Send memory requests in any order
  Hardware reconstructs the correct order

  Operation     Load  Store  Load  Store  Load  Store
  Sequence #    3     4      8     5      6     7
  Successor     4     ?      9     6      8     8
  Predecessor   2     3      ?     4      5     4

18 Wave-ordering Example
  [Figure: a store buffer receiving the annotated Load and Store requests out of order and issuing them in wave order.]
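One way to picture how a store buffer could rebuild program order from the predecessor, sequence, and successor annotations is the sketch below. It assumes the linking rule that an operation may issue directly after the last issued one when either the last operation's successor names it or its own predecessor names the last operation, with `?` treated as a wildcard. The names `MemOp`, `WILD`, `follows`, and `wave_order` are invented for this illustration, and wave numbers are left out.

```c
#include <assert.h>
#include <stddef.h>

#define WILD (-1)   /* the '?' annotation: link not known at compile time */
#define MAX_OPS 64  /* arbitrary capacity for this sketch */

typedef struct { int pred, seq, succ; } MemOp;

/* Nonzero if `next` may issue directly after `last`. */
static int follows(const MemOp *last, const MemOp *next) {
    if (last->succ != WILD && last->succ == next->seq) return 1;
    if (next->pred != WILD && next->pred == last->seq) return 1;
    return 0;
}

/* Takes requests in arrival order and writes the sequence numbers in issue
   order to `issued`. `first_seq` is the sequence number that starts the wave.
   Returns how many requests could be issued; it stops at the first gap. */
size_t wave_order(const MemOp *arrived, size_t n, int first_seq, int *issued) {
    int done[MAX_OPS] = {0};
    const MemOp *last = NULL;
    size_t count = 0;

    assert(n <= MAX_OPS);
    for (;;) {
        size_t pick = n;
        for (size_t k = 0; k < n && pick == n; k++) {
            if (done[k]) continue;
            if (last == NULL ? arrived[k].seq == first_seq
                             : follows(last, &arrived[k]))
                pick = k;
        }
        if (pick == n) break;  /* next request has not arrived yet */
        done[pick] = 1;
        last = &arrived[pick];
        issued[count++] = last->seq;
    }
    return count;
}
```

With the annotations from the table, the requests (?,8,9), (5,6,8), (2,3,4), (4,5,6), (3,4,?) arriving in that order are issued as 3, 4, 5, 6, 8. If (4,5,6) is missing, only 3 and 4 issue, because nothing links 4 to 6 or 8.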

19 Wave-ordered Memory
  Waves are loop-free sections of the dataflow graph
  Each dynamic wave has a wave number
  Wave-ordered memory
    Wave numbers
    Sequence numbers

20 WaveScalar Execution Model
  [Figure: the dataflow graph from the WaveScalar example.]
  Put an ALU at every word of instruction memory.
  No processor core.
  Instructions communicate directly.

21 The WaveCache
  The I-Cache is the processor.

22 Processing Element

23 Domain

24 Cluster

25 The WaveCache
  Long distance communication
    Dynamic routing
    Grid-based network
    1 cycle/cluster
  Traditional cache coherence
  Normal memory hierarchy
  16K instructions

26 The WaveCache in Action!

27 Performance
  Cycle-accurate simulator
  Binary translator from Alpha -> WaveScalar assembly
  A selection of Spec2000 and MediaBench
  WaveCache
    ~2000 processing elements
    No speculation
  Compare to a very aggressive superscalar
    15-stage, 16-wide
    1024 registers, 1024-entry issue queue

28 WaveCache Performance

29 Decentralized Processing
  Communication
  Defect tolerance
  Complexity
  High Performance

30 Multithreading the WaveCache
  What is a thread?
    A “flow of control”?
    Von Neumann: PC + registers.
    WaveScalar: A memory ordering?
  How do threads work in WaveScalar?
    ISA changes
    Architectural changes

31 ISA Support for Threads
  Extend tag with “ThreadID”
  Instructions for memory ordering management
    Mem_Sequence_Start -- associate ordering with a ThreadID
    Mem_Sequence_Stop -- destroy ordering
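Extending the tag with a ThreadID means that operand tokens from different threads never match, even when they target the same instruction. A rough sketch of that matching rule follows. The field names and widths, the `Token` type, and `try_fire_add` are assumptions for illustration; the slides do not specify the tag layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag: a wave number extended with a thread identifier. */
typedef struct { uint32_t thread_id; uint32_t wave; } Tag;
typedef struct { Tag tag; int value; } Token;

static int tag_equal(Tag a, Tag b) {
    return a.thread_id == b.thread_id && a.wave == b.wave;
}

/* A two-input add fires only when both operands carry the same tag.
   Returns 1 and fills `out` on a match, 0 otherwise. */
int try_fire_add(const Token *lhs, const Token *rhs, Token *out) {
    assert(lhs != NULL && rhs != NULL && out != NULL);
    if (!tag_equal(lhs->tag, rhs->tag)) return 0;
    out->tag = lhs->tag;  /* the result inherits the operands' tag */
    out->value = lhs->value + rhs->value;
    return 1;
}
```

Two tokens tagged (thread 1, wave 7) combine into one result with the same tag, while a token from thread 2 with the same wave number is left waiting for its own partner.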

32 Thread Synchronization
  Memory-based (Test-And-Set)
    Spin on memory
  Memory-free (Thread_Coordinate)
    A queue lock.
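For readers less familiar with the memory-based option, the following is a generic test-and-set spin lock written with C11 atomics. It illustrates the "spin on memory" idea on a conventional machine and is not WaveScalar code; the `SpinLock` type and function names are made up for the sketch. Thread_Coordinate is specific to WaveScalar and is not modeled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct { atomic_flag held; } SpinLock;

void spin_init(SpinLock *l) {
    assert(l != NULL);
    atomic_flag_clear(&l->held);
}

/* One test-and-set attempt: returns 1 if the lock was acquired. */
int spin_try_lock(SpinLock *l) {
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin on the memory location until the test-and-set succeeds. */
void spin_lock(SpinLock *l) {
    while (!spin_try_lock(l)) {
        /* busy-wait */
    }
}

void spin_unlock(SpinLock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Every waiting thread repeatedly hits the same memory word, which is the cost a memory-free primitive avoids.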

33 Hardware support for threads
  Very little must change
    Wider busses
    Wider input queues
    More store buffers
  ThreadIDs control instruction replication
    One copy of each instruction per ThreadID

34 Multithreaded Performance

35 The WaveScalar ISA
  Von Neumann
    Normal memory semantics.
    Coarse-grain, von Neumann-style threads
  Dataflow
    “Unordered” memory.
    Fine-grain, dataflow-style threads

36 Unordered memory
  Load_Unordered
    A normal load (but not wave-ordered)
  Store_Unordered
    Write to memory and return a value.
  Mem_nop_Ack
    A no-op, but returns a value upon execution.
    Coordination point between ordered and unordered operations.

37 Exploiting Unordered Memory
  Fine-grain intermingling
  typedef struct { int x, y; } Pair;
  int foo(Pair *p, int *a, int *b) {
    Pair r;
    *a = 0;
    r.x = p->x;
    r.y = p->y;
    return *b;
  }

38 Exploiting Unordered Memory
  Fine-grain intermingling
  typedef struct { int x, y; } Pair;
  int foo(Pair *p, int *a, int *b) {
    Pair r;
    *a = 0;
    r.x = p->x;
    r.y = p->y;
    return *b;
  }
  [Figure: dataflow graph for foo, with operations St *a, 0; Mem_nop_ack; Ld *b; Ld p->x; Ld p->y; St r.x; St r.y; +, grouped under the labels Ordered and Unordered.]


47 Dataflow performance

48 Putting it all together: Equake
  Finite element earthquake simulation
  >90% of execution is in two functions
  Sim()
    Series of data-independent loops
    Initialization and copying
    Thread pool implementation
  Smvp()
    Cross-iteration dependences
    Basically matrix multiplications
    Rewrite in WaveScalar assembly

49 Putting it all together: Equake
  [Figure: Equake results; surviving labels: (3.5), (11), Single-threaded.]

50 WaveScalar’s Future
  Steven Swanson, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, Susan Eggers, Tom Anderson, Carl Ebeling, Hank Levy, Ken Michelson, David Sunderland, Jared Wilkens, Chris Fisher

51 Microarchitecture (Steven Swanson, Andrew Putnam, Ken Michelson)
  Domain
  How to spend wires?
  What are PEs?
  Network topology and routing
  SystemC model

52 Microarchitecture Status
  HDL done
    PE
    Domain
    Store buffer/cache
    Network switch
  4x4 WaveCache, 8 PEs/domain
  ~160 mm^2 @ 90 nm
  Tools estimate ~250-300 MHz

53 Instruction Placement (Martha Mercaldi)
  Static vs. Dynamic
  Simulated annealing
  Instruction migration
    Which instruction to evict?
    How aggressively?

54 Compiler (Andrew Petersen, David Sunderland)
  Custom WaveScalar optimizations
    Unordered memory operations
    Alias Analysis
  Re-examine well-known optimizations
    Is software pipelining useful?
  [Figure: source languages feeding the compiler: dataflow languages (SISAL, Id, etc.), C, C++, ???]

55 Operating System (Andrew Schwerin)
  Cache and Address organization
  Coherence protocols
  Fine-grained protection domains.

56 FPGA Prototype (Chris Fisher, Jared Wilkens)
  FPGA prototype
  Boards
    4 FPGA w/ 2 PPC cores
    DDR Memory
    SRAM
  Attached to a PPC “Brain”

57 Conclusions
  WaveScalar ISA
    A unified dataflow and von Neumann execution model
    Mix-and-Match parallelism models
  WaveCache Architecture
    Outperforms an OOO superscalar by 2.8x
    Excellent multi-threaded performance.
    Over 300 IPC for hand-coded apps.
    And you can build it today!!!
  Enormous opportunities for future research

