Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading)
Mark D. Hill, Wisconsin Multifacet Project. © 2004 Mark D. Hill


1 Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading)
Mark D. Hill
Computer Sciences Dept. and Electrical & Computer Engineering Dept.
University of Wisconsin–Madison
Multifacet Project (www.cs.wisc.edu/multifacet)
October 2004
Full Disclosure: Consult for Sun & US NSF
Wisconsin Multifacet Project © 2004 Mark D. Hill

2 Executive Summary: Problem
Expect computer performance doubling every 2 years
Derives from Technology & Architecture
Technology will advance for ten or more years
But Architecture faces a Rock: Slow Memory
–a.k.a. Memory Wall [Wulf & McKee 1995]
Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)

3 Executive Summary: Recommendation
Chip Multiprocessing (CMP) Can Help
–Implement multiple processors per chip
–>>10x cost-performance for multithreaded workloads
–What about software with one apparent thread?
Go to Hard Place: Mainstream Multithreading
–Make most workloads flourish with chip multiprocessing
–Computer architects can help, but long run
–Requires moving multithreading from CS fringe to center (algorithms, programming languages, …, hardware)
Necessary For Restoring Popular Moore’s Law

4 Outline
Executive Summary
Background
–Moore’s Law
–Architecture
–Instruction Level Parallelism
–Caches
Going Forward Processor Architecture Hits Rock
Chip Multiprocessing to the Rescue?
Go to the Hard Place of Mainstream Multithreading

5 Society Expects A Popular Moore’s Law
Computing critical: commerce, education, engineering, entertainment, government, medicine, science, …
–Servers (> PCs)
–Clients (= PCs)
–Embedded (< PCs)
Come to expect a misnamed “Moore’s Law”
–Computer performance doubles every two years (same cost)
–→ Progress in next two years = All past progress
Important Corollary
–Computer cost halves every two years (same performance)
–→ In ten years, same performance for 3% (sales tax – Jim Gray)
Derives from Technology & Architecture
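The arithmetic behind both claims on this slide can be checked directly. The following is a minimal sketch, assuming performance doubles (and cost halves) exactly once every two years; the function names `relative_performance` and `relative_cost` are illustrative and do not come from the talk.

```python
def relative_performance(years, doubling_period_years=2.0):
    """Performance relative to today, assuming one doubling per period."""
    return 2.0 ** (years / doubling_period_years)


def relative_cost(years, halving_period_years=2.0):
    """Cost of today's performance relative to today's cost,
    assuming one halving per period."""
    return 0.5 ** (years / halving_period_years)


if __name__ == "__main__":
    # Ten years is five halvings, so cost falls to 1/32 of today's.
    print(f"cost after 10 years: {relative_cost(10):.3%} of today's")

    # The gain over the next two years equals all progress made so far.
    gain_next_period = relative_performance(12) - relative_performance(10)
    print(gain_next_period == relative_performance(10))
```

Five halvings give 1/32, which is the roughly 3% figure quoted on the slide.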

6 (Technologist’s) Moore’s Law Provides Transistors
Number of transistors per chip doubles every two years (18 months)
Merely a “Law” of Business Psychology

7 Performance from Technology & Architecture
Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.

8 Architects Use Transistors To Compute Faster
Bit Level Parallelism (BLP) within Instructions
Instruction Level Parallelism (ILP) among Instructions
Scores of speculative instructions look sequential!
[Figure: two plots of instructions (Instrns) against Time]

9 Architects Use Transistors to Tolerate Slow Memory
Cache
–Small, Fast Memory
–Holds information (expected) to be used soon
–Mostly Successful
Apply Recursively
–Level-one cache(s)
–Level-two cache
Most of microprocessor die area is cache!
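The "apply recursively" idea can be made concrete with the standard average-memory-access-time relation (hit time plus miss rate times miss penalty), where the miss penalty of one level is the average access time of the next. This is a minimal sketch; the latencies and hit rates below are hypothetical placeholders, not figures from the talk.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    `levels` is a list of (hit_latency_cycles, hit_rate) pairs ordered
    from the level-one cache outward. A miss at one level pays the
    average access time of the remaining levels, so the definition
    recurses until it reaches main memory.
    """
    if not levels:
        return memory_latency
    hit_latency, hit_rate = levels[0]
    miss_penalty = average_access_time(levels[1:], memory_latency)
    return hit_latency + (1.0 - hit_rate) * miss_penalty


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the recursion.
    hierarchy = [(2, 0.95), (12, 0.80)]  # level-one, then level-two
    print(average_access_time(hierarchy, memory_latency=300))
    print(average_access_time([], memory_latency=300))  # no caches at all
```

With these assumed numbers the two cache levels bring the average access time from 300 cycles down to a few cycles, which is the sense in which caches are "mostly successful" at tolerating slow memory.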

10 Outline
Executive Summary
Background
Going Forward Processor Architecture Hits Rock
–Technology Continues
–Slow Memory
–Implications
Chip Multiprocessing to the Rescue?
Go to the Hard Place of Mainstream Multithreading

11 Future Technology Implications
For (at least) ten years, Moore’s Law continues
–More repeated doublings of number of transistors per chip
–Faster transistors
But hard for processor architects to use
–More transistors due to global wire delays
–Faster transistors due to too much dynamic power
Moreover, hitting a Rock: Slow Memory
–Memory access = 100s floating-point multiplies!
–a.k.a. Memory Wall [Wulf & McKee 1995]

12 Rock: Memory Gets (Relatively) Slower
Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.

13 Impact of Slow Memory (Rock)
Off-Chip Misses are now hundreds of cycles
[Figure: instructions I1 to I4 (Instrns) plotted against Time, alternating Compute Phases and Memory Phases, window = 4 (64); panels: Good Case! and More Realistic Case]

14 Implications of Slow Memory (Rock)
Increasing Memory Latency hides Compute Phase
Near Term Implications
–Reduce memory latency
–Fewer memory accesses
–More Memory Level Parallelism (MLP)
Longer Term Implications
–What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000?
–What can amazing speculative hardware do?
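The three near-term levers (lower latency, fewer accesses, more MLP) can be compared in a deliberately crude model. The sketch below assumes that execution strictly alternates compute phases and memory phases, that up to `mlp` independent misses overlap within one memory phase, and that no computation overlaps a memory phase. The function names and all numbers are hypothetical, not taken from the talk.

```python
def execution_cycles(compute_cycles, misses, miss_latency, mlp=1):
    """Total cycles under the alternating compute/memory phase model."""
    if mlp < 1:
        raise ValueError("mlp must be at least 1")
    memory_phases = -(-misses // mlp)  # ceiling division
    return compute_cycles + memory_phases * miss_latency


def memory_fraction(compute_cycles, misses, miss_latency, mlp=1):
    """Share of total time spent waiting in memory phases."""
    total = execution_cycles(compute_cycles, misses, miss_latency, mlp)
    return (total - compute_cycles) / total


if __name__ == "__main__":
    base = dict(compute_cycles=1000, misses=10)
    # Growing latency makes the memory phases dominate the compute phases.
    for latency in (100, 400):
        print(latency, execution_cycles(miss_latency=latency, **base),
              memory_fraction(miss_latency=latency, **base))
    # Overlapping four misses at a time recovers much of the lost time.
    print(execution_cycles(miss_latency=400, mlp=4, **base))
```

In this model, quadrupling the miss latency pushes the memory share of run time from one half to four fifths, while raising MLP shrinks the number of memory phases rather than their length.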

15 Assessment So Far
Appears
–Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
Processor performance hitting Rock (Slow Memory)
No known way to overcome this, unless
Redefine performance in Popular Moore’s Law
–From Processor Performance
–To Chip Performance

16 Outline
Executive Summary
Background
Going Forward Processor Architecture Hits Rock
Chip Multiprocessing to the Rescue?
–Small & Large CMPs
–CMP Systems
–CMP Workload
Go to the Hard Place of Mainstream Multithreading

17 Performance for Chip, not Processor or Thread
Chip Multiprocessing (CMP)
Replicate Processor
Private L1 Caches
–Low latency
–High bandwidth
Shared L2 Cache
–Larger than if private

18–25 Piranha Processing Node (built up one component per slide)
Next few slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
CPU: Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec
Protocol Engines (HE & RE): µprog., 1K µinstr., even/odd interleaving
System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Figure: node block diagram with eight CPUs, each with I$ and D$, plus L2$ banks, ICS, MEM-CTL, HE, RE, and Router]

26 Single-Chip Piranha Performance
Piranha’s performance margin 3x for OLTP and 2.2x for DSS
Piranha has more outstanding misses → better utilizes memory system

27 Simultaneous Multithreading (SMT)
Multiplex S logical processors on each processor
–Replicate registers, share caches, & manage other parts
–Implementation factors keep S small, e.g., 2-4
Cost-effective gain if threads available
–E.g., S=2 → 1.4x performance
Modest cost
–Limits waste if additional logical processor(s) not used
Worthwhile CMP enhancement

28 Small CMP Systems
Use One CMP (with C cores of S-way SMT)
–C=[2,16] & S=[2,4] → C*S = [4,64]
–Size of a small PC!
Directly Connect CMP (C) to Memory Controller (M) or DRAM
[Figure: CMP (C) connected directly to memory (M)]

29 Medium CMP Systems
Use 2-16 CMPs (with C cores of S-way SMT)
–Smaller: 2*4*4 = 32
–Larger: 16*16*4 = 1024
–In a single cabinet
Connecting CMPs & Memory Controllers/DRAM & many issues
[Figure: two ways to connect CMPs (C) and memory (M): Processor-Centric and Dance Hall]
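The thread counts quoted for small and medium systems are products of three factors: number of CMPs, cores per CMP (C), and SMT ways per core (S). A minimal sketch of that bookkeeping follows; the function name `hardware_threads` is illustrative only.

```python
def hardware_threads(cmps, cores_per_cmp, smt_ways):
    """Number of hardware thread contexts in a system of identical CMPs."""
    if min(cmps, cores_per_cmp, smt_ways) < 1:
        raise ValueError("all factors must be at least 1")
    return cmps * cores_per_cmp * smt_ways


if __name__ == "__main__":
    # Small systems: one CMP, C in [2, 16], S in [2, 4].
    print(hardware_threads(1, 2, 2), hardware_threads(1, 16, 4))
    # Medium systems: the "smaller" and "larger" cases from the slide.
    print(hardware_threads(2, 4, 4), hardware_threads(16, 16, 4))
```

Software has to supply that many runnable threads to keep such a system busy, which is the concern the later slides take up.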

30 Inflection Points
Inflection point occurs when
–Smooth input change leads to
–Disruptive output change
Enough transistors for …
–1970s simple microprocessor
–1980s pipelined RISC
–1990s speculative out-of-order
–2000s …
CMP will be Server Inflection Point
–Expect >10x performance for less cost
–Implying, >>10x cost-performance
–Early CMPs like old SMPs but expect dramatic advances!

31 So What’s Wrong with CMP Picture?
Chip Multiprocessors
–Allow profitable use of more transistors
–Support modest to vast multithreading
–Will be inflection point for commercial servers
But
–Many workloads have single thread (available to run)
–Even if single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing)
Go to a Hard Place
–Make most workloads flourish with CMPs

32 Outline
Executive Summary
Background
Going Forward Processor Architecture Hits Rock
Chip Multiprocessing to the Rescue?
Go to the Hard Place of Mainstream Multithreading
–Parallel from Fringe to Center
–For All of Computer Science!

33 Thread Parallelism from Fringe to Center
History
–Automatic Computer (vs. Human) → Computer
–Digital Computer (vs. Analog) → Computer
Must Change
–Parallel Computer (vs. Sequential) → Computer
–Parallel Algorithm (vs. Sequential) → Algorithm
–Parallel Programming (vs. Sequential) → Programming
–Parallel Library (vs. Sequential) → Library
–Parallel X (vs. Sequential) → X
Otherwise, repeated performance doublings unlikely

34 Computer Architects Can Contribute
Chip Multiprocessor Design
–Transcend pre-CMP multiprocessor design
–Intra-CMP has lower latency & much higher bandwidth
Hide Multithreading (Helper Threads)
Assist Multithreading (Thread-Level Speculation)
Ease Multithreaded Programming (Transactions)
Provide a “Gentle Ramp to Parallelism” (Hennessy)

35 But All of Computer Science is Needed
Hide Multithreading (Libraries & Compilers)
Assist Multithreading (Development Environments)
Ease Multithreaded Programming (Languages)
Divide & Conquer Multithreaded Complexity (Theory & Abstractions)
Must Enable
–99% of programmers think sequentially while
–99% of instructions execute in parallel
Enable a “Parallelism Superhighway”
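One way a library can hide multithreading is to offer a call that reads like an ordinary sequential map while distributing the work across threads internally. The sketch below uses Python's standard `concurrent.futures` module as an illustration; the helpers `parallel_map` and `word_count` are hypothetical names, and this is one possible shape for such a library, not a design proposed in the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, max_workers=4):
    """Apply `func` to every item using a pool of worker threads.

    The caller sees a plain list in input order, exactly as a sequential
    loop would produce; thread creation and scheduling stay hidden.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


def word_count(line):
    """Per-item work; it must not depend on other items."""
    return len(line.split())


if __name__ == "__main__":
    lines = ["between a rock", "and a hard place", ""]
    # Reads like a sequential map, runs on several threads.
    print(parallel_map(word_count, lines))
```

The caller writes sequential-looking code and only has to guarantee that the per-item function is independent. In standard CPython, threads mainly help I/O-bound work, so a process pool would be the analogous choice for CPU-bound functions.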

36 Summary
(Single-Threaded) Computing faces a Rock: Slow Memory
Popular Moore’s Law (doubling performance) will end soon
Chip Multiprocessing Can Help
–>>10x cost-performance for multithreaded workloads
–What about software with one apparent thread?
Go to Hard Place: Mainstream Multithreading
–Make most workloads flourish with chip multiprocessing
–Computer architects can help, but long run
–Requires moving multithreading from CS fringe to center
Necessary For Restoring Popular Moore’s Law

