February 12, 1999 Architecture and Circuits: 1 Interconnect-Oriented Architecture and Circuits William J. Dally Computer Systems Laboratory Stanford University.


1 February 12, 1999 Architecture and Circuits: 1 Interconnect-Oriented Architecture and Circuits William J. Dally Computer Systems Laboratory Stanford University February 12, 1998

2 On-chip wires
Minimum-width wire in a 0.35 µm process (figure scale: 0.0mm, 2.5mm, 5.0mm, 7.5mm, 10.0mm)

3 On-chip wires are getting slower
Wire delay over length y: $t_w = RCy^2$. Scaling feature size from $x_1$ to $x_2$ by a factor $s$ (ratios shown for $s = 0.5$):
–$x_2 = s\,x_1$ (0.5x)
–$R_2 = R_1/s^2$ (4x)
–$C_2 = C_1$ (1x)
–$t_{w2} = R_2 C_2 y^2 = t_{w1}/s^2$ (4x)
–$t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$ (8x)
–$v = 0.5\,(t_g RC)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
–$v t_g = 0.5\,(t_g/RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
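
As a rough check of the ratios on this slide, the scaling relations can be evaluated numerically. This is a minimal sketch and not part of the original talk: the function name `wire_scaling` and the dictionary keys are made up for illustration, and it assumes gate delay scales as t_g2 = s * t_g1, which is what the slide's 8x figure implies.

```python
def wire_scaling(s: float) -> dict:
    """Return new/old ratios for a linear shrink factor s, using the slide's relations."""
    r = 1.0 / s**2   # resistance per unit length: R2 = R1 / s^2
    c = 1.0          # capacitance per unit length: C2 = C1
    t_gate = s       # assumption implied by the slide: t_g2 = s * t_g1
    t_wire = r * c   # delay of a fixed-length wire: t_w = R * C * y^2, y unchanged
    return {
        "wire_delay": t_wire,                        # t_w2 / t_w1
        "wire_delay_in_gates": t_wire / t_gate,      # (t_w2 / t_g2) / (t_w1 / t_g1)
        "velocity": (t_gate * r * c) ** -0.5,        # v2 / v1, from v = 0.5 (t_g R C)^(-1/2)
        "reach_per_gate": (t_gate / (r * c)) ** 0.5, # (v2 t_g2) / (v1 t_g1)
    }


ratios = wire_scaling(0.5)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

With s = 0.5 this gives 4x wire delay, 8x wire delay measured in gate delays, about 0.71x signal velocity, and about 0.35x distance covered per gate delay, in line with the rounded figures on the slide.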

4 Technology scaling makes communication the scarce resource
1998: 0.35 µm, 64Mb DRAM, 16 64b FP Proc, 400MHz; 18mm, 12,000 tracks, 1 clock, repeaters every 3mm
2008: 0.10 µm, 4Gb DRAM, 1K 64b FP Proc, 2.5GHz; 32mm, 90,000 tracks, 20 clocks, repeaters every 0.4mm
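
The two design points on this slide can be compared with a little arithmetic. The sketch below is illustrative only: it assumes the 18mm and 32mm figures are chip edge lengths and that the clock counts are the time for a signal to cross that edge, which the slide suggests but does not state. The names `DesignPoint`, `repeater_segments`, `crossing_ns`, and `mm_per_clock` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    year: int
    clock_mhz: float
    edge_um: int            # assumed chip edge length, in micrometres
    repeater_pitch_um: int  # spacing between repeaters, in micrometres
    crossing_clocks: int    # assumed clocks for a signal to cross the edge

    def repeater_segments(self) -> int:
        # Number of repeater-to-repeater segments along one edge.
        return self.edge_um // self.repeater_pitch_um

    def crossing_ns(self) -> float:
        # Clock period in ns is 1000 / MHz.
        return self.crossing_clocks * 1000.0 / self.clock_mhz

    def mm_per_clock(self) -> float:
        return self.edge_um / 1000.0 / self.crossing_clocks


chip_1998 = DesignPoint(1998, 400.0, 18_000, 3_000, 1)
chip_2008 = DesignPoint(2008, 2500.0, 32_000, 400, 20)

for chip in (chip_1998, chip_2008):
    print(chip.year, chip.repeater_segments(), chip.crossing_ns(), chip.mm_per_clock())
```

Under these assumptions a cross-chip wire goes from 6 repeater segments to 80, the crossing time grows from 2.5 ns to 8 ns, and the distance reachable in one clock drops from 18mm to 1.6mm.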

5 Architecture Must Evolve to Fit the Landscape (20 clocks, 90,000 tracks)
Local, parallel operations: high bandwidth, low latency & low power
Global operations: low bandwidth, high latency & high power

6 Architecture Today Depends on Fast Global Communication
All instructions issued from single global instruction unit
All data passes through global register file
This won’t work when global accesses cost 20 clocks of latency
[Figure labels: Regs, I-Unit]

7 Tomorrow’s Architectures must Exploit Locality and Expose Communication
Multiple elements (clusters) with
–local instruction dispatch
–local register files
–co-located with arithmetic elements
Explicit communication between elements through a switch or network
Fast synchronization between instruction units
[Figure labels: four Regs/IU pairs, Switch]

8 Multi-ALU Processor Chip

9 Crafted-Cell Design
[Figure: comparison of standard-cell, full-custom, and crafted-cell design. Labels as extracted: 1x, 1.64x, 5.25x; Standard-Cell, Full-Custom, Crafted-Cell; 80, 7, and 17 different cells; designs IRRDP and ADDSUB; 2.23x, 2.7x, 1.11x, 1.17x, 1.0x; axes Performance and Area]
Results courtesy of Andrew Chang

10 Interconnect: repeaters with switching
Need repeaters every 1mm or less
Easy to insert switching
–zero-cost reconfiguration
Can’t afford decision time
–static routing: fixed or regular pattern
–source routing: on-demand; requires arbitration and fanout
Queuing and flow-control
Pipelining control
[Figure labels: 1mm, Arb, LUT]
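
One way to picture source routing is that the sender computes the whole path up front, so each switching repeater only reads the next selector instead of making a routing decision. The sketch below is a toy model under stated assumptions: the grid coordinates, the port names `N`, `S`, `E`, `W`, and the function `source_route` are all hypothetical, since the slide gives no encoding, and arbitration, queuing, and flow control are left out.

```python
# Hypothetical port names mapped to grid steps; not taken from the slides.
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def source_route(start, route):
    """Follow a route chosen at the source; each switch consumes the head selector."""
    x, y = start
    path = [(x, y)]
    remaining = list(route)
    while remaining:
        port = remaining.pop(0)  # the switch reads the head of the header
        dx, dy = STEP[port]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


path = source_route((0, 0), "EEN")
print(path)  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

A static route would correspond to fixing the selector sequence ahead of time for a given pattern, while an on-demand route still has to win arbitration at each switch, which this toy model does not attempt to show.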

11

12 Bandwidth Hierarchy
Provide lots of bandwidth where it's inexpensive
–short wires between ALUs
Moderate bandwidth with intermediate cost
–local RAM associated with each ALU cluster
Low bandwidth where it's expensive
–Global RAM with long wires
Very low bandwidth off chip
[Figure labels: Global on-chip RAM; Local RAM; four ALU Clusters; off chip; global 30mm; medium 4mm; local 1mm]

13 Bandwidth Hierarchy
A key problem is to match the demands of an application to the bandwidth available at each level of the hierarchy
Casting applications in a streaming model exposes much of the locality necessary to exploit the hierarchy
[Figure labels: Global on-chip RAM; Local RAM; four ALU Clusters]
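
The matching problem can be illustrated by comparing an application's demand at each level of the hierarchy with the bandwidth that level offers. The sketch below is illustrative only: every number is invented (in arbitrary words per cycle), and the level names and the helper `bottleneck` are hypothetical rather than taken from the talk.

```python
def bottleneck(capacity, demand):
    """Return (level, utilization) for the most heavily loaded level."""
    utilization = {level: demand[level] / capacity[level] for level in capacity}
    worst = max(utilization, key=utilization.get)
    return worst, utilization[worst]


# Invented capacities, in words per cycle, decreasing with distance.
capacity = {"alu_cluster": 64.0, "local_ram": 16.0, "global_ram": 4.0, "off_chip": 1.0}

# Invented demands: one mapping that leans on distant levels, one that keeps traffic local.
global_heavy = {"alu_cluster": 8.0, "local_ram": 8.0, "global_ram": 6.0, "off_chip": 3.0}
local_heavy = {"alu_cluster": 48.0, "local_ram": 8.0, "global_ram": 2.0, "off_chip": 0.5}

print(bottleneck(capacity, global_heavy))  # ('off_chip', 3.0)
print(bottleneck(capacity, local_heavy))   # ('alu_cluster', 0.75)
```

In this made-up example the first mapping oversubscribes the off-chip level by 3x, while the second keeps every level below full utilization by moving most of its traffic onto the short wires.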

14 Architecture Research Issues
Processor architecture
–configuration of ALUs: clustered vs distributed
–method for controlling ALUs: distributed control, VLIW, SIMD
–communication aware instruction sets: how to hide details while exposing communication
Memory architecture
–methods for exploiting 2D spatial locality
–communication aware cache organizations
Communication Architecture
–on-chip interconnection networks
–the use of repeaters with switching
–the use of hierarchy and selective ‘fat’ wires

15 Circuit Challenges of Slow Interconnect
The clock cycle is dominated by wire delay
–novel circuits to improve effective signal velocity
Power is largely used to drive wires
–low-swing on-chip signaling methods
–reject rather than overpower noise
It's difficult to distribute a global clock
–locally synchronous design methods
–fast synchronizers: no wait for metastable decay

16 Overdrive gives 3x improvement in RC wire latency

17 Low-Swing Overdrive Signaling
[Figure panels: 1V Swing at Source; 300mV Swing at Receiver; Recovered Signal]

18 Conclusion: Exploit, Don’t Fight, The Technology
Interconnect is rapidly dominating the delay, power, and area of ICs
Traditional architectures rely on global communication
–they are ill-suited for an interconnect-dominated technology
Emerging architectures expose communication and exploit locality
–distributed register files and instruction dispatch
–bandwidth hierarchy
Novel circuits can mitigate effects of slow wires
–overdrive, low-swing signaling, locally synchronous design

