On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.


1 On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252

2 Administrivia
© Alvin R. Lebeck 2006 CPS 220
Projects
Presentations Dec 5 & 7
Documents ~10 pages
–Good writing is important
–Progress is important
Final is Dec 11 (7pm to 10pm)

3 Multithreaded Processors
Exploit thread-level parallelism to improve performance
–Multiple program counters
Thread
–independent programs (multiprogramming)
–threads from same program

4 Denelcor HEP
General-purpose scientific computer
Organized as MP
–up to 16 processors
–each processor multithreaded
–up to 128 memory modules
–up to 4 I/O cache modules
–Three-input switches and chaotic routing

5 HEP Processor Organization
Multiple contexts (threads)
–each has its own Program Status Word (PSW)
PSWs circulate in control loop
–control and data loops pipelined 8 deep
–PSW in control loop can circulate no faster than data in data loop
–PSW at queue head fetches and starts execution of next instruction
Clock period: 100 ns
–8 PSWs in control loop => 10 MIPS
–Each thread gets 1/8 of the processor
–Maximum performance per thread => 1.25 MIPS
(And they tried to sell it as a supercomputer)
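The throughput figures on this slide follow from the clock period and the loop depth. Here is a minimal sketch of that arithmetic, assuming a simplified model in which each thread has at most one instruction in flight and the pipeline issues one instruction per clock when enough threads are present. The function name and the behaviour for thread counts other than 8 are illustrative assumptions, not taken from the slides.

```python
def hep_throughput(clock_ns, loop_depth, active_threads):
    """Return (aggregate MIPS, per-thread MIPS) for a simplified barrel pipeline.

    Assumption: a thread's PSW must travel the whole loop before that
    thread can issue again, so at most `loop_depth` threads keep the
    pipeline full.
    """
    # One instruction per clock at best: 1000 ns per microsecond / clock period.
    peak_mips = 1000.0 / clock_ns
    # Threads beyond the loop depth wait in the queue and add no throughput.
    issuing = min(active_threads, loop_depth)
    aggregate = peak_mips * issuing / loop_depth
    per_thread = aggregate / active_threads
    return aggregate, per_thread


# Slide values: 100 ns clock, 8-deep loops, 8 PSWs in the control loop.
print(hep_throughput(100, 8, 8))   # (10.0, 1.25)
# A single thread cannot go faster than one instruction per trip around the loop.
print(hep_throughput(100, 8, 1))   # (1.25, 1.25)
```

Under this model a lone thread is capped at 1.25 MIPS no matter how idle the rest of the machine is, which is the point of the slide's closing remark.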

6 Simultaneous Multithreading
Goal: use hardware resources more efficiently
–especially for superscalar processors
Assume 4-issue superscalar
Alpha 21464
[Figure: issue-slot diagram. Labels: Thread, Instruction, Horizontal Waste, Vertical Waste]

7 Operation of Simultaneous Multithreading
Standard multithreading can reduce vertical waste
Issue from multiple threads in same clock cycle
Eliminate both horizontal and vertical waste
Larger register files
[Figure: issue-slot diagrams. Labels: Thread, Instructions. Panels: Simultaneous Multithreading, Standard Multithreading]
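The difference between horizontal and vertical waste can be made concrete by counting empty issue slots. The sketch below is a toy model, not the simulator behind the slides: the `READY` trace is made up, unissued instructions do not carry over to later cycles, and "standard multithreading" is idealized as picking, each cycle, the one thread with the most ready instructions.

```python
def waste(issued, width):
    """Return (horizontal, vertical) wasted issue slots for a per-cycle trace."""
    # Vertical waste: cycles in which nothing issues lose every slot.
    vertical = sum(width for n in issued if n == 0)
    # Horizontal waste: unused slots in cycles that issue at least one instruction.
    horizontal = sum(width - n for n in issued if n > 0)
    return horizontal, vertical


def standard_mt(ready, width):
    """Issue from a single thread per cycle (the one with the most ready work)."""
    return [min(width, max(cycle)) for cycle in zip(*ready)]


def smt(ready, width):
    """Fill the issue slots of each cycle from all threads together."""
    return [min(width, sum(cycle)) for cycle in zip(*ready)]


WIDTH = 4
# READY[t][c]: instructions thread t could issue in cycle c (hypothetical trace).
READY = [[2, 0, 1, 3],
         [1, 3, 0, 2]]

print(waste(READY[0], WIDTH))                  # single thread: (6, 4)
print(waste(standard_mt(READY, WIDTH), WIDTH)) # standard MT:   (7, 0)
print(waste(smt(READY, WIDTH), WIDTH))         # SMT:           (5, 0)
```

In this trace standard multithreading removes the empty cycle but leaves partly filled cycles untouched, while SMT also fills leftover slots within a cycle; total waste drops from 10 slots to 7 to 5.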

8 Limitations of SuperScalar Architectures
Instruction Fetch
–branch prediction
–alignment of packet of instructions
Dynamic Instruction Issue
Need to identify ready instructions
Rename Table
–No compares
–Large number of ports (Operands x Width)
Issue Queue Size
–n x Q x O x W 1-bit comparators (src and dest)
–Quadratic increase in queue size with issue width
–PA-8000: 20% of die area to issue queue (56-instruction window)
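The comparator count n x Q x O x W can be evaluated directly to see the quadratic growth. The slide does not define the symbols, so the sketch below assumes n is the register-tag width in bits, Q the number of queue entries, O the source operands per instruction, and W the issue width. The 7-bit tag and the 14 entries per issue slot are hypothetical values, chosen only so that a 4-wide machine gets the 56-entry window mentioned on the slide.

```python
def issue_queue_comparators(tag_bits, queue_entries, operands, issue_width):
    """1-bit comparators needed: every operand tag in every queue entry is
    compared against every result tag broadcast in a cycle (assumed reading
    of the slide's n x Q x O x W)."""
    return tag_bits * queue_entries * operands * issue_width


def scaled_design(issue_width, entries_per_slot=14, tag_bits=7, operands=2):
    """Comparator count when the queue grows in proportion to issue width.

    entries_per_slot, tag_bits and operands are hypothetical parameters.
    """
    queue_entries = entries_per_slot * issue_width
    return issue_queue_comparators(tag_bits, queue_entries, operands, issue_width)


for width in (2, 4, 8):
    print(width, scaled_design(width))
```

Because Q is tied to W in `scaled_design`, doubling the issue width quadruples the comparator count, which is one way to read the slide's "quadratic increase" bullet.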

9 SuperScalar Limitations (Continued)
Instruction Execute
Register File
–more rename registers
–more access ports
–complexity quadratic with issue width
Bypass logic
–complexity quadratic with issue width
–wire delays
Functional Units
–replicate
–add ports to data cache (complexity adds to access time)

10 Why Single Chip MP?
Technology Push
–Benefits of wide issue are limited
–Decentralized microarchitecture: easier to build several simple fast processors than one complex processor
Application Pull
–Applications exhibit parallelism at different grains
–< 10 instructions per cycle (integer codes)
–> 40 instructions per cycle (FP loops)

11 A 6-Way SuperScalar Processor
[Figure: die floorplan, dimension label 21 mm. Blocks: Integer Unit; Floating Point Unit; Instruction Fetch; Instruction Decode & Rename; Reorder Buffer, Instruction Queues, and Out-of-Order Logic; I-Cache (32 KB); D-Cache (32 KB); TLB; L2 Cache (256 KB); External Interface; Clocking & Pads]

12 A 4 x 2 Single Chip Multiprocessor
[Figure: die floorplan, dimension labels 21 mm x 21 mm. Blocks: Processor #1 to Processor #4; Icache 1 to Icache 4; Dcache 1 to Dcache 4; L2 Communication Crossbar; L2 Cache (256 KB); External Interface; Clocking & Pads]

13 Performance Comparison

14 Summary of Performance
4 x 2 MP works well for coarse-grain apps
–How well would a Message Passing Architecture do?
–Can SUIF handle pointer-intensive codes?
For "tough" codes the 6-way does slightly better, but neither is > 60% better than 2-issue

