On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2.

© Alvin R. Lebeck 1999, CPS 221

Slide 2: Administrivia
Today: simultaneous multithreading, MP on a chip
Project presentations (10-15 minutes)
Midterm II, Wed April 29, in class
Project write-up due Friday May 1, noon
–approximately 8 pages

Slide 3: Review: Software Coherence Protocols
Requires:
Access Control
Messaging System
–small control messages
–large bulk transfer
Programmable Processor
–Support for Protocol operations
Questions:
Kernel-based vs. User-Level?
Integration of processor with other requirements?

Slide 4: Review: Typhoon
Fully Integrated (processor, access control, NI)
[Figure: node block diagram; labels: Mem, P, $, RTLB, NI]

Slide 5: Software Fine-Grain Access Control
Low cost, can run on network of workstations
Flexibility of software protocol processing
Like SW dirty bits, but more general
For each load/store, check access bits
–if access fault, invoke fault handler
Lookup options
–table lookup (Blizzard-S)
–magic cookie (Shasta, Blizzard-COW)
Instrumentation options
–compiler
–executable editing

Slide 6: Blizzard-S
Supports Tempest Interface
Executable Editing (EEL)
Fast Table Lookup
–mask, shift, add
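The table-lookup style of access check can be sketched in a few lines. This is a minimal illustration, not Blizzard-S's actual instruction sequence: the block size, the address mask, the three access states, and the names `CheckedMemory` and `grant_on_fault` are all assumptions made for the example. The mask and shift turn an address into a block index; the "add" of the slide corresponds to indexing from the table base, which Python's indexing hides.

```python
BLOCK_SHIFT = 7                  # hypothetical: 128-byte blocks
ADDR_MASK = 0xFFFF               # hypothetical: 64 KB shared region
INVALID, READ_ONLY, READ_WRITE = 0, 1, 2   # hypothetical access states


class CheckedMemory:
    """Memory whose loads and stores are preceded by a software access check."""

    def __init__(self, fault_handler):
        size = ADDR_MASK + 1
        self.data = bytearray(size)
        self.state = bytearray(size >> BLOCK_SHIFT)  # zero-filled, so all INVALID
        self.fault_handler = fault_handler

    def block_of(self, addr):
        # mask, then shift: address -> index into the access table
        return (addr & ADDR_MASK) >> BLOCK_SHIFT

    def load(self, addr):
        if self.state[self.block_of(addr)] == INVALID:
            self.fault_handler(self, addr, "load")
        return self.data[addr & ADDR_MASK]

    def store(self, addr, value):
        if self.state[self.block_of(addr)] != READ_WRITE:
            self.fault_handler(self, addr, "store")
        self.data[addr & ADDR_MASK] = value


faults = []


def grant_on_fault(mem, addr, kind):
    """Stand-in for the coherence protocol: record the fault, then grant access."""
    block = mem.block_of(addr)
    faults.append((block, kind))
    mem.state[block] = READ_WRITE if kind == "store" else READ_ONLY


mem = CheckedMemory(grant_on_fault)
mem.load(300)        # faults: block is INVALID
mem.load(301)        # same block, now readable, no fault
mem.store(300, 42)   # faults again: needs an upgrade to READ_WRITE
print(mem.load(300), faults)
```

In a real system the handler would run protocol code and exchange messages; here it simply grants the requested access so the check path can be observed.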

Slide 7: Shasta
Executable Editing (variant of ATOM)
Magic Cookie
    ld r1, r2[300]
    if r1 == magic_cookie
        do_out_of_line_check(x);
    add r3, r1, r4
Incorporates several optimizations
–code scheduling
–batching checks (refs to same cache lines)
–3% overhead on uniprocessor code
Multiple coherence granularity
Supports Release Consistency
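The magic-cookie check for loads can be sketched as follows. This is an illustrative model under stated assumptions, not Shasta's code: the cookie value, the line size, and the names `CookieMemory`, `install_line`, and `fetch_line` are hypothetical. Invalid lines are filled with the cookie, so the common case is a plain load plus one compare; only when the loaded value equals the cookie does the slow path consult a state table, which also covers valid data that happens to equal the cookie.

```python
MAGIC_COOKIE = 0xDEADBEEF   # hypothetical flag value
LINE_WORDS = 4              # hypothetical line size, in words


class CookieMemory:
    def __init__(self, num_words):
        self.words = [MAGIC_COOKIE] * num_words            # every line starts invalid
        self.valid = [False] * (num_words // LINE_WORDS)   # used only on the slow path
        self.slow_checks = 0
        self.misses = 0

    def install_line(self, line, values):
        """Stand-in for the protocol delivering a line's data."""
        base = line * LINE_WORDS
        self.words[base:base + LINE_WORDS] = values
        self.valid[line] = True

    def load(self, addr, fetch_line):
        value = self.words[addr]             # ld r1, r2[300]
        if value == MAGIC_COOKIE:            # inline compare against the cookie
            value = self._out_of_line_check(addr, fetch_line)
        return value

    def _out_of_line_check(self, addr, fetch_line):
        self.slow_checks += 1
        line = addr // LINE_WORDS
        if not self.valid[line]:             # a real miss: fetch the line
            self.misses += 1
            self.install_line(line, fetch_line(line))
        return self.words[addr]              # otherwise the data just equals the cookie


def fetch(line):
    """Hypothetical remote fetch returning LINE_WORDS values for a line."""
    return [line * 10 + i for i in range(LINE_WORDS)]


mem = CookieMemory(8)
print(mem.load(5, fetch), mem.load(6, fetch), mem.slow_checks, mem.misses)
```

Stores cannot be checked this way, since there is no loaded value to compare; the sketch covers loads only.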

Slide 8: Future Directions
Simultaneous Multithreading
Single-Chip MP
MultiScalar Processors (Wednesday)

Slide 9: Multithreaded Processors
Exploit thread-level parallelism to improve performance
–Multiple Program Counters
Thread
–independent programs (multiprogramming)
–threads from same program

Slide 10: Denelcor HEP
General purpose scientific computer
Organized as MP
–up to 16 processors
–each processor multithreaded
–up to 128 memory modules
–up to 4 I/O cache modules
–Three-input switches and chaotic routing

Slide 11: HEP Processor Organization
Multiple contexts (threads)
–each has own Program Status Word (PSW)
PSWs circulate in control loop
–control and data loops pipelined 8 deep
–PSW in control loop can circulate no faster than data in data loop
–PSW at queue head fetches and starts execution of next instruction
Clock period: 100 ns
–8 PSWs in control loop => 10 MIPS
–Each thread gets 1/8 of the processor
–Maximum performance per thread => 1.25 MIPS
(And they tried to sell it as a supercomputer)
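The throughput figures above follow from the loop structure, and a small simulation makes the arithmetic concrete. This is a simplified sketch, not a model of the real HEP: the queue discipline and the names `simulate` and `mips` are assumptions, and the only numbers taken from the slide are the 100 ns clock and the 8-deep loop. Each PSW issues one instruction when it reaches the queue head and cannot issue again until 8 cycles later.

```python
from collections import deque

CLOCK_NS = 100        # clock period, from the slide
PIPELINE_DEPTH = 8    # control and data loops are pipelined 8 deep


def simulate(num_threads, cycles):
    """Return instructions issued per thread after the given number of cycles."""
    # Each queue entry is (thread id, earliest cycle at which its PSW may issue again).
    loop = deque((tid, 0) for tid in range(num_threads))
    issued = [0] * num_threads
    for cycle in range(cycles):
        if loop and loop[0][1] <= cycle:
            tid, _ = loop.popleft()
            issued[tid] += 1
            loop.append((tid, cycle + PIPELINE_DEPTH))
    return issued


def mips(instructions, cycles):
    """Millions of instructions per second over the simulated interval."""
    return instructions * 1000 / (cycles * CLOCK_NS)


for threads in (1, 8, 16):
    counts = simulate(threads, 800)
    print(threads, mips(sum(counts), 800), mips(counts[0], 800))
```

With 8 or more threads the loop issues one instruction every cycle, which gives the 10 MIPS aggregate; a single thread is limited to one instruction per 8 cycles, which gives the 1.25 MIPS per-thread ceiling.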

Slide 12: Simultaneous Multithreading
Goal: use hardware resources more efficiently
–especially for superscalar processors
Assume 4-issue superscalar
[Figure: issue-slot diagram; labels: Thread, Instruction, Horizontal Waste, Vertical Waste]

Slide 13: Operation of Simultaneous Multithreading
Standard multithreading can reduce vertical waste
Issue from multiple threads in same clock cycle
Eliminate both horizontal and vertical waste
[Figure: issue-slot diagrams for Standard Multithreading and Simultaneous Multithreading; labels: Thread, Instructions]
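The two kinds of waste can be counted with a small issue-slot model. This is a sketch under simplifying assumptions: the per-cycle ready counts are invented, instructions that do not issue are not carried over to later cycles, and "standard multithreading" is modeled as picking, each cycle, the single thread with the most ready instructions. Vertical waste is counted as all slots of a cycle in which nothing issues; horizontal waste is the unused slots of a cycle in which something issues.

```python
ISSUE_WIDTH = 4  # the slides assume a 4-issue superscalar


def waste(issued_per_cycle, width=ISSUE_WIDTH):
    """Return (horizontal, vertical) wasted issue slots."""
    # Vertical waste: cycles in which no instruction issues at all.
    vertical = sum(width for n in issued_per_cycle if n == 0)
    # Horizontal waste: unused slots in cycles that issue at least one instruction.
    horizontal = sum(width - n for n in issued_per_cycle if n > 0)
    return horizontal, vertical


def single_thread(ready, width=ISSUE_WIDTH):
    return [min(n, width) for n in ready]


def standard_multithreading(ready_by_thread, width=ISSUE_WIDTH):
    # One thread per cycle: pick the thread with the most ready instructions.
    return [min(max(cycle), width) for cycle in zip(*ready_by_thread)]


def simultaneous_multithreading(ready_by_thread, width=ISSUE_WIDTH):
    # Any mix of threads per cycle: fill slots until the issue width is reached.
    return [min(sum(cycle), width) for cycle in zip(*ready_by_thread)]


# Hypothetical ready-instruction counts per cycle for two threads.
THREAD_A = [2, 0, 1, 0, 3, 0]
THREAD_B = [1, 2, 0, 2, 2, 1]

for name, issued in [
    ("single thread", single_thread(THREAD_A)),
    ("standard MT", standard_multithreading([THREAD_A, THREAD_B])),
    ("SMT", simultaneous_multithreading([THREAD_A, THREAD_B])),
]:
    horizontal, vertical = waste(issued)
    print(f"{name}: issued={sum(issued)} horizontal={horizontal} vertical={vertical}")
```

In this toy data, switching threads removes the empty cycles but leaves partially filled ones, while issuing from both threads in the same cycle also fills some of the remaining slots. Two threads are not enough to fill every slot here.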

Slide 14: Limitations of SuperScalar Architectures
Instruction Fetch
–branch prediction
–alignment of packet of instructions
Dynamic Instruction Issue
Need to identify ready instructions
Rename Table
–No compares
–Large number of ports (Operands x Width)
Reorder Buffer
–n x Q x O x W 1-bit comparators (src and dest)
–Quadratic increase in queue size with issue width
–PA-8000: 20% of die area to issue queue (56-instruction window)
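The comparator product can be evaluated for a few configurations to see how quickly it grows. The slide does not define its symbols, so the reading used here is an assumption: n as bits per register tag, Q as queue entries, O as operands per instruction, and W as issue width. The parameter values, and the choice to scale the queue in proportion to the width, are hypothetical and chosen only for illustration.

```python
def comparator_count(n, q, o, w):
    """1-bit comparators per the slide's n x Q x O x W product."""
    return n * q * o * w


TAG_BITS = 7   # n: bits per register tag (assumed)
OPERANDS = 2   # O: operands per instruction (assumed)

# Assumed configurations: the queue grows in proportion to the issue width.
for width, queue in [(2, 16), (4, 32), (8, 64)]:
    count = comparator_count(TAG_BITS, queue, OPERANDS, width)
    print(f"W={width} Q={queue} comparators={count}")
```

Under these assumptions, each doubling of the issue width multiplies the comparator count by four, since both Q and W double.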

Slide 15: SuperScalar Limitations (Continued)
Instruction Execute
Register File
–more rename registers
–more access ports
–complexity quadratic with issue width
Bypass logic
–complexity quadratic with issue width
–wire delays
Functional Units
–replicate
–add ports to data cache (complexity adds to access time)

Slide 16: Why Single Chip MP?
Technology Push
–Benefits of wide issue are limited
–Decentralized microarchitecture: easier to build several simple fast processors than one complex processor
Application Pull
–Applications exhibit parallelism at different grains
–< 10 instructions per cycle (Integer codes)
–> 40 instructions per cycle (FP loops)

Slide 17: A 6-Way SuperScalar Processor
[Figure: die floorplan, dimension label 21 mm. Blocks: Instruction Fetch; Instruction Decode & Rename; Reorder Buffer, Instruction Queues, and Out-of-Order Logic; Integer Unit; Floating Point Unit; I-Cache (32 KB); D-Cache (32 KB); TLB; L2 Cache (256 KB); External Interface; Clocking & Pads]

Slide 18: A 4 x 2 Single Chip Multiprocessor
[Figure: die floorplan, dimension labels 21 mm x 21 mm. Blocks: Processor #1 to Processor #4; Icache 1 to Icache 4; Dcache 1 to Dcache 4; L2 Communication Crossbar; L2 Cache (256 KB); External Interface; Clocking & Pads]

Slide 20: Summary of Performance
4 x 2 MP works well for coarse grain apps
–How well would Message Passing Architecture do?
–Can SUIF handle pointer intensive codes?
For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue
