Review: Multiprocessor Basics

Q1 – How do they share data?
Q2 – How do they coordinate?
Q3 – How scalable is the architecture? How many processors?

Category              Choice                   # of Proc
Communication model   Message passing          8 to 2048
                      Shared address, NUMA     8 to 256
                      Shared address, UMA      2 to 64
Physical connection   Network
                      Bus                      2 to 36

CMP: Multiprocessors On One Chip
By placing multiple processors, their memories, and the interconnection network (IN) all on one chip, the latencies of chip-to-chip communication are drastically reduced.

Example: ARM multi-chip core
[Diagram labels: configurable # of hardware interrupts; private IRQ; Interrupt Distributor; per-CPU aliased peripherals; four CPU Interfaces; configurable between 1 & 4 symmetric CPUs, each CPU with its L1$s; private peripheral bus; Snoop Control Unit; I & D 64-b bus; CCB; primary AXI R/W 64-b bus; optional AXI R/W 64-b bus.]

Notes:
Incorporates a power-management scheme, IEM, that controls power over the whole core (in the active state) using multiple voltages and frequencies.
Also has adaptive shutdown states that allow varying levels of individual core shutdown:
  a “wait for interrupt” state, where the core is still powered on but inactive, waiting for an interrupt
  a “dormant” state, where the core is shut down but the RAM retention voltage is maintained
  a “shutdown” state, where the whole core is shut down and powered off
Performance comes from parallelism rather than from clock frequency.
The L1 caches use a modified version of MESI.

Multithreading on A Chip
Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions.
Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor.
  Processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer for each thread
  The caches and buffers can be shared (although the miss rates may increase if they are not sized accordingly)
  The memory can be shared through virtual memory mechanisms
  Hardware must support efficient thread context switching
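The split between duplicated and shared state described above can be sketched as a small data structure. This is only an illustration: the class names, the 32-entry register file, and the dictionary standing in for a cache are assumptions, not details from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """State that is duplicated for every hardware thread."""
    pc: int = 0
    # 32 registers is an assumed size, chosen only for illustration.
    registers: list = field(default_factory=lambda: [0] * 32)
    instruction_buffer: list = field(default_factory=list)
    store_buffer: list = field(default_factory=list)


@dataclass
class MultithreadedCore:
    """One core: per-thread contexts plus resources shared by all threads."""
    n_threads: int
    contexts: list = field(init=False)
    shared_cache: dict = field(default_factory=dict)  # stand-in for shared caches
    active: int = 0

    def __post_init__(self):
        self.contexts = [ThreadContext() for _ in range(self.n_threads)]

    def switch_to(self, thread_id):
        # A switch only changes which context is selected. Nothing is saved
        # or restored, which is what makes the switch cheap.
        self.active = thread_id
        return self.contexts[thread_id]
```

Because each thread owns its own context, switching is just a change of index, while anything placed in `shared_cache` is visible to every thread.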

Types of Multithreading
Fine-grain – switch threads on every instruction issue
  Round-robin thread interleaving (skipping stalled threads)
  Processor must be able to switch threads on every clock cycle
  Advantage – can hide throughput losses that come from both short and long stalls
  Disadvantage – slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
Coarse-grain – switch threads only on costly stalls (e.g., L2 cache misses)
  Advantages – thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
  Pipeline must be flushed and refilled on thread switches
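The trade-off between the two policies can be seen in a toy single-issue simulation. This is a sketch under stated assumptions: each thread is a list of stall latencies (cycles the thread is blocked after that instruction issues), and the `threshold` and `switch_penalty` parameters are hypothetical knobs, not values from the slides.

```python
def fine_grain(threads):
    """Cycles until the last instruction issues, switching every cycle."""
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    cycle, last = 0, -1
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # Round-robin starting after the last issuer, skipping stalled threads.
        for k in range(1, n + 1):
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                last = t
                break
        cycle += 1
    return cycle


def coarse_grain(threads, threshold=3, switch_penalty=1):
    """Cycles until the last instruction issues, switching only on long stalls."""
    n = len(threads)
    pc = [0] * n
    ready_at = [0] * n
    cycle, cur = 0, 0
    remaining = sum(len(t) for t in threads)
    while remaining:
        finished = pc[cur] >= len(threads[cur])
        if finished or ready_at[cur] - cycle >= threshold:
            others = [(cur + k) % n for k in range(1, n)
                      if pc[(cur + k) % n] < len(threads[(cur + k) % n])]
            if others:
                cur = others[0]
                cycle += switch_penalty   # pipeline flush and refill
                continue
        if ready_at[cur] <= cycle:        # short stalls simply idle the pipe
            ready_at[cur] = cycle + 1 + threads[cur][pc[cur]]
            pc[cur] += 1
            remaining -= 1
        cycle += 1
    return cycle
```

With thread A as `[0, 3, 0]` and thread B as `[0, 0, 0]`, running the two threads back to back would take 9 cycles in this model; the fine-grain policy hides most of A's stall behind B, and the coarse-grain policy hides it too but pays for the switches.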

Multithreaded Example: Sun’s Niagara (UltraSparc T1)
Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)

                    Ultra III              Niagara
Data width          64-b                   64-b
Clock rate          1.2 GHz                1.0 GHz
Cache (I/D/L2)      32K/64K/(8M external)  16K/8K/3M
Issue rate          4 issue                1 issue
Pipe stages         14 stages              6 stages
BHT entries         16K x 2-b              None
TLB entries         128I/512D              64I/64D
Memory BW           2.4 GB/s               ~20 GB/s
Transistors         29 million             200 million
Power (max)         53 W                   <60 W

[Diagram labels: eight 4-way MT SPARC pipes; crossbar; 4-way banked L2$; memory controllers; I/O and shared functions.]

Notes:
The L1 caches support only two coherence states: valid and invalid. The L1 data cache is write-through, so there is no modified state.
The L2 cache keeps a directory of all eight L1 caches and can invalidate lines that are modified (using the MESI protocol).
The UltraSparc T1 has only one FPU, making this chip poorly suited to floating-point-heavy scientific codes.
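A minimal sketch of the coherence arrangement in the notes: write-through L1 caches with only valid and invalid lines, and an L2 that tracks which L1s hold each line so it can invalidate stale copies. The class and method names are hypothetical, addresses stand in for whole cache lines, and the choice not to allocate an L1 line on a write miss is an assumption made to keep the model small.

```python
class L2WithDirectory:
    """Toy model: per-core write-through L1s backed by an L2 with a directory."""

    def __init__(self, n_cores):
        self.data = {}       # L2 contents: addr -> value
        self.sharers = {}    # directory: addr -> set of cores with a valid L1 copy
        # Per-core L1: presence of a key means "valid", absence means "invalid".
        self.l1 = [dict() for _ in range(n_cores)]

    def read(self, core, addr):
        l1 = self.l1[core]
        if addr in l1:                       # L1 hit
            return l1[addr]
        value = self.data.get(addr, 0)       # L1 miss: fill from L2
        l1[addr] = value
        self.sharers.setdefault(addr, set()).add(core)
        return value

    def write(self, core, addr, value):
        self.data[addr] = value              # write-through: L2 is always current
        # The directory tells the L2 exactly which other L1s to invalidate.
        for other in self.sharers.get(addr, set()) - {core}:
            self.l1[other].pop(addr, None)
        if addr in self.l1[core]:
            self.l1[core][addr] = value
            self.sharers[addr] = {core}
        else:
            self.sharers[addr] = set()       # assumed: no allocation on write miss
```

Since every write goes straight through to the L2, an L1 line is never the only up-to-date copy, which is why two states are enough at the L1 level.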

Niagara Integer Pipeline
Cores are simple (single-issue, 6-stage, no branch prediction), small, and power-efficient.

Pipeline stages: Fetch, Thrd Sel, Decode, Execute, Memory, WB
[Diagram labels: I$; ITLB; Inst buf x4; PC logic x4; Thrd Sel Mux (two); Decode; RegFile x4; ALU, Mul, Shft, Div; D$; DTLB; Stbuf x4; Crossbar Interface. Inputs to the Thread Select Logic: instruction type, cache misses, traps & interrupts, resource conflicts.]

Notes:
No speculative execution. Since the pipeline is short and there are multiple threads per core, branch prediction is unnecessary. The core can hide the time required to fetch the new instruction stream on a taken branch by switching to another thread during the delay.
The register file has eight register windows, with three read ports and two write ports.
Threads are issued round-robin, but stalled threads get priority when they are ready to resume.
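The issue policy in the notes (round-robin, with priority for threads that have just come out of a stall) can be sketched as a single selection function. This is an illustrative reading of that one sentence, not the actual Niagara thread select logic; the function name and the `ready` and `resumed` flags are assumptions.

```python
def select_thread(last, ready, resumed):
    """Pick the next thread to issue from, or None if no thread is ready.

    last    -- index of the thread that issued most recently
    ready   -- ready[t] is True if thread t can issue this cycle
    resumed -- resumed[t] is True if thread t has just recovered from a stall
    """
    n = len(ready)
    # Round-robin order, starting with the thread after the last issuer.
    order = [(last + k) % n for k in range(1, n + 1)]
    # First pass: threads that just resumed from a stall get priority.
    for t in order:
        if ready[t] and resumed[t]:
            return t
    # Second pass: plain round-robin among the remaining ready threads.
    for t in order:
        if ready[t]:
            return t
    return None
```

For example, with four ready threads and thread 0 as the last issuer, the function picks thread 1, unless some other thread is flagged as just resumed.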

Simultaneous Multithreading (SMT)
A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
  Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., more than the programs have ILP)
  With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    Need separate rename tables (ROBs) for each thread
    Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
Intel’s Pentium 4 SMT is called Hyper-Threading
  Supports just two threads (doubles the architecture state)

Threading on a 4-way SS Processor Example
[Figure: issue-slot diagrams (axes: issue slots, time) for threads A, B, C, and D under coarse MT, fine MT, and SMT.]
Coarse MT takes 27 cycles to complete (assuming, optimistically, a one-cycle start-up time).
Fine MT takes 25 cycles to complete.
SMT takes 14 cycles to complete.
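The reason SMT finishes sooner is that it can fill the issue slots of a single cycle from several threads. The toy model below illustrates that idea only; it does not reproduce the cycle counts from the figure. Each thread is given as a list of per-step "bundles" (the number of independent instructions it could issue in that step, with 0 meaning a stall cycle), which is an assumed encoding.

```python
def one_thread_at_a_time(threads):
    """Baseline: threads run back to back, one bundle per cycle."""
    return sum(len(t) for t in threads)


def smt_cycles(threads, width=4):
    """Each cycle, fill up to `width` issue slots with bundles from any thread."""
    if any(b > width for t in threads for b in t):
        raise ValueError("a bundle cannot be wider than the issue width")
    pos = [0] * len(threads)   # next bundle index per thread
    cycles = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        slots = width
        for i, t in enumerate(threads):
            # A thread issues its next bundle only if it fits in the free slots.
            # A zero-size bundle (stall) always "fits" and costs no slots.
            if pos[i] < len(t) and t[pos[i]] <= slots:
                slots -= t[pos[i]]
                pos[i] += 1
        cycles += 1
    return cycles
```

With bundles `[2, 0, 1]`, `[1, 1]`, and `[3]` on a 4-wide machine, the threads overlap almost completely, and the stall in the first thread costs nothing because other threads use the slots.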

Multicore Xbox360 – “Xenon” processor
Goal: to provide game developers with a balanced and powerful platform
Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  165M transistors total
  3.2 GHz, near-POWER ISA
  2-issue, 21-stage pipeline, with 128 registers
  Weak branch prediction, supported by software hinting
  In-order instruction execution
  Narrow cores: 2 INT units, VMX units, 1 of anything else
An ATI-designed 500 MHz GPU w/ 512MB of DDR3 DRAM
  337M transistors, 10MB framebuffer
  48 pixel shader cores, each with 4 ALUs

Notes:
The 32-bit Power ISA supports 32 registers natively. Moving to 128 registers requires ‘cramming’ 7-bit register operands into the instruction encoding. How this is done has not been made public, but it’s quirky.
The branch predictor is quite simple; my guess is that it’s either a 1-bit predictor or a small 2-bit predictor. Microsoft has presented a number of papers on how software-hinted and compiler-supported branch prediction can help.
“VMX” is the colloquial term for the on-board SIMD operations, which are similar to AltiVec. This unit is custom-modified to support Direct3D data format packing and unpacking.
The GPU is twice as big as the CPU. The 10MB framebuffer is an off-chip high-speed memory dedicated to full-screen anti-aliasing (FSAA). FSAA needs 5 reads and 1 write per pixel, which quickly floods any memory subsystem. Instead, the smoothing is built into the framebuffer itself, a very fast little chip that does nothing but hold the image and smooth it out.

Xenon Diagram
[Diagram labels: Core 0, Core 1, Core 2 (each with L1D and L1I); 1MB UL2; 512MB DRAM; GPU (BIU/IO Intf, 3D Core, 10MB EDRAM, MC0, MC1); Analog Chip; Video Out; XMA Dec; SMC; DVD; HDD Port; Front USBs (2); Wireless; MU ports (2 USBs); Rear USB (1); Ethernet; IR; Audio Out; Flash; Systems Control.]

Notes:
It is important to note the way that data can be streamed from the L2 cache to the GPU. In particular, banks of the L2 can be ‘locked’ away from normal use and given over to direct FIFO access by the GPU. This lets the processor stream data into the GPU very efficiently, without clogging up the cache, while making good use of the available bandwidth.
This is especially useful in "procedural synthesis", where a template object (such as a tree) is programmatically modified slightly each time it is drawn, to make it look natural. The locked cache allows FIFO streaming of such objects to the GPU without reducing the bandwidth available to the processor and without thrashing the cache.
Also of note: running two of the three processors at full tilt is just enough to feed the GPU at full rate. The system was meant for 6 threads, four of which are graphics threads doing procedural synthesis and the like.

The PS3 “Cell” Processor Architecture
Composed of a non-SMP architecture
  234M transistors, 4 GHz
  1 Power Processing Element (PPE), 8 “Synergistic” (SIMD) PEs (SPEs)
  512KB L2$
  Massively high-bandwidth (200 GB/s) bus connects it to everything else
The PPE is strangely similar to one of the Xenon cores
  Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
The real differences lie in the SPEs (21M transistors each)
  An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256KB “scratchpad” (14M transistors)
    Direct mapped for low latency
  4 vector units per SPE, 1 of everything else (7M transistors)

Notes (marketing-related):
The PPE is so similar to the Xenon cores that, other than some specialized SIMD instructions, code is nearly compatible. (Instruction length also differs, but that's a 'minor' issue.) What really matters is that Microsoft has a real leg up on the 'mental pull' on developers: code developed on the Xenon will compile and run, with very few modifications, on the PPE of the Cell. Because the Xenon has 3 "PPE-style" processors, the primary development path can be MS-based. Once the game works on the much more comfortable Xenon architecture, the team can try to move some rough segments onto the SPEs and hope for some speedup.
The trick is that most of the development time is then spent in Xenon-native development rather than Cell-native development. This gives the dev team more time to optimize the Xenon code and, more importantly, tends to increase the amount of code that will eventually run on the PPE. A full Cell development process would start with the SPE sub-programs, but since that process isn't portable to either the Xbox or the Revolution, MS is hoping developers won't use it.
By short-circuiting the PS3 development process with such a compatible and comfortable platform, MS is hoping to reduce utilization of the SPEs and encourage over-reliance on the PPE, lowering the Cell's functional utilization.

How to make use of the SPEs
Note that this process requires 8 SPEs, and only 7 are enabled in the PS3's Cell. As such, some routines must be run on the same SPE, resulting in lower performance.
For comparison, the memory subsystem on an average desktop machine delivers around 6.5 GB/s, and the graphics memory on a high-end video card maybe 25 GB/s. The Cell's "EIB" (Element Interconnect Bus) delivers 200 GB/s, enough for the PPE and all 7 SPEs to run at 25 GB/s each. That bus is a 3-segment, 96 B/cycle bus, and it really is the backbone of the design. Without it, none of this would matter.
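The bandwidth figures above can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in these slides (200 GB/s total, 1 PPE plus 7 SPEs, 96 bytes per cycle, a 4 GHz core clock); the function names are made up for illustration, and the bus clock it derives is an inference from those numbers rather than a figure stated in the source.

```python
def per_element_bandwidth(total_gb_s, n_elements):
    """Even split of the bus bandwidth across the attached elements, in GB/s."""
    return total_gb_s / n_elements


def implied_bus_clock_ghz(total_gb_s, bytes_per_cycle):
    """(GB/s) / (bytes/cycle) = billions of cycles per second = GHz."""
    return total_gb_s / bytes_per_cycle


elements = 1 + 7                                   # 1 PPE + 7 enabled SPEs
share = per_element_bandwidth(200, elements)       # 25.0 GB/s each
bus_clock = implied_bus_clock_ghz(200, 96)         # a little over 2 GHz
```

The even split matches the 25 GB/s per element quoted above, and the implied bus clock comes out at roughly half of the 4 GHz core clock.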

What about the Software?
Makes use of a special IBM “Hypervisor”
  Like an OS for OS’s
  Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI)
Software must be specially coded to run well
  The single PPE will be quickly bogged down
  Must make use of the SPEs wherever possible
  This isn’t easy, by any standard
What about Microsoft?
  Development suite identifies which 6 threads you’re expected to run
  Four of them are DirectX-based and handled by the OS
  Only need to write two threads, functionally
