Fetch
Utilizes a hardware prefetcher
Supports four threads of execution
– Separate register files for each thread
– Switches threads to cover cases where the compiler is unable to schedule code without stalls, or where the prefetcher has not yet delivered new instructions (a sketch of this policy follows below)
– Inactive thread data is written to the core's local L2 cache
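A minimal C++ sketch of the switch-on-stall behavior described above. The ThreadContext fields and the round-robin selection are illustrative assumptions for exposition; the real selection logic lives in hardware, not software.

```cpp
#include <array>
#include <cstdint>

// Illustrative model of one core's four hardware thread contexts.
struct ThreadContext {
    bool stalled;   // waiting on a cache miss or an empty fetch buffer
    uint64_t pc;    // next instruction pointer for this thread
    // (each thread also has its own architectural register file in hardware)
};

// Pick the next thread to issue from: skip threads that are stalled,
// so a long-latency miss in one thread is covered by the others.
int select_next_thread(const std::array<ThreadContext, 4>& threads, int current) {
    for (int i = 1; i <= 4; ++i) {
        int candidate = (current + i) % 4;
        if (!threads[candidate].stalled) return candidate;
    }
    return current;  // all threads stalled: keep waiting on the current one
}
```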
Pipeline
Derived from the dual-issue Pentium processor, which has a 5-stage pipeline
– Short, inexpensive execution pipeline
Pairing rules for the primary and secondary instruction pipes are deterministic
– Allows compilers to perform offline analysis with a wide scope
Pipeline
All instructions can be issued on the primary pipeline
– Minimizes the combinatorial problems for a compiler
The secondary pipeline can execute a large subset of the x86 instruction set
– Small and cheap
– Power wasted by failing to dual-issue on every cycle is minimal
Pipeline
Each core has its own pipeline
– Based upon the 5-stage Pentium
– Dual-issues instructions
– In-order execution
The pipeline is shared between threads
– Hardware can switch between threads that have instructions ready to execute
Pipeline
The software-rendering pipeline is designed to minimize the number of locks and other synchronization events (a sketch of one such approach follows below)
The graphics-rendering pipeline is written with high-level languages and tools
– Enables developers to add innovative rendering capabilities
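A hedged C++ sketch of one way to keep synchronization down to a single atomic counter: cores pull screen tiles from a shared work list and render each tile entirely in private memory. The Tile type, render_tile, and the tile-queue approach are illustrative assumptions, not Larrabee's actual renderer.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical tile job: each tile is owned by exactly one core while it is
// rendered, so no locks are needed on the framebuffer region it covers.
struct Tile { int x0, y0, x1, y1; };

void render_tile(const Tile& t) { /* rasterize the triangles that touch this tile */ }

void render_frame(const std::vector<Tile>& tiles, int num_cores) {
    std::atomic<size_t> next{0};  // the only shared synchronization point
    std::vector<std::thread> workers;
    for (int c = 0; c < num_cores; ++c) {
        workers.emplace_back([&] {
            // Grab the next unclaimed tile until the list is exhausted.
            for (size_t i = next.fetch_add(1); i < tiles.size(); i = next.fetch_add(1))
                render_tile(tiles[i]);  // all other state is core-private
        });
    }
    for (auto& w : workers) w.join();
}
```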
Vector Processor Unit
16-wide vector processor unit (VPU) – see the example loop below
– Executes integer, single-precision float, and double-precision float instructions
– The VPU and its register file are approximately one-third the area of the processor core
Tradeoff
– Wider VPUs give increased computational density
– But they are harder to keep fully utilized; 16 lanes is the chosen balance point
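To make the 16-wide figure concrete, here is a hedged C++ sketch of the kind of loop a 16-lane VPU executes with one vector instruction per operation. The inner lane loop stands in for a single vector instruction; it is an illustration, not actual Larrabee intrinsics.

```cpp
#include <cstddef>

constexpr int kLanes = 16;  // one vector register holds 16 single-precision floats

// Scalar reference: one multiply-add per iteration.
void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// Vector-style version: each outer iteration corresponds to a single 16-wide
// multiply-add on the VPU (assumes n is a multiple of 16 for simplicity).
void saxpy_vector(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; i += kLanes)
        for (int lane = 0; lane < kLanes; ++lane)  // all 16 lanes execute in parallel in hardware
            y[i + lane] = a * x[i + lane] + y[i + lane];
}
```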
Vector Processor Unit
VPU instructions can be predicated by a mask register
– The mask controls which parts of a vector register or memory location are written and which are left untouched (illustrated in the sketch below)
Advantages
– Reduces branch misprediction penalties
– Gives the instruction scheduler greater freedom
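A hedged C++ sketch of predication: both sides of a branch are evaluated across all 16 lanes, and a per-lane mask selects which results are written back, so there is no data-dependent branch to mispredict. The mask representation and function name are illustrative assumptions.

```cpp
#include <cstdint>

constexpr int kLanes = 16;

// Scalar code with a data-dependent branch:
//   if (x[i] > 0) out[i] = x[i] * 2; else out[i] = -x[i];
//
// Predicated vector equivalent: compute a lane mask, then perform masked
// writes. Lanes whose mask bit is clear are left untouched by each write.
void predicated_example(const float* x, float* out) {
    std::uint16_t mask = 0;
    for (int lane = 0; lane < kLanes; ++lane)      // vector compare -> mask register
        if (x[lane] > 0.0f) mask |= std::uint16_t(1) << lane;

    for (int lane = 0; lane < kLanes; ++lane)      // masked write of the "then" side
        if (mask & (std::uint16_t(1) << lane)) out[lane] = x[lane] * 2.0f;

    for (int lane = 0; lane < kLanes; ++lane)      // masked write of the "else" side
        if (~mask & (std::uint16_t(1) << lane)) out[lane] = -x[lane];
}
```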
Number of Cores
Many-core processor
– Planned to have 24 to 48 cores
System On-Chip Components
x86 CPU cores
– Dual-issue, in-order processors that support the x86 instruction set with Larrabee extensions
– Connected to the ring network, with a high-bandwidth connection to the adjacent L2 cache subset
System On-Chip Components
L2 cache subsets
– High-bandwidth access for the adjacent CPU core (see the sizing sketch below)
– Connected directly to the ring network
– Coherent cache: uses the ring network to check coherency when allocating new cache lines
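One practical implication of per-core L2 subsets is that software wants each core's working set to fit in its local subset, so data rarely has to be pulled across the ring from other subsets. Below is a hedged C++ sketch of that sizing calculation; the 256 KB subset size is an assumption used purely for illustration, not something stated above.

```cpp
#include <cstddef>
#include <cstdio>

// Assumed size of one core's local L2 subset (illustrative figure only).
constexpr std::size_t kL2SubsetBytes = 256 * 1024;

// Pick the largest power-of-two square tile of 4-byte pixels whose input and
// output buffers together fit in the local L2 subset.
std::size_t pick_tile_dim() {
    std::size_t dim = 1;
    while (2 * (dim * 2) * (dim * 2) * sizeof(float) <= kL2SubsetBytes) dim *= 2;
    return dim;  // 128x128 pixels for the assumed 256 KB subset
}

int main() {
    std::printf("tile: %zu x %zu pixels\n", pick_tile_dim(), pick_tile_dim());
}
```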
System On-Chip Components
Ring network nodes
– Simple bi-directional routers with a 512-bit data path in each direction (1024 bits total); a routing sketch follows below
– Organized in rings of 8 to 16 cores and other devices; rings are interconnected with other rings
– All data moved between cores and fixed-function units passes through the ring network
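A hedged C++ sketch of the routing decision a bi-directional ring implies: a message travels in whichever direction reaches its destination in fewer hops. The node numbering and tie-break rule are illustrative assumptions, not Larrabee's actual routing protocol.

```cpp
#include <cstdio>

enum class Direction { Clockwise, CounterClockwise };

// On a ring of ring_size nodes, choose the direction with fewer hops from
// src to dst (ties go clockwise here, purely as an assumption).
Direction choose_direction(int src, int dst, int ring_size) {
    int cw_hops  = (dst - src + ring_size) % ring_size;
    int ccw_hops = (src - dst + ring_size) % ring_size;
    return (cw_hops <= ccw_hops) ? Direction::Clockwise : Direction::CounterClockwise;
}

int main() {
    // Example: on a 16-node ring, node 2 -> node 13 is shorter counter-clockwise (5 hops vs. 11).
    Direction d = choose_direction(2, 13, 16);
    std::printf("go %s\n", d == Direction::Clockwise ? "clockwise" : "counter-clockwise");
}
```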
System On-Chip Components
Fixed-function logic components
– Provide texture filtering and other commonly needed operations that are inefficient to run in software on the cores
– Directly connected to the ring network
– Spread among the cores to provide lower latency and load balancing on the ring network
System On-Chip Components
Memory & I/O interface
– Provides and manages communication between the ring network and off-chip devices
– Manages initial routing and tasking of the cores
Pros:
– Straightforward, not complex
– Able to deliver high bandwidth
– Great performance if memory clients need high bandwidth
Cons:
– Wastes chip area if most applications don't need high memory bandwidth
– That area could be spent elsewhere to increase performance in a different way
Multithreading Organization
– Superscalar, in-order
– Four threads of execution
– Dual issue (with a vector processing unit)
Comparison to Out-of-Order Execution
                        Out-of-order CPU    In-order (Larrabee)
# CPU cores:            2                   10
Instruction issue:      4 per clock         2 per clock
VPU per core:           4-wide SSE          16-wide
L2 cache size:          4 MB                4 MB
Single-stream:          4 per clock         2 per clock
Vector throughput:      8 per clock         160 per clock
(Vector throughput is cores × lanes: 2 × 4 = 8 vs. 10 × 16 = 160 per clock.)
Scheduling Policy
Software controlled
– More flexible than the hardware-controlled scheduling of a typical GPU
Software-Controlled Scheduling
Pros
– Flexible: can choose the scheduler to suit the application (a simple sketch follows below)
– Worst case won't be so bad (as compared to a hardware-encoded scheduling policy)
Cons
– Overhead of the scheduler takes a bite out of performance
– Programmer overhead of selecting the correct scheduler
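A hedged C++ sketch of what software-controlled scheduling can look like in practice: a small FIFO task queue that application code owns and can swap for a different policy (priorities, tile locality, and so on). The TaskQueue type and its policy are illustrative assumptions, not Larrabee's actual runtime.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Minimal FIFO task scheduler owned entirely by software. Because the policy
// lives here rather than in hardware, an application could replace FIFO order
// with priorities, locality-aware ordering, etc.
class TaskQueue {
public:
    void push(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }
    // Each core's worker loop calls this until the queue drains.
    std::optional<std::function<void()>> pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return std::nullopt;
        auto task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }
private:
    std::mutex mutex_;
    std::deque<std::function<void()>> tasks_;
};

// Worker loop run on each core/thread.
void worker(TaskQueue& queue) {
    while (auto task = queue.pop()) (*task)();
}
```

Note that the mutex in push/pop is exactly the kind of scheduler overhead the "Cons" bullet above refers to.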
Criticism
NVIDIA
– Dismissed Larrabee as "like a GPU from 2006"
– Claimed the performance projections are unrealistic
– Criticism possibly motivated by an interest in retaining market share
Possible Markets
– DreamWorks Animation
– Xbox / PlayStation
– Scientific research