1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara, CA) ISSCC 2006 Instructor: Dr. S. M. Fakhraie Provided by: Nayere Ghobadi Fall 2006 Advanced VLSI Class Presentation
2 Outline Multi-core processors Cache Xeon processors Dual-Core Multi-Threaded Xeon Processor Features 16MB L3 cache Clock Generation and Distribution Voltage supplies Processor Package Front-side bus (FSB) Protection Temperature sensing Summary and conclusion
3 Multi-core Processors A multi-core processor combines two or more independent processors into a single package, often a single IC. It exhibits some form of thread-level parallelism (TLP). Diagram of an Intel Core 2 dual core processor (from)
4 Multi-core Processors Cont. Advantages: 1. Signals don’t have to travel off-chip, so cache coherency circuitry can operate at a much higher clock rate. 2. Require much less space than multi-chip designs. 3. Use slightly less power than two coupled single-core processors.
5 Multi-core Processors Cont. Disadvantages: 1. In addition to OS support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors. 2. Lower production yields and are more difficult to manage thermally.
6 Cache A temporary storage area where frequently accessed data can be stored for rapid access. If the processor finds the desired memory location in the cache, this is known as a cache hit; otherwise it is a cache miss. The proportion of accesses that result in a cache hit is known as the hit rate. Diagram of a CPU memory cache (from)
7 Cache Cont. Multi-level caches: There is a tradeoff between cache latency and hit rate. Larger caches have better hit rates but longer latency. Many computers therefore use multiple levels of cache, with small fast caches backed up by larger slower caches.
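The latency versus hit-rate tradeoff on the slide above can be illustrated with the standard average-access-time calculation. This is a minimal sketch: the function name and every latency and hit-rate value below are hypothetical and are not taken from the presentation or from the Xeon processor.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (hit_latency_cycles, hit_rate), fastest cache first.
    memory_latency: cycles to reach main memory after the last cache misses.
    """
    # Work backwards: the miss penalty of the last cache is main memory,
    # and the miss penalty of each earlier cache is the level below it.
    penalty = memory_latency
    for hit_latency, hit_rate in reversed(levels):
        penalty = hit_latency + (1.0 - hit_rate) * penalty
    return penalty


# Hypothetical numbers chosen only to show the effect of a second level.
small_only = average_access_time([(2, 0.90)], memory_latency=200)
two_level = average_access_time([(2, 0.90), (20, 0.80)], memory_latency=200)
print(small_only, two_level)
```

With these made-up numbers, adding a larger, slower second-level cache behind the small fast one cuts the average access time from roughly 22 cycles to roughly 8 cycles.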
8 Xeon Processors Xeon is Intel's brand name for its server-class PC microprocessors intended for multiple-processor machines. Xeon processors generally have more cache and support larger multiprocessor configurations than their desktop counterparts. Xeon processor and logo (from)
9 Dual-Core Multi-Threaded Xeon Processor Features Two 64b cores. Each core has: 1. Two threads. 2. A unified 1MB L2 cache. 16MB unified L3 cache. A simple direct interface between the cores and the front-side bus (FSB) minimizes: 1. L3 cache latency. 2. External bus latency. Block diagram (from)
10 Dual-Core Multi-Threaded Xeon Processor Features Cont. Caching FSB controller for handling: 1. Core arbitration. 2. L3 cache accesses. 3. External bus requests. The processor die is 435mm² with 1.328B transistors. Operates at more than 3.0GHz from a 1.25V core supply. Die micrograph (from)
11 Dual-Core Multi-Threaded Xeon Processor Features Cont. The worst-case power dissipation is 165W (power dissipation on a typical server workload is 110W). 65nm process technology. Eight copper interconnect layers. Low-k carbon-doped oxide (k=2.9) inter-level dielectric. 65nm process technology summary (from)
12 16MB L3 Cache 6T SRAM Cell. Read: precharge both bitlines high; raise the wordline; one of the two bitlines will be pulled down by the cell. Write: drive one bitline high and the other low; raise the wordline; the bitlines overpower the cell with the new value. 6T memory cell (from)
13 16MB L3 Cache Cont. 256 data sub-arrays (64kB each). Each data sub-array stores 32 bits. 32 redundancy sub-arrays (68kB each). Each redundancy sub-array stores 34 bits. The cache is composed of 6T memory cells, each 0.624µm² in size. The physical address is 40b wide. To reduce active power, only 0.8% of all array blocks are powered up for each cache access. L3 cache block (from)
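The sub-array counts on the slide above can be checked with a few lines of arithmetic. This sketch assumes that kB means 1024 bytes; the variable names are illustrative only.

```python
KB = 1024
MB = 1024 * KB

DATA_SUBARRAYS = 256
DATA_SUBARRAY_BYTES = 64 * KB
REDUNDANCY_SUBARRAYS = 32
REDUNDANCY_SUBARRAY_BYTES = 68 * KB
PHYSICAL_ADDRESS_BITS = 40

# 256 data sub-arrays of 64kB each give the advertised L3 capacity.
data_bytes = DATA_SUBARRAYS * DATA_SUBARRAY_BYTES

# The redundancy sub-arrays are additional storage on top of the 16MB.
redundancy_bytes = REDUNDANCY_SUBARRAYS * REDUNDANCY_SUBARRAY_BYTES

# A 40-bit physical address can name 2**40 distinct byte locations.
addressable_bytes = 2 ** PHYSICAL_ADDRESS_BITS

print(data_bytes // MB, "MB of data sub-arrays")
print(redundancy_bytes // KB, "kB of redundancy sub-arrays")
print(addressable_bytes // (1024 * MB * 1024), "TB addressable")
```

The data sub-arrays account for exactly 16MB, which matches the L3 cache size quoted throughout the presentation.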
14 16MB L3 Cache Cont. Sleep circuit. Active mode: virtual Vss = Vss; full voltage swing. Sleep mode: virtual Vss = 250mV; reduces leakage by 2X. Shut-off mode: the NMOS shut-off device is turned off; virtual Vss = Vcc/2; reduces leakage by 4X. L3 cache sleep circuit and shut-off mode (from)
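A rough way to see what the sleep and shut-off modes buy is to weight each mode's leakage by the fraction of the array in that mode. This is only a sketch under two assumptions that the slides do not state: the 2X and 4X reductions are taken relative to active-mode leakage, and the fractions passed in (including the reuse of the 0.8% active figure) are hypothetical inputs.

```python
# Leakage of each mode relative to active mode (assumed baseline = 1.0).
LEAKAGE_FACTOR = {"active": 1.0, "sleep": 1.0 / 2.0, "shutoff": 1.0 / 4.0}


def relative_leakage(fractions):
    """Array leakage relative to an always-active array.

    fractions: dict mapping mode name to the fraction of the array in it.
    """
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("mode fractions must sum to 1")
    return sum(LEAKAGE_FACTOR[mode] * frac for mode, frac in fractions.items())


# Hypothetical split: 0.8% of the array active, the rest asleep.
mostly_asleep = relative_leakage({"active": 0.008, "sleep": 0.992})

# Hypothetical split: the whole array shut off.
all_shut_off = relative_leakage({"shutoff": 1.0})

print(mostly_asleep, all_shut_off)
```

Under these assumptions, keeping nearly all of the array in sleep mode brings leakage close to half of the always-active value, and shut-off mode brings it to a quarter.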
15 Clock Generation and Distribution The critical clocking features of this processor are: 1. Multiple clock domains with different frequencies. 2. Dedicated core and uncore voltage domains. There are separate PLLs and clock distribution trees for each core and its associated L2 cache, and a third PLL for the uncore half-frequency clock. De-skew circuits controlled by on-die fuses reduce the uncore clock skew to less than 11ps.
16 Clock Generation and Distribution Cont. System clock (BCLK) = 200MHz. Core clock (MCLK) = BCLK×N. MCLK can be more than 3.0GHz at a 1.25V core supply voltage (Vcore). Uncore clock (SCLK) = MCLK/2, using a separate uncore voltage supply (Vcache). FSB clock (ZCLK) = BCLK×4 (quad pumping). Clock distribution map (from)
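The clock relationships on the slide above can be written out as a short calculation. In this sketch the function name is illustrative and the ratio N = 17 is a hypothetical value; the slides only say that MCLK can exceed 3.0GHz.

```python
BCLK_MHZ = 200  # system clock from the slide


def derive_clocks(n):
    """Return the derived clock frequencies in MHz for core ratio n."""
    mclk = BCLK_MHZ * n          # core clock: BCLK x N
    return {
        "BCLK": BCLK_MHZ,
        "MCLK": mclk,
        "SCLK": mclk / 2,        # uncore clock: half the core clock
        "ZCLK": BCLK_MHZ * 4,    # FSB clock: quad-pumped BCLK
    }


clocks = derive_clocks(17)  # hypothetical ratio
for name, mhz in clocks.items():
    print(f"{name}: {mhz} MHz")
```

Note that ZCLK does not depend on N: quad pumping the 200MHz system clock gives 800MHz, which is consistent with the 800MT/s front-side bus rate quoted elsewhere in the presentation.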
17 Voltage Supplies Three voltage supplies are used for: 1. Two cores. 2. L3 cache together with the associated control logic. 3. The FSB I/O circuits. Level shifters are used between voltage domains. A custom tool checks for presence and correct connectivity of level shifters on all signals that cross voltage domain boundaries.
18 Voltage Supplies Cont. Voltage domains and power breakdown (from)
19 Processor Package The processor uses flip-chip, or Controlled Collapse Chip Connection (C4), attachment. The processor die has 13164 C4 solder bumps. It is attached to a 12-layer (4-4-4) organic package with an integrated heat spreader. The package has 604 pins: 238 are signal pins and the remaining 366 are power and ground. The chip-level power distribution consists of a uniform M8-M7 grid synchronized with the C4 power and ground bump array.
20 Front-Side Bus (FSB) Operates at 800MT/s. A symmetric pre-driver design controls the edge rate to meet timing and signal integrity requirements: 1. The FSB output swing (VOL to VOH) is divided into six voltage levels. 2. Each level is driven by an output driver segment with a different RON value. 3. When a segment is enabled, it forms a parallel resistance with the previously enabled segments. 4. A new voltage level is generated, creating a staircase-like waveform in every transition.
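The four steps above can be sketched as a resistor-divider calculation. This is an illustrative model only: it assumes a pull-down driver working against a termination resistor tied to a termination voltage, and the termination voltage, termination resistance, and the six segment on-resistances are all hypothetical values that do not come from the presentation.

```python
def parallel(resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


def staircase_levels(v_term, r_term, segment_r_on):
    """Output voltage after enabling the first 1, 2, ..., n driver segments."""
    levels = []
    for k in range(1, len(segment_r_on) + 1):
        # Each newly enabled segment is in parallel with the earlier ones.
        r_pull_down = parallel(segment_r_on[:k])
        # Divider between the termination resistor and the pull-down network.
        levels.append(v_term * r_pull_down / (r_term + r_pull_down))
    return levels


# Hypothetical values: 1.2V termination, 50 ohm termination resistor,
# six segments with different on-resistances (ohms).
SEGMENTS = [300.0, 200.0, 150.0, 100.0, 80.0, 60.0]
levels = staircase_levels(1.2, 50.0, SEGMENTS)
for step, volts in enumerate(levels, start=1):
    print(f"segments enabled: {step}  output: {volts:.3f} V")
```

Because each added segment can only lower the combined pull-down resistance, the output moves in one direction through six distinct levels, which is the staircase shape the slide describes. Staggering the enables in time is what limits the edge rate.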
22 Protection Bit interleaving is used for adjacent cache lines to prevent a single upset event from causing multiple bit errors in the same cache line. The L3 data and tag arrays and the L2 data array have error-correction code (ECC) protection. The L2 tag has parity checking. A dynamic 32-entry cache line disable mechanism protects the L3 cache from erratic bits and infant mortality failures.
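The benefit of bit interleaving can be shown with a toy layout. This sketch is not the processor's actual array organization: the number of interleaved lines, the line width, and the burst size are hypothetical, and a "burst" here simply means a run of physically adjacent cells flipped by one upset event.

```python
def interleaved_layout(n_lines, width):
    """Physical cell order: bit 0 of every line, then bit 1 of every line, ..."""
    return [(line, bit) for bit in range(width) for line in range(n_lines)]


def sequential_layout(n_lines, width):
    """Physical cell order with no interleaving: all of line 0, then line 1, ..."""
    return [(line, bit) for line in range(n_lines) for bit in range(width)]


def worst_case_errors_per_line(layout, n_lines, burst_len):
    """Most bits any single line can lose to one burst of adjacent cells."""
    worst = 0
    for start in range(len(layout) - burst_len + 1):
        hits = [0] * n_lines
        for line, _bit in layout[start:start + burst_len]:
            hits[line] += 1
        worst = max(worst, max(hits))
    return worst


N_LINES, WIDTH, BURST = 4, 8, 4  # hypothetical sizes
with_interleaving = worst_case_errors_per_line(
    interleaved_layout(N_LINES, WIDTH), N_LINES, BURST)
without_interleaving = worst_case_errors_per_line(
    sequential_layout(N_LINES, WIDTH), N_LINES, BURST)
print(with_interleaving, without_interleaving)
```

With interleaving, a burst no wider than the number of interleaved lines corrupts at most one bit per line, which a single-error-correcting code can repair. Without it, the same burst can land entirely inside one line.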
23 Temperature Sensing Three diodes for temperature sensing: One in each core; these are routed to an on-package temperature-monitor chip and provide temperature data to the system for fan speed control. One between the two cores; it is routed to pins for system use. A temperature sensor near the hot spot in each core provides a digital temperature readout that is used in conjunction with operating-system power-state requests to make informed throttle and boost decisions.
24 Summary and Conclusion Dual-core multi-threaded Xeon processor in 65nm process technology. The processor is flip-chip (C4). Has two 64b cores. Each core has two threads and a unified 1MB L2 cache. Has a 16MB unified L3 cache. Operates at more than 3.0GHz from a 1.25V core supply. Three voltage supplies. The processor FSB operates at 800MT/s.
25 References S. Rusu, S. Tam, “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p. 118, 2006; S. Tam, J. Leung, “Clock Generation and Distribution of a Dual-Core Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p. 382, 2006; “Dual-Core Intel® Xeon® Processor 7100 Series Datasheet”, Intel Corporation, September 2006; S. Tam, et al., “Clock Generation and Distribution for the Third Generation Itanium® Processor”, Symp. VLSI Circuits, pp. 9-12, June 2003; N. H. E. Weste, D. Harris, “CMOS VLSI Design”, Pearson Education Inc., 2005; Wikipedia, The Free Encyclopedia. Available: http://en.wikipedia.org/