Computer Structure 2012 – Uncore 1 Computer Structure The Uncore.

1 Computer Structure 2012 – Uncore 1 Computer Structure The Uncore

2 Computer Structure 2012 – Uncore 2 2nd Generation Intel® Core™
 Integrated Memory Controller – 2ch DDR3
 High Bandwidth Last Level Cache
 Next Generation Graphics and Media
 Next Generation Intel® Turbo Boost Technology
 Intel® Hyper-Threading Technology: 4 Cores / 8 Threads or 2 Cores / 4 Threads
 Integrates CPU, Graphics, MC, PCI Express* on single chip
 Embedded DisplayPort
 Substantial performance improvement
 Intel® Advanced Vector Extension (Intel® AVX)
 High BW / low-latency core/GFX interconnect
 Discrete Graphics Support: 1x16 or 2x8
Foil taken from IDF 2011
[Diagram: Graphics, four Core + LLC slice pairs, and the System Agent (Display, DMI, PCI Express*, IMC); external labels: 2ch DDR3, ×16 PCIe, PCH]

3 Computer Structure 2012 – Uncore 3 3rd Generation Intel® Core™
 22nm process
 Quad core die, with Intel HD Graphics 4000
 1.4 Billion transistors
 Die size: 160 mm²

4 Computer Structure 2012 – Uncore 4 The Uncore Subsystem
 The SoC design provides a high-bandwidth bi-directional ring bus
  – Connects the IA cores and the various uncore subsystems
 The uncore subsystem includes
  – A system agent
  – The graphics unit (GT)
  – The last level cache (LLC)
 In the Intel Xeon Processor E5 Family
  – No graphics unit (GT)
  – Instead it contains many more components:
     An LLC with larger capacity and snooping capabilities to support multiple processors
     Intel® QuickPath Interconnect interfaces that can support multi-socket platforms
     Power management control hardware
     A system agent capable of supporting high bandwidth traffic from memory and I/O devices
From the Optimization Manual

5 Computer Structure 2012 – Uncore 5 Scalable Ring On-die Interconnect
 Ring-based interconnect between Cores, Graphics, Last Level Cache (LLC) and System Agent domain
 Composed of 4 rings
  – 32 Byte Data ring, Request ring, Acknowledge ring and Snoop ring
  – Fully pipelined at core frequency/voltage: bandwidth, latency and power scale with cores
 Massive ring wire routing runs over the LLC with no area impact
 Access on ring always picks the shortest path to minimize latency
 Distributed arbitration; ring protocol handles coherency, ordering, and core interface
 Scalable to servers with large number of processors
High Bandwidth, Low Latency, Modular
Foil taken from IDF 2011
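
The shortest-path rule on a bi-directional ring can be sketched as follows. This is only an illustration: the numbering of ring stops and the function name `ring_hops` are assumptions, not something the foil specifies.

```python
def ring_hops(src: int, dst: int, num_stops: int) -> tuple[int, str]:
    """Return (hop count, direction) for the shorter way around a ring.

    Ring stops are assumed to be numbered 0..num_stops-1 in clockwise order.
    """
    clockwise = (dst - src) % num_stops
    counterclockwise = (src - dst) % num_stops
    if clockwise <= counterclockwise:
        return clockwise, "clockwise"
    return counterclockwise, "counterclockwise"


# Hypothetical 6-stop ring (e.g. 4 cores, graphics, system agent).
print(ring_hops(0, 5, 6))  # one hop going counterclockwise
```

With a fixed cost per hop, the worst-case distance is half the ring, which is why latency grows only slowly as stops are added.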

6 Computer Structure 2012 – Uncore 6 Last Level Cache – LLC
 The LLC consists of multiple cache slices
  – The number of slices is equal to the number of IA cores
  – Each slice contains a full cache port that can supply 32 bytes/cycle
 Each slice has a logic portion + a data array portion
  – The logic portion handles
     Data coherency
     Memory ordering
     Access to the data array portion
     LLC misses and write-back to memory
  – The data array portion stores cache lines
     May have 4/8/12/16 ways
     Corresponding to 0.5M/1M/1.5M/2M block size
 The GT sits on the same ring interconnect
  – Uses the LLC for its data operations as well
  – May in some cases compete with the cores for the LLC
From the Optimization Manual

7 Computer Structure 2012 – Uncore 7 Cache Box
 Interface block
  – Between Core/Graphics/Media and the Ring
  – Between Cache controller and the Ring
  – Implements the ring logic, arbitration, cache controller
  – Communicates with System Agent for LLC misses, external snoops, non-cacheable accesses
 Full cache pipeline in each cache box
  – Physical Addresses are hashed at the source to prevent hot spots and increase bandwidth
  – Maintains coherency and ordering for the addresses that are mapped to it
  – LLC is fully inclusive with “Core Valid Bits” – eliminates unnecessary snoops to cores
  – Per core CVB indicates if core needs to be snooped for a given cache line
 Runs at core voltage/frequency, scales with Cores
Distributed coherency & ordering; Scalable Bandwidth, Latency & Power
Foil taken from IDF 2011
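
How Core Valid Bits cut down on snoops can be sketched with a per-line bitmask. The encoding below (bit c set means core c may hold the line) and the name `cores_to_snoop` are assumptions for illustration; the foil does not describe the actual hardware layout.

```python
def cores_to_snoop(core_valid_bits: int, requester: int, num_cores: int) -> list[int]:
    """List the cores that must be snooped for a line hit in the inclusive LLC.

    Assumed encoding: bit c of core_valid_bits is set when core c may hold
    the line in its L1/L2. The requesting core never snoops itself.
    """
    return [
        core
        for core in range(num_cores)
        if core != requester and (core_valid_bits >> core) & 1
    ]


# Line may be cached by cores 0 and 2; core 0 requests it -> only core 2 is snooped.
print(cores_to_snoop(0b0101, requester=0, num_cores=4))
```

When no other core has its bit set, the list is empty and the LLC can answer without disturbing any core.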

8 Computer Structure 2012 – Uncore 8 Ring Interconnect and LLC
 The physical addresses of data kept in the LLC are distributed among the cache slices by a hash function
  – Addresses are uniformly distributed
  – From the cores' and the GT's view, the LLC acts as one shared cache
     With multiple ports and bandwidth that scales with the number of cores
  – The number of cache slices increases with the number of cores
     The ring and LLC are not likely to be a BW limiter to core operation
  – From the SW point of view, this does not appear as a normal N-way cache
  – The LLC hit latency varies, depending on
     The core location relative to the LLC block (how far the request needs to travel on the ring)
 All the traffic that cannot be satisfied by the LLC still travels through the cache-slice logic portion and the ring to the system agent
  – E.g., LLC misses, dirty line writeback, non-cacheable operations, and MMIO/IO operations
From the Optimization Manual
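
The idea of spreading physical addresses over slices can be sketched with a toy hash. The slides do not give the real hash function, so the XOR-fold below and the name `slice_for_address` are purely hypothetical stand-ins that show the mechanism, not the actual mapping.

```python
LINE_BYTES = 64  # cache line size used throughout the deck


def slice_for_address(phys_addr: int, num_slices: int) -> int:
    """Map a physical address to an LLC slice (hypothetical XOR-fold hash)."""
    line = phys_addr // LINE_BYTES  # all bytes of one line share a slice
    folded = 0
    while line:
        folded ^= line & 0xFFFF  # fold the line number 16 bits at a time
        line >>= 16
    return folded % num_slices


# Consecutive cache lines land on different slices in this toy hash.
print([slice_for_address(i * LINE_BYTES, 4) for i in range(4)])
```

Because every byte of a line maps to the same slice, each slice can maintain coherency and ordering for its own addresses independently.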

9 Computer Structure 2012 – Uncore 9 LLC Sharing
 LLC is shared among all Cores, Graphics and Media
  – Graphics driver controls which streams are cached/coherent
  – Any agent can access all data in the LLC, independent of who allocated the line, after memory range checks
 Controlled LLC way allocation mechanism prevents thrashing between Core/GFX
 Much higher Graphics performance, DRAM power savings, more DRAM BW available for Cores
 Multiple coherency domains
  – IA Domain (fully coherent via cross-snoops)
  – Graphics domain (Graphics virtual caches, flushed to IA domain by graphics engine)
  – Non-Coherent domain (Display data, flushed to memory by graphics engine)
Foil taken from IDF 2011

10 Computer Structure 2012 – Uncore 10 Cache Hierarchy
 The LLC is inclusive of all cache levels above it
  – Data contained in the core caches must also reside in the LLC
  – Each LLC cache line holds an indication of the cores that may have this line in their L2 and L1 caches
 Fetching data from LLC when another core has the data
  – Clean hit – data is not modified in the other core – 43 cycles
  – Dirty hit – data is modified in the other core – 60 cycles

Level            Capacity  Ways  Line size (bytes)  Write update policy  Inclusive  Latency (cycles)  Bandwidth (bytes/cycle)
L1 Data          32KB      8     64                 Write-back           -          4                 2 × 16
L1 Instruction   32KB      8     64                 N/A                  -          -                 -
L2 (Unified)     256KB     8     64                 Write-back           No         12                1 × 32
LLC              Varies          64                 Write-back           Yes                          × 32

From the Optimization Manual

25 Computer Structure 2012 – Uncore 25 Data Prefetch to L2$ and LLC
 Two HW prefetchers fetch data from memory to L2$ and LLC
  – Streamer and spatial prefetcher prefetch the data to the LLC
  – Typically data is brought also to the L2
     Unless the L2 cache is heavily loaded with missing demand requests
 Spatial Prefetcher
  – Strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk
 Streamer Prefetcher
  – Monitors read requests from the L1 caches for ascending and descending sequences of addresses
     L1 D$ requests: loads, stores, and L1 D$ HW prefetcher
     L1 I$ code fetch requests
  – When a forward or backward stream of requests is detected
     The anticipated cache lines are prefetched
     Prefetched cache lines must be in the same 4K page
From the Optimization Manual
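
The address arithmetic behind the two rules above (pair line within a 128-byte chunk, and prefetches confined to one 4K page) can be sketched as follows. The function names are illustrative assumptions; only the 64-byte line, 128-byte chunk and 4K page sizes come from the slides.

```python
LINE_BYTES = 64
PAGE_BYTES = 4096


def pair_line(addr: int) -> int:
    """Base address of the line that completes addr's line to a 128-byte aligned chunk."""
    line_base = addr & ~(LINE_BYTES - 1)  # align down to the 64-byte line
    return line_base ^ LINE_BYTES         # flip bit 6: the other half of the chunk


def same_4k_page(addr_a: int, addr_b: int) -> bool:
    """True when both addresses fall in the same 4K page."""
    return addr_a // PAGE_BYTES == addr_b // PAGE_BYTES


print(hex(pair_line(0x10C5)))        # line 0x10C0 is paired with 0x1080
print(same_4k_page(0x1FC0, 0x2000))  # next line crosses a page boundary
```

A streamer modelled this way would stop issuing prefetches as soon as `same_4k_page` turns false for the next anticipated line.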

26 Computer Structure 2012 – Uncore 26 Data Prefetch to L2$ and LLC
 Streamer Prefetcher Enhancement
  – The streamer may issue two prefetch requests on every L2 lookup
     Runs up to 20 lines ahead of the load request
  – Adjusts dynamically to the number of outstanding requests per core
     Not many outstanding requests → prefetch further ahead
     Many outstanding requests → prefetch to LLC only, and less far ahead
  – When cache lines are far ahead
     Prefetch to LLC only and not to the L2$
     Avoids replacement of useful cache lines in the L2$
  – Detects and maintains up to 32 streams of data accesses
     For each 4K byte page, can maintain one forward and one backward stream
From the Optimization Manual

27 Computer Structure 2012 – Uncore 27 Lean and Mean System Agent
 Contains PCI Express*, DMI, Memory Controller, Display Engine…
 Contains Power Control Unit
  – Programmable uController, handles all power management and reset functions in the chip
 Smart integration with the ring
  – Provides cores/Graphics/Media with high BW, low latency to DRAM/IO for best performance
  – Handles IO-to-cache coherency
 Separate voltage and frequency from ring/cores, Display integration for better battery life
 Extensive power and thermal management for PCI Express* and DDR
Smart I/O Integration
Foil taken from IDF 2011

28 Computer Structure 2012 – Uncore 28 The System Agent
 The system agent contains the following components
  – An arbiter that handles all accesses from the ring domain and from I/O (PCIe* and DMI) and routes the accesses to the right place
  – PCIe controllers that connect to external PCIe devices
     Support different configurations: x16+x4, x8+x8+x4, x8+x4+x4+x4
  – DMI controller that connects to the PCH chipset
  – Integrated display engine, Flexible Display Interconnect, and Display Port, for the internal graphics operations
  – Memory controller
 All main memory traffic is routed from the arbiter to the memory controller
  – The memory controller supports two channels of DDR
     Data rates of 1066, 1333 and 1600 MT/s (quoted as MHz in the Optimization Manual)
     8 bytes per cycle
  – Addresses are distributed between memory channels based on a local hash function that attempts to balance the load between the channels in order to achieve maximum bandwidth and minimum hotspot collisions
From the Optimization Manual
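
The peak memory bandwidth implied by these figures follows from data rate × bytes per transfer × channels. A minimal sketch, assuming the quoted data rates are transfers per second and that both channels are fully used; the function name is illustrative.

```python
def peak_bandwidth_mb_s(data_rate_mt_s: int, channels: int = 2, bytes_per_transfer: int = 8) -> int:
    """Theoretical peak bandwidth in MB/s for the given DDR data rate."""
    return data_rate_mt_s * bytes_per_transfer * channels


for rate in (1066, 1333, 1600):
    print(rate, "MT/s ->", peak_bandwidth_mb_s(rate), "MB/s")
```

These are upper bounds; refresh, bank conflicts and read/write turnarounds keep sustained bandwidth below them.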

29 Computer Structure 2012 – Uncore 29 The Memory Controller
 For best performance
  – Populate both channels with equal amounts of memory
     Preferably the exact same types of DIMMs
  – Using more ranks for the same amount of memory results in somewhat better memory bandwidth
     Since more DRAM pages can be open simultaneously
  – Use the highest supported speed DRAM, with the best DRAM timings
 The two memory channels have separate resources
  – Handle memory requests independently
  – Each memory channel contains a 32 cache-line write-data-buffer
 The memory controller contains a high-performance out-of-order scheduler
  – Attempts to maximize memory bandwidth while minimizing latency
  – Writes to the memory controller are considered completed when they are written to the write-data-buffer
  – The write-data-buffer is flushed out to main memory at a later time, not impacting write latency
From the Optimization Manual

30 Computer Structure 2012 – Uncore 30 The Memory Controller
 Partial writes are not handled efficiently on the memory controller
  – May result in read-modify-write operations on the DDR channel
     If the partial writes do not complete a full cache line in time
  – Software should avoid creating partial write transactions whenever possible and consider alternatives
     Such as buffering the partial writes into full cache line writes
 The memory controller also supports high-priority isochronous requests
  – E.g., USB isochronous and Display isochronous requests
 The high bandwidth of memory requests from the integrated display engine takes up some of the memory bandwidth
  – Impacts core access latency to some degree
From the Optimization Manual
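
Buffering partial writes into full cache-line writes can be sketched in software as below. The class `LineCombiner` and its interface are hypothetical; the sketch only illustrates the idea of holding bytes back until a whole 64-byte line is available, and it ignores flushing of lines that never fill up.

```python
LINE_BYTES = 64


class LineCombiner:
    """Collects partial writes and emits a line only once all 64 bytes are present."""

    def __init__(self) -> None:
        # line base address -> (data buffer, per-byte valid flags)
        self.pending: dict[int, tuple[bytearray, bytearray]] = {}

    def write(self, addr: int, data: bytes) -> list[tuple[int, bytes]]:
        """Record a partial write; return any lines that just became complete."""
        completed = []
        for i, value in enumerate(data):
            byte_addr = addr + i
            offset = byte_addr % LINE_BYTES
            base = byte_addr - offset
            if base not in self.pending:
                self.pending[base] = (bytearray(LINE_BYTES), bytearray(LINE_BYTES))
            buf, valid = self.pending[base]
            buf[offset] = value
            valid[offset] = 1
            if all(valid):
                completed.append((base, bytes(buf)))
                del self.pending[base]
        return completed


combiner = LineCombiner()
print(combiner.write(0x1000, b"\x01" * 32))       # half a line: nothing emitted yet
print(len(combiner.write(0x1020, b"\x02" * 32)))  # second half completes one full line
```

Only the completed lines would be handed to the actual store path, so the memory controller sees full-line writes rather than fragments.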

31 Computer Structure 2012 – Uncore 31 Integration: Optimization Opportunities
 Dynamically redistribute power between Cores & Graphics
 Tight power management control of all components, providing better granularity and deeper idle/sleep states
 Three separate power/frequency domains: System Agent (Fixed), Cores+Ring, Graphics (Variable)
 High BW Last Level Cache, shared among Cores and Graphics
  – Significant performance boost, saves memory bandwidth and power
 Integrated Memory Controller and PCI Express ports
  – Tightly integrated with Core/Graphics/LLC domain
  – Provides low latency & low power – removes intermediate busses
 Bandwidth is balanced across the whole machine, from Core/Graphics all the way to Memory Controller
 Modular uArch for optimal cost/power/performance
  – Derivative products done with minimal effort/time
Foil taken from IDF 2011

32 Computer Structure 2012 – Uncore 32 DRAM

33 Computer Structure 2012 – Uncore 33 Basic DRAM chip
 DRAM access sequence
  – Put Row on addr. bus and assert RAS# (Row Addr. Strobe) to latch Row
  – Put Column on addr. bus and assert CAS# (Column Addr. Strobe) to latch Col
  – Get data on the data bus
[Diagram: Addr feeds the Row Address Latch (RAS#) and the Column Address Latch (CAS#); the row and column address decoders select a location in the memory array; Data]

34 Computer Structure 2012 – Uncore 34 DRAM Operation
 DRAM cell consists of transistor + capacitor
  – Capacitor keeps the state; transistor guards access to the state
  – Reading cell state: raise access line AL and sense DL
     If the capacitor is charged, current flows on the data line DL
  – Writing cell state: set DL and raise AL to charge/drain the capacitor
  – Charging and draining a capacitor is not instantaneous
 Leakage current drains the capacitor even when the transistor is closed
  – DRAM cell is periodically refreshed, every 64ms
[Diagram: cell with access line AL, data line DL, transistor M and capacitor C]

35 Computer Structure 2012 – Uncore 35 DRAM Access Sequence Timing
  – Put row address on address bus and assert RAS#
  – Wait for RAS# to CAS# delay (tRCD) between asserting RAS# and CAS#
  – Put column address on address bus and assert CAS#
  – Wait for CAS latency (CL) between the time CAS# is asserted and data is ready
  – Row precharge time (tRP): time to close the current row and open a new row
[Timing diagram: RAS#, CAS#, A[0:7] (Row i, Col n, Row j) and Data (Data n), annotated with tRCD – RAS/CAS delay, CL – CAS latency, tRP – Row Precharge]
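
How the three timing parameters add up for a single read can be sketched as follows. The three-case model (row already open, no row open, different row open) is a common simplification, and the timing values in the example are made up for illustration.

```python
from typing import Optional


def access_latency(cl: int, trcd: int, trp: int,
                   open_row: Optional[int], target_row: int) -> int:
    """Cycles from command issue until data is ready, in a simplified model."""
    if open_row == target_row:
        return cl                 # row hit: only the CAS latency
    if open_row is None:
        return trcd + cl          # no open row: activate the row, then read
    return trp + trcd + cl        # row conflict: precharge, activate, read


# Illustrative timings: CL = tRCD = tRP = 3 cycles.
print(access_latency(3, 3, 3, open_row=5, target_row=5))
print(access_latency(3, 3, 3, open_row=None, target_row=5))
print(access_latency(3, 3, 3, open_row=4, target_row=5))
```

The gap between the first and last case is why keeping accesses within an open row matters so much for DRAM performance.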

36 Computer Structure 2012 – Uncore 36 DRAM controller
 DRAM controller gets address and command
  – Splits address to Row and Column
  – Generates DRAM control signals at the proper timing
 DRAM data must be periodically refreshed
  – DRAM controller performs DRAM refresh, using a refresh counter
[Diagram: DRAM address decoder, time delay generator and address mux; memory address bus split into A[20:23] (chip select), A[10:19] and A[0:9]; outputs RAS#, CAS#, R/W#; data D[0:7]]
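
The address split performed by the controller can be sketched with the bit ranges shown on the slide. Which of A[10:19] and A[0:9] carries the row and which the column is an assumption here (row in the upper field, column in the lower), as is the function name.

```python
def split_address(addr: int) -> tuple[int, int, int]:
    """Split a 24-bit memory address into (chip select, row, column).

    Assumed layout: A[0:9] = column, A[10:19] = row, A[20:23] = chip select.
    """
    column = addr & 0x3FF            # bits 0..9
    row = (addr >> 10) & 0x3FF       # bits 10..19
    chip = (addr >> 20) & 0xF        # bits 20..23
    return chip, row, column


# Build an address from known fields and split it again.
address = (5 << 20) | (687 << 10) | 222
print(split_address(address))
```

The address mux then drives the row bits onto the DRAM address pins with RAS#, followed by the column bits with CAS#.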

37 Computer Structure 2012 – Uncore 37 Improved DRAM Schemes
 Paged Mode DRAM
  – Multiple accesses to different columns from the same row
  – Saves RAS and RAS-to-CAS delay
 Extended Data Output RAM (EDO RAM)
  – A data output latch allows the next column address to overlap with the current column data
[Timing diagrams: one RAS# assertion with Row, followed by Col n, Col n+1, Col n+2 on A[0:7] and Data n, Data n+1, Data n+2 on the data bus]

38 Computer Structure 2012 – Uncore 38 Improved DRAM Schemes (cont)
 Burst DRAM
  – Generates consecutive column addresses by itself
[Timing diagram: Row and Col n on A[0:7], followed by Data n, Data n+1, Data n+2 on the data bus]

39 Computer Structure 2012 – Uncore 39 Synchronous DRAM – SDRAM
 All signals are referenced to an external clock (100MHz-200MHz)
  – Makes timing more precise with other system devices
 4 banks – multiple pages open simultaneously (one per bank)
 Command driven functionality instead of signal driven
  – ACTIVE: selects both the bank and the row to be activated
     ACTIVE to a new bank can be issued while accessing the current bank
  – READ/WRITE: select column
 Burst oriented read and write accesses
  – Successive column locations accessed in the given row
  – Burst length is programmable: 1, 2, 4, 8, and full-page
     May end full-page burst by BURST TERMINATE to get arbitrary burst length
 A user programmable Mode Register
  – CAS latency, burst length, burst type
 Auto pre-charge: may close row at last read/write in burst
 Auto refresh: internal counters generate refresh address

40 Computer Structure 2012 – Uncore 40 SDRAM Timing
 tRCD: ACTIVE to READ/WRITE gap = ⌈tRCD(MIN) / clock period⌉
 tRC: successive ACTIVE to a different row in the same bank
 tRRD: successive ACTIVE commands to different banks
[Timing diagram, BL = 1]
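
The ceiling in the tRCD formula converts a minimum time from the datasheet into a whole number of clock cycles. A minimal sketch, with illustrative values (a 20 ns minimum, and 7.5 ns or 10 ns clock periods):

```python
import math


def timing_in_cycles(t_min_ns: float, clock_period_ns: float) -> int:
    """Smallest whole number of clock cycles that covers t_min_ns."""
    return math.ceil(t_min_ns / clock_period_ns)


print(timing_in_cycles(20, 7.5))  # 133MHz clock: 2.67 rounds up to 3 cycles
print(timing_in_cycles(20, 10))   # 100MHz clock: exactly 2 cycles
```

The same rounding applies to any minimum-time parameter, so a faster clock does not shorten the delay in nanoseconds, it only changes the cycle count.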

41 Computer Structure 2012 – Uncore 41 DDR-SDRAM
 2n-prefetch architecture
  – DRAM cells are clocked at the same speed as SDR SDRAM cells
  – Internal data bus is twice the width of the external data bus
  – Data capture occurs twice per clock cycle
     Lower half of the bus sampled at clock rise
     Upper half of the bus sampled at clock fall
 Uses 2.5V (vs. 3.3V in SDRAM)
  – Reduced power consumption
[Diagram: SDRAM array with a 2n-bit internal bus (0:2n-1) split into halves 0:n-1 and n:2n-1; 200MHz clock, 400M xfer/sec]

42 Computer Structure 2012 – Uncore 42 DDR SDRAM Timing
[Timing diagram, 133MHz clock, CL=2: ACT Bank 0 Row i, then RD Bank 0 Col j after tRCD > 20ns; ACT Bank 1 Row m after tRRD > 20ns, followed by RD Bank 1 Col n; ACT Bank 0 Row l after tRC > 70ns; data j and n appear on the data bus]

43 Computer Structure 2012 – Uncore 43 DIMMs
 DIMM: Dual In-line Memory Module
  – A small circuit board that holds memory chips
 64-bit wide data path (72 bit with parity)
  – Single sided: 9 chips, each with 8 bit data bus
  – Dual sided: 18 chips, each with 4 bit data bus
  – Data BW: 64 bits on each rising and falling edge of the clock
 Other pins
  – Address – 14, RAS, CAS, chip select – 4, VDC – 17, Gnd – 18, clock – 4, serial address – 3, …

44 Computer Structure 2012 – Uncore 44 DDR Standards
 DRAM timing, measured in I/O bus cycles, specifies 3 numbers
  – CAS Latency – RAS to CAS Delay – RAS Precharge Time
 CAS latency (latency to get data in an open page) in nsec
  – CAS Latency × I/O bus cycle time
 Total BW for DDR400
  – 3200M Byte/sec = 64 bit × 2 × 200MHz / 8 (bit/byte)
  – 6400M Byte/sec for dual channel DDR SDRAM
[Table: DDR standards; columns: Standard name, Mem. clock (MHz), I/O bus clock (MHz), Cycle time (ns), Data rate (MT/s), VDDQ (V), Module name, Transfer rate (MB/s), Timing (CL-tRCD-tRP), CAS Latency (ns)]
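
Converting a CAS latency from I/O bus cycles to nanoseconds is the product given above. A minimal sketch; the CL values in the example are illustrative and not taken from the table.

```python
def cas_latency_ns(cl_cycles: float, io_clock_mhz: float) -> float:
    """CAS latency in ns = CL (in I/O bus cycles) × I/O bus cycle time."""
    cycle_time_ns = 1000.0 / io_clock_mhz
    return cl_cycles * cycle_time_ns


# Illustrative: a 200MHz I/O bus (5 ns cycle) with CL = 3 and CL = 2.5.
print(cas_latency_ns(3, 200))
print(cas_latency_ns(2.5, 200))
```

This is why a higher CL number on a faster standard does not necessarily mean a longer latency in absolute time.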

45 Computer Structure 2012 – Uncore 45 DDR2
 DDR2 doubles the bandwidth
  – 4n pre-fetch: internally read/write 4× the amount of data as the external bus
  – DDR2-533 cell works at the same freq. as a DDR266 cell or a PC133 cell
  – Prefetching increases latency
 Smaller page size: 1KB vs. 2KB
  – Reduces activation power – ACTIVATE command reads all bits in the page
 8 banks in 1Gb densities and above
  – Increases random accesses
 1.8V (vs 2.5V) operation voltage
  – Significantly lower power
[Diagram: three variants of Memory Cell Array, I/O Buffers and Data Bus]

46 Computer Structure 2012 – Uncore 46 DDR2 Standards
[Table: DDR2 standards; columns: Standard name, Mem clock (MHz), Cycle time, I/O Bus clock (MHz), Data rate (MT/s), Module name, Peak transfer rate, Timings, CAS Latency]

47 Computer Structure 2012 – Uncore 47 DDR3
 30% power consumption reduction compared to DDR2
  – 1.5V supply voltage, compared to DDR2's 1.8V
  – 90 nanometer fabrication technology
 Higher bandwidth
  – 8 bit deep prefetch buffer (vs. 4 bit in DDR2 and 2 bit in DDR)
 Transfer data rate
  – Effective clock rate of 800–1600 MHz using both rising and falling edges of a 400–800 MHz I/O clock
  – DDR2: 400–800 MHz using a 200–400 MHz I/O clock
  – DDR: 200–400 MHz based on a 100–200 MHz I/O clock
 DDR3 DIMMs
  – 240 pins, the same number as DDR2, and are the same size
  – Electrically incompatible, and have a different key notch location
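
The relation between prefetch depth and data rate across the three generations can be sketched as data rate = memory cell clock × prefetch depth. The function name and the dictionary are illustrative; the prefetch depths (2, 4, 8) come from the slides.

```python
PREFETCH_DEPTH = {"DDR": 2, "DDR2": 4, "DDR3": 8}


def data_rate_mt_s(generation: str, cell_clock_mhz: int) -> int:
    """Transfers per second (in millions) for a given memory cell clock."""
    return cell_clock_mhz * PREFETCH_DEPTH[generation]


# The same 200MHz cell clock yields very different external data rates.
for gen in ("DDR", "DDR2", "DDR3"):
    print(gen, data_rate_mt_s(gen, 200), "MT/s")
```

The DRAM cells themselves stay at roughly the same speed; each generation widens the internal fetch and speeds up the I/O bus instead.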

48 Computer Structure 2012 – Uncore 48 DDR3 Standards
[Table: DDR3 standards; columns: Standard name, Mem clock (MHz), I/O bus clock (MHz), I/O bus cycle time (ns), Data rate (MT/s), Module name, Peak transfer rate (MB/s), Timings (CL-tRCD-tRP), CAS Latency (ns)]

49 Computer Structure 2012 – Uncore 49 DDR2 vs. DDR3 Performance
The high latency of DDR3 SDRAM has a negative effect on streaming operations
Source: xbitlabs

50 Computer Structure 2012 – Uncore 50 How to Get the Most out of Memory?
 Single Channel DDR
 Dual channel DDR
  – Each DIMM pair must be the same
 Balance FSB and memory bandwidth
  – 800MHz FSB provides 800MHz × 64bit / 8 = 6.4 G Byte/sec
  – Dual Channel DDR400 SDRAM also provides 6.4 G Byte/sec
[Diagrams: CPU with L2 cache connected over the FSB (Front Side Bus) to the DRAM controller; single channel with one memory bus to the DDR DIMMs, dual channel with CH A and CH B]

51 Computer Structure 2012 – Uncore 51 How to Get the Most out of Memory?
 Each DIMM supports 4 open pages simultaneously
  – The more open pages, the better random accesses are served
  – It is better to have more DIMMs
     n DIMMs: 4n open pages
 DIMMs can be single sided or dual sided
  – Dual sided DIMMs may have a separate CS for each side
     The number of open pages is doubled (goes up to 8)
     This is not a must – dual sided DIMMs may also have a common CS for both sides, in which case there are only 4 open pages, as with single sided DIMMs

52 Computer Structure 2012 – Uncore 52 SRAM – Static RAM
 True random access
 High speed, low density, high power
 No refresh
 Address not multiplexed
 DDR SRAM
  – 2 READs or 2 WRITEs per clock
  – Common or Separate I/O
  – DDRII: 200MHz to 333MHz Operation; Density: 18/36/72Mb+
 QDR SRAM
  – Two separate DDR ports: one read and one write
  – One DDR address bus: alternating between the read address and the write address
  – QDRII: 250MHz to 333MHz Operation; Density: 18/36/72Mb+

53 Computer Structure 2012 – Uncore 53 SRAM vs. DRAM
 Random Access: access time is the same for all locations

                DRAM – Dynamic RAM            SRAM – Static RAM
Refresh         Refresh needed                No refresh needed
Address         Address muxed: row + column   Address not multiplexed
Access          Not true “Random Access”      True “Random Access”
Density         High (1 transistor/bit)       Low (6 transistors/bit)
Power           Low                           High
Speed           Slow                          Fast
Price/bit       Low                           High
Typical usage   Main memory                   Cache

54 Computer Structure 2012 – Uncore 54 Read Only Memory (ROM)
 Random Access
 Non volatile
 ROM Types
  – PROM – Programmable ROM
     Burnt once using special equipment
  – EPROM – Erasable PROM
     Can be erased by exposure to UV, and then reprogrammed
  – E²PROM – Electrically Erasable PROM
     Can be erased and reprogrammed on board
     Write time (programming) much longer than RAM
     Limited number of writes (thousands)

