1 ECE 4100/6100 (1) Multicore: Commercial Processors

2 ECE 4100/6100 (2) Some Examples
• Desktop and Server/Enterprise Space
  – Intel
  – AMD
  – SUN Microsystems
• The Embedded Space: Freescale Semiconductor

3 ECE 4100/6100 (3) Focus
• The Chip Level Architecture
  – What do we have on chip?
• The Core Architecture
  – Note the presence/absence/configuration of concepts studied earlier in class
  – Rationalize the design decisions that led to the preceding
  – What can/should we expect next?
• Building systems using multicore chips

4 ECE 4100/6100 (4) The Intel Core Duo Processor Series

5 ECE 4100/6100 (5) Intel Core Duo
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared memory
• Traditional I/O
[Figure callouts: Classic OOO (reservation stations, issue ports, schedulers, etc.); large, shared, set-associative, prefetch, etc. (L2 cache)]
Source: Intel Corp.

6 ECE 4100/6100 (6) Intel Core Duo: Vital Stats
• 151 million transistors; shared 2 MB L2 cache
• Each core has a 12-stage pipeline (Yonah)
• Low-power (less than 25 watts) dual-core microprocessor
• Supports Intel's Vanderpool virtualization technology
• EM64T (Intel x86-64 extensions) is not supported
  – Desktop market: not severe, due to the lack of 64-bit OS and software
  – Sossaman processor for servers, which is based on Yonah, also lacks EM64T support: a severe disadvantage
• Communication between the L2 cache and both execution cores is handled by an arbitration bus unit
  – Eliminates cache coherency traffic over the FSB
  – Raises the core-to-L2 latency
  – The increase in clock frequency offsets the impact
• Core processors communicate with the system chipset over a 667 MT/s front side bus (FSB), up from 533 MT/s used by the fastest Pentium M
• Intel Core Solo uses the same two-core die as the Core Duo, but features only one active core
  – Chips with one core failing quality control can still be sold
  – Core 2 Duo processors will also include the ability to disable one core to conserve power
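
As a rough worked example of the FSB figures on this slide: peak bus bandwidth is the transfer rate multiplied by the bus width. The 64-bit (8-byte) data bus width is an assumption that the slide does not state, and `fsb_peak_bandwidth_gbs` is an illustrative name, not something from the deck.

```python
def fsb_peak_bandwidth_gbs(mega_transfers_per_s: float, bus_width_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) for a given transfer rate.

    bus_width_bytes=8 assumes a 64-bit data bus (not stated on the slide).
    """
    return mega_transfers_per_s * 1e6 * bus_width_bytes / 1e9


core_duo = fsb_peak_bandwidth_gbs(667)   # Core Duo FSB, 667 MT/s
pentium_m = fsb_peak_bandwidth_gbs(533)  # fastest Pentium M FSB, 533 MT/s

print(f"Core Duo:  {core_duo:.2f} GB/s")
print(f"Pentium M: {pentium_m:.2f} GB/s")
```

These are theoretical peaks; sustained bandwidth is lower because the bus also carries addresses, snoops, and I/O traffic.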

7 ECE 4100/6100 (7) The Core™ micro-architecture Source: Ars Technica

8 ECE 4100/6100 (8) The Core Execution core Source: Ars Technica

9 ECE 4100/6100 (9) Intel Core Duo
• High memory latency due to the lack of an on-die memory controller (further aggravated by the system chipset's use of DDR2 RAM)
• Main-memory transactions have to pass through the Northbridge of the chipset
  – Higher latency compared to AMD's Turion platform
  – Weakness shared by the entire line of Pentium processors
  – L2 cache is quite effective at hiding main-memory latency
• Execution units
  – Three 64-bit integer execution units: one CIU (complex) + two SIU (simple)
  – Two FPUs
  – Poor floating-point unit (FPU) throughput
• Limited "performance per watt" improvement in single-threaded applications compared to its predecessor

10 ECE 4100/6100 (10) Core 2 Duo and Core Duo
• Very similar architectures
• Bump in the processor speed
• Increase in Level 2 cache (2 MB to 4 MB)
• Both chips are built on a 65 nm process and support a 667 MT/s front-side bus (FSB)
• 14-stage pipeline
Source: Intel Corp.

11 ECE 4100/6100 (11) Intel® Core™ 2 Duo Processor
Process Technology: 65 nm
Number of Processor Cores: 2
L2 Cache Size (shared between 2 processor cores): Up to 4 MB
Transistor Gate Height / Gate Oxide Thickness (65 nm): 1.2 nm
Transistor Gate Length (for 65 nm Process Technology): 35 nm
Line Width: 65 nm
Number of Transistors: 291 million
Processor Die Size: 143 mm²
Average Power: <1.1 Watt

12 ECE 4100/6100 (12) Intel Core 2 Duo Source: Hard Core Hardware

13 ECE 4100/6100 (13) Wide Dynamic Execution Source: Bit Tech

14 ECE 4100/6100 (14) Wide Dynamic Execution Source: Bit Tech

15 ECE 4100/6100 (15) Wide Dynamic Execution
• Pipeline width of 4 instructions per core (Pentium M and Pentium 4 NetBurst have 3)
  – Delivery of more instructions per clock cycle
• Pipeline depth of 14 stages vs. 31 in the Pentium 4 Prescott
  – Compromise between efficient execution of short instructions and long instructions
• Ops fusion
  – Less work for the processor pipeline to run
  – Micro-op fusion combines micro-ops derived from the same x86 instruction (for example, a load and the operation that consumes it) so that they are tracked as one
  – Macro-op fusion works on the x86 instructions themselves, not just their micro-op derivatives (for example, a compare followed by a conditional jump becomes a single micro-op)
• Instruction loads and micro-ops can be reduced by approximately 15% and 10%, respectively
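
To make the idea of fusing x86 instructions concrete, here is a toy sketch of macro-op fusion over a list of assembly strings. It is only an illustration: the function name `macro_fuse`, the set of fusible mnemonics, and the rule "any `j*` mnemonic except `jmp` is a conditional jump" are all assumptions, and real hardware applies further restrictions that are not modeled here.

```python
def macro_fuse(instructions):
    """Fuse each compare/test that is immediately followed by a conditional jump.

    Toy model: returns a new list in which every fused pair occupies one slot.
    """
    compares = {"cmp", "test"}  # assumed set of fusible first instructions
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i].split()[0]
        nxt = instructions[i + 1].split()[0] if i + 1 < len(instructions) else ""
        # Model a conditional jump as any mnemonic starting with "j" except "jmp".
        if op in compares and nxt.startswith("j") and nxt != "jmp":
            fused.append(instructions[i] + " ; " + instructions[i + 1])
            i += 2
        else:
            fused.append(instructions[i])
            i += 1
    return fused


program = ["mov eax, [mem1]", "cmp eax, ebx", "jne target", "add ecx, 1"]
fused = macro_fuse(program)
print(len(program), "instructions ->", len(fused), "decode slots")
```

The point of the sketch is only that the pair travels through the pipeline as one unit, so downstream stages see fewer entries for the same work.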

16 ECE 4100/6100 (16) Intelligent Power Capability Source: Bit Tech

17 ECE 4100/6100 (17) Intelligent Power Capability
• SpeedStep technology
  – Dynamic clock speed reduction
  – Intel mobile processors include this already
  – Enhanced SpeedStep used in Core 2 Duo
• Controller that turns on sections of the processor as needed
• One core can be shut down for single-threaded applications
• Power consumption decreased by enhancements to Intel's 65 nm process node
  – Use of low-k dielectrics and strained silicon
  – Use of low-leakage and "sleep" transistors
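
Why lowering the clock saves power can be sketched with the standard first-order model for dynamic power, P proportional to C × V² × f. The model is textbook background rather than something stated on the slide, and the operating point below (80% voltage, 60% frequency) is a hypothetical value chosen only for illustration.

```python
def relative_dynamic_power(voltage_ratio: float, frequency_ratio: float) -> float:
    """Dynamic power relative to baseline, using P proportional to C * V^2 * f.

    Capacitance C is assumed unchanged, so it cancels out of the ratio.
    """
    return voltage_ratio ** 2 * frequency_ratio


# Hypothetical low-power operating point: 80% of nominal voltage, 60% of nominal clock.
scaled = relative_dynamic_power(0.8, 0.6)
print(f"Relative dynamic power: {scaled:.3f}")
```

Because voltage enters squared, lowering voltage along with frequency saves more than frequency scaling alone; leakage power is not covered by this model.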

18 ECE 4100/6100 (18) Advanced Smart Cache Source: Bit Tech

19 ECE 4100/6100 (19) Advanced Smart Cache
• Both cores share data stored in the L2 cache via an arbitration bus unit embedded in the cache
  – Dynamically allocates cache space between the two cores, minimising bus traffic by allowing both cores to access one copy of data
• Does a larger L2 cache matter?
  – Studies point out that improvements in execution time are low (2-4%) when going from 2 MB to 4 MB for most applications
Source: Bit Tech
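
The benefit of letting both cores access one copy of data can be illustrated with a small simulation. This is a sketch under stated assumptions: a fully associative LRU cache (the `LRUCache` class is hypothetical), a made-up access trace in which both cores walk the same six lines twice, and an equal total capacity of eight lines for the shared and the split configurations.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement (toy model)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)      # mark as most recently used
        else:
            self.misses += 1
            self.lines[address] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)   # evict least recently used


# Hypothetical trace: both cores read the same 6 lines, two passes each.
shared_data = list(range(6))
trace = [(core, addr) for _ in range(2) for addr in shared_data for core in (0, 1)]

shared = LRUCache(8)                  # one 8-line cache used by both cores
private = [LRUCache(4), LRUCache(4)]  # same total capacity, split per core

for core, addr in trace:
    shared.access(addr)
    private[core].access(addr)

shared_misses = shared.misses
private_misses = private[0].misses + private[1].misses
print("shared misses:", shared_misses, "private misses:", private_misses)
```

In the shared configuration the six lines are stored once and fit; in the split configuration each core needs its own copy and the six-line working set no longer fits in four lines. Real workloads share far less data than this contrived trace.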

20 ECE 4100/6100 (20) Smart Memory Access Source: Bit Tech

21 ECE 4100/6100 (21) Smart Memory Access
• Improved prefetch units
• Memory disambiguation
  – Allows re-ordering instructions more efficiently
[Figures: memory aliasing; execution without memory disambiguation; execution with and without memory disambiguation]
Source: Ars Technica. Example from http://arstechnica.com/articles/paedia/cpu/core.ars/8
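
The aliasing question behind memory disambiguation can be sketched as follows: a load may be moved ahead of earlier stores only if none of them writes the address it reads. This sketch assumes all addresses are already known, which is a simplification; the hardware has to predict this before store addresses are computed and recover when the prediction is wrong. The function name and the example window are hypothetical.

```python
def loads_safe_to_hoist(ops):
    """Return indices of loads that do not alias any earlier store in the window.

    ops is a list of (kind, address) tuples in program order, where kind is
    "store" or "load". Addresses are assumed known here, unlike in hardware.
    """
    earlier_store_addresses = set()
    safe = []
    for index, (kind, address) in enumerate(ops):
        if kind == "store":
            earlier_store_addresses.add(address)
        elif kind == "load" and address not in earlier_store_addresses:
            safe.append(index)
    return safe


window = [("store", 0x100), ("load", 0x200), ("store", 0x300), ("load", 0x100)]
print(loads_safe_to_hoist(window))
```

In this window the load from 0x200 can run early, while the load from 0x100 must wait for the store to 0x100 because the two alias.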

22 ECE 4100/6100 (22) Advanced Digital Media Boost Source: Bit Tech

23 ECE 4100/6100 (23) Advanced Digital Media Boost
• Streaming SIMD Extension (SSE) instructions
  – SSE instructions are an extension of the standard x86 instruction set
  – Utilized in multimedia encoding, decoding, image manipulation and encryption
• SSE instructions execute on a full 128-bit datapath
  – Up from 64 bits, where each 128-bit operation took two passes
  – Double the SSE performance over the previous generation
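
The "double the SSE performance" claim follows from simple counting, sketched below. The model assumes single-precision (32-bit) elements, one pass per datapath-width chunk, and no other bottlenecks; `sse_passes` is an illustrative name and the array size of 1024 is arbitrary.

```python
def sse_passes(n_floats: int, datapath_bits: int, register_bits: int = 128) -> int:
    """Execution passes needed to process n_floats 32-bit values with packed SSE."""
    floats_per_register = register_bits // 32                    # 4 floats per 128-bit register
    instructions = -(-n_floats // floats_per_register)           # ceiling division
    passes_per_instruction = -(-register_bits // datapath_bits)  # 2 on a 64-bit datapath
    return instructions * passes_per_instruction


old = sse_passes(1024, datapath_bits=64)   # previous generation: 64-bit execution
new = sse_passes(1024, datapath_bits=128)  # Core microarchitecture: 128-bit execution
print(old, "passes ->", new, "passes")
```

The instruction count is the same in both cases; only the number of passes each instruction needs through the execution unit changes.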

24 ECE 4100/6100 (24) Comparison of SSE to prior processors Source: Ars Technica

25 ECE 4100/6100 (25) Intel Conroe vs. Presler
• What is the major difference?
  – Shared L2 (Conroe) versus separate caches (Presler)
[Figures: Conroe; Presler]
Source: Bit Tech

26 ECE 4100/6100 (26) Intel's Roadmap for Multicore
[Figure: roadmap chart covering 2006, 2007, and 2008 for desktop, mobile, and enterprise processors. Entries range from single-core (SC) parts with 512 KB to 2 MB of cache, through dual-core (DC) parts with 2 MB to 16 MB, to quad-core (QC) parts with 4 MB to 16 MB shared and eight-core (8C) parts with 12 MB shared on 45 nm.]
• Drivers are
  – Market segments
  – More cache
  – More cores
• 80-core processor prototype has been designed!
Source: Adapted from Tom's Hardware

27 ECE 4100/6100 (27) Intel Chipset Example Source: Extreme Tech

28 ECE 4100/6100 (28) References and Links
http://www.intel.com/products/processor/coreduo/
http://en.wikipedia.org/wiki/Intel_Core
http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1
http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/
http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop
http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/
http://www.hardcoreware.net/reviews/review-347-1.htm
http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-Duo-Merom-Notebooks/p1
http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-Duo-Conroe-E6400-E6600-E6700-X6800/p1
http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1
http://arstechnica.com/articles/paedia/cpu/core.ars/1
http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4
http://www.extremetech.com/article2/0,1697,1988794,00.asp
http://www.coreduoinfo.com/blog/about-intel-core-duo/
http://67.91.114.164/intel_c2d_info.htm
http://www.pcper.com/article.php?aid=272&type=expert

29 ECE 4100/6100 (29) AMD MultiCore Processors

30 ECE 4100/6100 (30) Dual Core AMD Opteron Source: AMD

31 ECE 4100/6100 (31) AMD Multicore (Dualcore) Opteron
• Two AMD Opteron CPU cores on a single die
  – Each has 1 MB L2 cache
• 90 nm, ~205 million transistors
  – Approximately same die size as 130 nm single-core AMD Opteron processor
• 95 watt power envelope
  – Fits into 90 nm power infrastructure
• Introduced with "K8" Revision E core in April 2005
[Figure labels: Core 0; 1-MB L2; Core 1; 1-MB L2; Northbridge]
Source: AMD

32 ECE 4100/6100 (32) Opteron Core Pipeline Source: Chip Architect

33 ECE 4100/6100 (33) AMD Opteron Processor Core Architecture
[Block diagram labels: Fetch; Branch Prediction; L1 Icache 64KB; Scan/Align/Decode; Fastpath; Microcode Engine; µops; Instruction Control Unit (72 entries); Int Decode & Rename; Res; ALU; AGU; MULT; FP Decode & Rename; 36-entry FP scheduler; FADD; FMUL; FMISC; 44-entry Load/Store Queue; L1 Dcache 64KB]
Source: The 3D shop

34 ECE 4100/6100 (34) Dual Core AMD Opteron
• AMD64 technology
  – Runs 32-bit applications and is 64-bit capable
  – Compatible with the x86 software infrastructure
  – Enables a single architecture across 32- and 64-bit environments
• Direct Connect Architecture
  – NUMA system: each processor shares its memory with other processors in the system
  – Integrated memory controller: on-die DDR2 DRAM memory controller offers memory bandwidth up to 10.7 GB/s per processor
  – HyperTransport: point-to-point interconnect that can be used to build a mesh of multiple-processor Opteron systems; scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets; 24.0 GB/s peak bandwidth per processor
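
The two bandwidth figures on this slide can be reproduced with simple arithmetic, under assumptions the slide does not state: dual-channel DDR2-667 memory with 8-byte-wide channels, and three HyperTransport links per processor at 8 GB/s each (counting both directions). Function and variable names are illustrative.

```python
def ddr_bandwidth_gbs(mega_transfers_per_s: float, channels: int = 2,
                      bytes_per_transfer: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s (10^9 bytes/s).

    Defaults assume dual-channel memory with 64-bit channels (an assumption).
    """
    return mega_transfers_per_s * 1e6 * channels * bytes_per_transfer / 1e9


memory_bw = ddr_bandwidth_gbs(667)  # assumed DDR2-667
ht_links = 3                        # assumed number of HyperTransport links
ht_link_gbs = 8.0                   # assumed aggregate GB/s per link
ht_bw = ht_links * ht_link_gbs

print(f"Memory: {memory_bw:.1f} GB/s, HyperTransport: {ht_bw:.1f} GB/s")
```

Both numbers are peak rates per processor; in a multi-socket system the memory figure scales with the number of processors because each one brings its own controller.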

35 ECE 4100/6100 (35) Dual Core AMD Opteron
• Not a simple aggregation of K8 cores
  – Integrated the cores for efficiency
• Dual-core Opteron acts very much like an SMP system
• Compatible with existing single-threaded and multi-threaded (hyperthreaded) software
• MOESI coherency protocol (O: "Owned")
  – Updates through system request interface
• SSE3 support with 10 new instructions
• Quad-core upgradeability
• Hardware-assisted AMD Virtualization
• Optimized power management
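
A simplified view of the MOESI states for a single cache line can be written as a small transition function. This is a textbook-style sketch, not AMD's implementation: the event names and the `other_sharers` flag are assumptions, and details such as write-backs and data forwarding are left out.

```python
def next_state(state: str, event: str, other_sharers: bool = False) -> str:
    """Next MOESI state of one cache line in one cache (simplified).

    States: M (Modified), O (Owned), E (Exclusive), S (Shared), I (Invalid).
    Events: local_read, local_write, remote_read, remote_write.
    """
    if event == "local_read":
        if state == "I":
            # A miss fills the line: Shared if another cache holds it, else Exclusive.
            return "S" if other_sharers else "E"
        return state
    if event == "local_write":
        return "M"  # other copies are invalidated
    if event == "remote_read":
        # A Modified line becomes Owned: it stays dirty and supplies the data.
        return {"M": "O", "O": "O", "E": "S", "S": "S", "I": "I"}[state]
    if event == "remote_write":
        return "I"
    raise ValueError(f"unknown event: {event}")


state = "I"
for event in ["local_read", "local_write", "remote_read", "remote_write"]:
    state = next_state(state, event)
    print(event, "->", state)
```

The Owned state is what distinguishes MOESI from MESI: a dirty line can be shared with another cache without first being written back to memory.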

36 ECE 4100/6100 (36) Dual Core AMD Opteron Source: Elec Design

37 ECE 4100/6100 (37) AMD Opteron (SOI) Source: Chip Architect

38 ECE 4100/6100 (38) AMD 64 bit Core
[Figure label: 1MB L2 Cache]
• Detailed discussion of the 64-bit core architecture at:
  – http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

39 ECE 4100/6100 (39) Multiprocessor Systems using AMD Opteron
• Legacy x86 Architecture (Symmetric Multiprocessing)
  – CPUs, Memory, I/O all share a bus
  – Major bottleneck to performance
  – Faster CPUs or more cores for performance
• AMD64 Direct Connect Architecture
  – Eliminates FSB bottleneck
  – HyperTransport™ Technology interconnect for high bandwidth and low latency
  – Each CPU has its own memory
  – Each CPU can access the main memory of another processor, transparent to the programmer: different from SMP
[Figure labels: CPU; Memory Controller Hub; PCI-E Bridge; I/O Hub; USB; PCI; SRQ; Crossbar; HT; Mem. Ctrlr; 8 GB/s]
Source: AMD
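
The statement that each CPU has its own memory yet can reach another processor's memory can be sketched as an address-to-node mapping. The contiguous, equal-sized mapping and the 4 GiB per node used below are assumptions made for illustration; real systems may interleave memory across nodes, and the function names are hypothetical.

```python
GIB = 1 << 30


def home_node(physical_address: int, bytes_per_node: int) -> int:
    """Node whose memory controller owns this address (contiguous mapping assumed)."""
    return physical_address // bytes_per_node


def access_kind(cpu: int, physical_address: int, bytes_per_node: int) -> str:
    """'local' if the requesting CPU owns the address, else 'remote'.

    A remote access is forwarded over the interconnect to the owning node.
    """
    return "local" if home_node(physical_address, bytes_per_node) == cpu else "remote"


# Two nodes with a hypothetical 4 GiB of memory each.
print(access_kind(0, 1 * GIB, 4 * GIB))  # address in node 0's range
print(access_kind(0, 5 * GIB, 4 * GIB))  # address in node 1's range
```

The program issues the same load in both cases, which is what "transparent to the programmer" means; only the latency differs, and that is why operating systems try to place a thread's data on its own node.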

40 ECE 4100/6100 (40) Multiprocessor Systems using AMD Opteron Source: XBitlabs

41 ECE 4100/6100 (41) Cache coherency Source: Chip Architect

42 ECE 4100/6100 (42) AMD Athlon 64 X2 Source: AMD

43 ECE 4100/6100 (43) References and Links
http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1
http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html
http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml
http://en.wikipedia.org/wiki/Opteron
http://en.wikipedia.org/wiki/Athlon_64_X2
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_14309,00.html
http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp
http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html
http://www.extremetech.com/article2/0,1697,1675784,00.asp
http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991
http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm
http://www.nextcomputing.com/advantages/thruadv.shtml
http://arstechnica.com/news.ars/post/20060817-7535.html
http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html

44 ECE 4100/6100 (44) SUN – UltraSPARC Multicore

45 ECE 4100/6100 (45) SUN – UltraSPARC T1
• Eight cores, each 4-way threaded
• 1.2 GHz
• Cache
  – 16 KB, 4-way, 32 B line L1-I
  – 8 KB, 4-way, 16 B line L1-D
  – 3 MB internal L2 cache partitioned into four banks, with four memory controllers
  – Data moved between the L2 and the cores using an integrated crossbar switch to provide high throughput
Source: Sun

46 ECE 4100/6100 (46) SUN – UltraSPARC T1 Source: Sun

47 ECE 4100/6100 (47) SUN – UltraSPARC T1 Pipeline
• T1's integer pipeline
  – Fetch, Thread Selection, Decode, Execute, Memory Access, Writeback
Source: Sun
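
The Thread Selection stage is what lets one core interleave its four hardware threads cycle by cycle. The sketch below uses plain round-robin among ready threads, which is an assumption made for simplicity and may differ from the actual T1 selection policy; `select_thread` and the ready vector are hypothetical.

```python
def select_thread(ready, last_selected):
    """Pick the next ready thread after last_selected in round-robin order.

    ready is a list of booleans, one per hardware thread. Returns the chosen
    thread index, or None if every thread is stalled.
    """
    n = len(ready)
    for offset in range(1, n + 1):
        candidate = (last_selected + offset) % n
        if ready[candidate]:
            return candidate
    return None


# Four threads per core; thread 1 is stalled (for example, on a cache miss).
ready = [True, False, True, True]
last = 0
schedule = []
for _ in range(6):
    last = select_thread(ready, last)
    schedule.append(last)
print(schedule)
```

A stalled thread is simply skipped, so a cache miss in one thread costs that thread time but does not leave the pipeline idle while other threads have work.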

48 ECE 4100/6100 (48) SUN UltraSPARC T2 – Niagara 2 Source: Sun

49 ECE 4100/6100 (49) SUN UltraSPARC T2
• UltraSPARC T2 has 8 SPARC cores with 8 threads per core
• 8-stage integer pipeline (as opposed to 6 for T1)
• Twice the performance of T1 with a transactional workload (under the same power envelope)
• Clock frequency increased to 1.4 GHz from 1.2 GHz
• One PCI Express port (x8, 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• L2 cache size increased to 4 MB shared (8 banks, 16-way associative)
• 1 floating point unit per core
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/O, 1831 total

50 ECE 4100/6100 (50) UltraSparc T2 Core Microarchitecture Source: Realworld Tech

51 ECE 4100/6100 (51) UltraSparc T2 Memory System Source: Sun

52 ECE 4100/6100 (52) UltraSparc T2 Core Block Diagram
• IFU: Instruction Fetch Unit
  – 16 KB I$, 32 B lines, 8-way SA
  – 64-entry fully-associative ITLB
• EXU0/1: Integer Execution Units
  – 4 threads share each unit
  – Executes one integer instruction per cycle
• LSU: Load/Store Unit
  – 8 KB D$, 16 B lines, 4-way SA
  – 128-entry fully-associative DTLB
• FGU: Floating/Graphics Unit
• SPU: Stream Processing Unit
  – Cryptographic acceleration
• TLU: Trap Logic Unit
  – Updates machine state, handles exceptions and interrupts
• MMU: Memory Management Unit
  – Hardware tablewalk (HWTW)
  – 8 KB, 64 KB, 4 MB, 256 MB pages
Source: Sun
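
The cache parameters on this slide determine how an address is split into offset and index bits. The sketch below applies the usual set-associative relation, sets = size / (line size × ways), to the I$ and D$ figures given above; `cache_geometry` is an illustrative helper that assumes power-of-two sizes.

```python
def cache_geometry(size_bytes: int, line_bytes: int, ways: int) -> dict:
    """Number of sets plus offset/index bit counts for a set-associative cache.

    Assumes size, line size, and set count are all powers of two.
    """
    sets = size_bytes // (line_bytes * ways)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


icache = cache_geometry(16 * 1024, 32, 8)  # 16 KB I$, 32 B lines, 8-way
dcache = cache_geometry(8 * 1024, 16, 4)   # 8 KB D$, 16 B lines, 4-way

print("I$:", icache)
print("D$:", dcache)
```

Both caches end up with 11 bits of offset plus index, fewer than the 13 bits covered by the smallest (8 KB) page, so the set index lies entirely within the page offset.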

53 ECE 4100/6100 (53) UltraSparc T2 Core Pipeline
• 8 stages for integer operations:
  – Fetch, Cache, Pick, Decode, Execute, Memory, Bypass, Writeback
  – 3-cycle load-use
  – Memory (translation, tag/data access)
  – Bypass (late select, formatting)
• 12 stages for floating-point:
  – Fetch, Cache, Pick, Decode, Execute, FX1, FX2, FX3, FX4, FX5, FB, FW
  – 6-cycle latency for dependent FP ops
  – Longer pipeline for divide/sqrt

54 ECE 4100/6100 (54) References and Links
http://realworldtech.com/page.cfm?ArticleID=RWT090406012516&p=4
http://www.opensparc.net/cgi-bin/goto.php?w=/pubs/preszo/06/HotChips06_09_ppt_master.pdf
http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf

55 ECE 4100/6100 (55) The Embedded Multicores

56 ECE 4100/6100 (56) Freescale MPC8572 PowerQUICC III Processor Source: Freescale

57 ECE 4100/6100 (57) Freescale MPC8572 PowerQUICC III Processor
• Dual embedded e500 cores
• 36-bit physical addressing
• Double-precision floating-point
• Integrated L1/L2 cache
  – L1 cache: 32 KB data and 32 KB instruction
  – Shared L2 cache: 1 MB with ECC
  – L2 configurable as SRAM or cache; I/O transactions can be stashed into L2 cache regions
• Integrated DDR memory controller with full ECC support
• Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
• Four on-chip triple-speed Ethernet controllers
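
As a quick check on what 36-bit physical addressing buys, the addressable range is 2 raised to the number of address bits. The helper name below is illustrative, and the 32-bit comparison is included only for scale.

```python
def addressable_gib(address_bits: int) -> int:
    """Size of the byte-addressable physical space in GiB."""
    return (1 << address_bits) // (1 << 30)


print("36-bit:", addressable_gib(36), "GiB")
print("32-bit:", addressable_gib(32), "GiB")
```

The four extra address bits multiply the reachable physical memory by 16 compared with a plain 32-bit physical address.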

58 ECE 4100/6100 (58) References and Links
http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf

59 ECE 4100/6100 (59) Summary
• Multicore technology spans the product spectrum
  – The downward migration of leading edge technology continues
• Architectural principles are key to
  – Developers: extracting performance
  – Designers: improving performance
  – Marketing: understanding new markets for performance
• Research spans the spectrum of software, security, reliability, parallelism, virtualization and much more!

