Presentation on theme: "Adam Kunk Anil John Pete Bohman.  Released by IBM in 2010 (~ February)  Successor of the POWER6  Shift from high frequency to multi-core  Implements."— Presentation transcript:

1 Adam Kunk, Anil John, Pete Bohman

2  Released by IBM in 2010 (~ February)
 Successor of the POWER6
 Shift from high frequency to multi-core
 Implements the IBM Power ISA v2.06
 Clock rate: 2.4 GHz - 4.25 GHz
 Feature size: 45 nm
 ISA: Power ISA v2.06 (RISC)
 Cores: 4, 6, or 8
 Cache: L1, L2, L3 - on chip
References: [1], [5]

3  PERCS - Productive, Easy-to-use, Reliable Computing System
 DARPA-funded contract that IBM won in order to develop the POWER7 ($244 million contract, 2006)
▪ The contract was to develop a petascale supercomputer architecture before 2011 as part of the HPCS (High Productivity Computing Systems) project
 IBM, Cray, and Sun Microsystems received HPCS grants for Phase II
 IBM was chosen for Phase III in 2006
References: [1], [2]

4  Side note:
 The Blue Waters system was meant to be the first supercomputer using PERCS technology
 However, the contract was cancelled due to cost and complexity

5 POWER processor roadmap:
POWER4/4+ (2001, 180 nm) - first dual core in the industry:
 Dual core
 Chip multiprocessing
 Distributed switch
 Shared L2
 Dynamic LPARs (32)
POWER5/5+ (2004, 130 nm / 90 nm) - hardware virtualization for Unix & Linux:
 Dual-core & quad-core module
 Enhanced scaling
 2-thread SMT
 Distributed switch +
 Core parallelism +
 FP performance +
 Memory bandwidth +
POWER6/6+ (2007, 65 nm) - fastest processor in the industry:
 Dual core
 High frequencies
 Virtualization +
 Memory subsystem +
 Altivec
 Instruction retry
 Dynamic energy management
 2-thread SMT +
 Protection keys
POWER7/7+ (2010, 45 nm / 32 nm) - most POWERful & scalable processor in the industry:
 4, 6, or 8 cores
 32 MB on-chip eDRAM
 Power-optimized cores
 Memory subsystem ++
 4-thread SMT ++
 Reliability +
 VMX & VSX
 Protection keys +
POWER8: future
References: [3]

6 Cores:
 8 intelligent cores per chip (socket); 4- and 6-core models available
 12 execution units per core
 Out-of-order execution
 4-way SMT per core, giving 32 threads per chip
 L1: 32 KB I-cache / 32 KB D-cache per core
 L2: 256 KB per core
Chip:
 32 MB intelligent L3 cache on chip (eDRAM)
[Diagram: eight cores with private L2 caches around a shared eDRAM L3, memory interfaces, and GX / SMP fabric / POWER bus links]
References: [3]

7

8  Each core implements "aggressive" out-of-order (OoO) instruction execution
 The processor has an Instruction Sequence Unit capable of dispatching up to six instructions per cycle to a set of queues
 Up to eight instructions per cycle can be issued to the instruction execution units
References: [4]

9

10  8 instructions fetched from L2 into the L1 I-cache or fetch buffer
 Balanced instruction fetch rates across active threads
 Instruction grouping:
 Instructions belonging to a group are issued together
 Groups contain only independent instructions
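The grouping rule above can be sketched in a few lines. This is a simplified, hypothetical model (register-name dependencies only, a fixed group width): an instruction that reads a register written earlier in the same group forces a new group.

```python
# Hypothetical sketch of dispatch grouping: instructions are packed into
# a group only while they are independent of earlier instructions in
# that group. The 6-wide group size and the register-only dependency
# model are illustrative assumptions.

def form_groups(instructions, group_size=6):
    """Split a stream of (dest, srcs) tuples into dispatch groups.

    A new group starts when the group is full or when an instruction
    reads a register written by an earlier instruction in the group.
    """
    groups, current, written = [], [], set()
    for dest, srcs in instructions:
        if len(current) == group_size or written & set(srcs):
            groups.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups

# r1 = ...; r2 = ...; r3 = r1 + r2 (depends on both); r4 = ...
stream = [("r1", ()), ("r2", ()), ("r3", ("r1", "r2")), ("r4", ())]
print(form_groups(stream))  # r3 starts a second group
```

Here the dependent `r3` is pushed into a second group together with the independent `r4` that follows it.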

11  POWER7 uses different mechanisms to predict the branch direction (taken/not taken) and the branch target address
 The Instruction Fetch Unit (IFU) supports a 3-cycle branch scan loop: it scans fetched instructions for branches, computes target addresses, and determines whether each branch is unconditional or predicted taken
References: [5]

12  Tournament predictor (due to GSEL):
 8K-entry local BHT (LBHT)
▪ BHT - branch history table
 16K-entry global BHT (GBHT)
 8K-entry global selection array (GSEL)
 These arrays provide branch direction predictions for all instructions in a fetch group (a fetch group is up to 8 instructions)
 The arrays are shared by all threads
References: [5]

13  The 8K-entry LBHT is directly indexed by 10 bits of the instruction fetch address
 The GBHT and GSEL arrays are indexed by the instruction fetch address hashed with a 21-bit global history vector (GHV, one per thread) folded down to 11 bits
 The value in GSEL chooses between the LBHT and GBHT predictions
References: [5]

14  The value in GSEL chooses between the LBHT and GBHT for the direction prediction of each individual branch
 Hence the tournament predictor!
 Each BHT entry (LBHT and GBHT) contains 2 bits:
 The higher-order bit determines the direction (taken/not taken)
 The lower-order bit provides hysteresis, so a single mispredict does not immediately flip the prediction
References: [5]
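The three slides above fit together as a classic tournament scheme, which can be sketched as follows. This is a scaled-down, illustrative model, not the real POWER7 logic: the tables are smaller, the address-with-GHV hash is a plain XOR, and the selector training policy is assumed.

```python
# Toy tournament predictor: a local BHT indexed by fetch-address bits,
# a global BHT indexed by the address XOR-ed with a folded global
# history vector (GHV), and a selector (GSEL) choosing between them.
# Each table entry is a 2-bit saturating counter: high bit = direction,
# low bit = hysteresis.

LBHT_BITS, GBHT_BITS, GHV_BITS = 10, 11, 21

def fold_ghv(ghv, out_bits=11):
    """Fold the 21-bit history vector down to out_bits by XOR-ing chunks."""
    folded = 0
    while ghv:
        folded ^= ghv & ((1 << out_bits) - 1)
        ghv >>= out_bits
    return folded

class TournamentPredictor:
    def __init__(self):
        self.lbht = [1] * (1 << LBHT_BITS)  # weakly not-taken
        self.gbht = [1] * (1 << GBHT_BITS)
        self.gsel = [1] * (1 << GBHT_BITS)  # high bit 0 = prefer local
        self.ghv = 0

    def _indices(self, pc):
        li = pc & ((1 << LBHT_BITS) - 1)
        gi = (pc ^ fold_ghv(self.ghv)) & ((1 << GBHT_BITS) - 1)
        return li, gi

    def predict(self, pc):
        li, gi = self._indices(pc)
        local, glob = self.lbht[li] >> 1, self.gbht[gi] >> 1
        return glob if self.gsel[gi] >> 1 else local

    def update(self, pc, taken):
        li, gi = self._indices(pc)
        def bump(table, i, up):
            table[i] = min(3, table[i] + 1) if up else max(0, table[i] - 1)
        local_ok = (self.lbht[li] >> 1) == taken
        global_ok = (self.gbht[gi] >> 1) == taken
        if local_ok != global_ok:          # train selector toward the winner
            bump(self.gsel, gi, global_ok)
        bump(self.lbht, li, taken)
        bump(self.gbht, gi, taken)
        self.ghv = ((self.ghv << 1) | taken) & ((1 << GHV_BITS) - 1)
```

The hysteresis bit is visible in `bump`: a counter at 3 ("strongly taken") must be wrong twice before its direction bit flips.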

15  Branch target addresses are predicted in two ways:
 Indirect branches that are not subroutine returns use a 128-entry count cache, shared by all active threads
▪ The count cache is indexed by XOR-ing 7 bits of the instruction fetch address with the GHV (global history vector)
▪ Each count cache entry contains a 62-bit predicted address and 2 confidence bits
 Subroutine returns are predicted using a link stack, one per thread
▪ This is like the "Return Address Stack" discussed in lecture
References: [5]
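Both target-prediction structures are easy to sketch. The model below is illustrative: entry counts follow the slide, but the index hash and the confidence handling are simplified assumptions, and the link stack is the classic return-address stack the slide compares it to.

```python
# Toy versions of the two indirect-branch mechanisms: a shared count
# cache for general indirect branches, and a per-thread link stack for
# subroutine returns (calls push the return address, returns pop it).

class CountCache:
    def __init__(self, entries=128):
        self.entries = entries
        self.table = [None] * entries    # (predicted_target, confidence)

    def _index(self, pc, ghv):
        return (pc ^ ghv) % self.entries  # simplified XOR index

    def predict(self, pc, ghv):
        entry = self.table[self._index(pc, ghv)]
        return entry[0] if entry else None

    def update(self, pc, ghv, target):
        self.table[self._index(pc, ghv)] = (target, 1)

class LinkStack:
    """Per-thread return-address stack: calls push, returns pop."""
    def __init__(self):
        self.stack = []

    def on_call(self, return_addr):
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ls = LinkStack()
ls.on_call(0x1004)   # call at 0x1000; return address is the next instruction
ls.on_call(0x2008)   # nested call
print(hex(ls.predict_return()))  # the inner return is predicted first
```

The stack discipline is what makes returns predictable even when the same subroutine is called from many different sites.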

16  Each POWER7 core has 12 execution units:
 2 fixed-point units
 2 load/store units
 4 double-precision floating-point units (2x POWER6)
 1 vector unit
 1 branch unit
 1 condition register unit
 1 decimal floating-point unit
References: [4]

17

18

19  IBM POWER7 Demo
 Visual representation of the SMT capabilities of the POWER7
 Brief introduction to the on-chip L3 cache

20  Simultaneous multithreading (SMT)
 Separate instruction streams run concurrently on the same physical processor
 POWER7 supports:
 2 pipes for storage instructions (loads/stores)
 2 pipes for arithmetic instructions (add, subtract, etc.)
 1 pipe for branch instructions (control flow)
 Parallel support for floating-point and vector operations
References: [7], [8]

21  Simultaneous multithreading modes:
 SMT1: one instruction execution thread per core
 SMT2: two instruction execution threads per core
 SMT4: four instruction execution threads per core
 POWER7 supports SMT1, SMT2, and SMT4
 An 8-core POWER7 can therefore execute 32 threads simultaneously
References: [5], [8]
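One way to see why SMT4 helps is to model a single issue cycle in which several threads compete for the pipe mix listed on the previous slide (2 load/store, 2 fixed-point, 1 branch). The scheduling policy below (threads scanned in order, each issuing from its queue head while a pipe is free) is an invented simplification, not the real POWER7 scheduler.

```python
# Toy model of one SMT issue cycle: ready instructions from all active
# threads compete for 2 load/store pipes, 2 fixed-point pipes, and 1
# branch pipe. Pipe counts follow the slide; the policy is assumed.

from collections import deque

PIPES = {"ls": 2, "fx": 2, "br": 1}

def issue_cycle(threads):
    """threads: list of deques of instruction kinds; returns what issued."""
    free = dict(PIPES)
    issued = []
    for tid, q in enumerate(threads):
        # Each thread issues in order from its queue head while pipes last.
        while q and free.get(q[0], 0) > 0:
            kind = q.popleft()
            free[kind] -= 1
            issued.append((tid, kind))
    return issued

t0 = deque(["fx", "fx", "ls"])
t1 = deque(["fx", "br"])
print(issue_cycle([t0, t1]))
```

In this example thread 0 consumes both fixed-point pipes, so thread 1's `fx` stalls at its queue head and its branch waits behind it; with a single thread, those fixed-point pipes would simply idle whenever thread 0 had nothing ready.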

22 [Diagram: per-cycle utilization of the FX0, FX1, FP0, FP1, LS0, LS1, BRX, and CRL pipes under four designs - single-threaded out-of-order, S80 hardware multithreading, POWER5 2-way SMT, and POWER7 4-way SMT - showing up to four threads (0-3) executing simultaneously on POWER7]
References: [3]

23

24

25  See section 2.1.4 in http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf

26 Cache parameters:

Parameter     | L1                   | L2         | L3 (Local)     | L3 (Global)
Size          | 64 KB (32K I, 32K D) | 256 KB     | 4 MB           | 32 MB
Location      | Core                 | Core       | On-Chip        | On-Chip
Access time   | 0.5 ns               | 2 ns       | 6 ns           | 30 ns
Associativity | 4-way I / 8-way D    | 8-way      | 8-way          | 8-way
Write policy  | Write-Through        | Write-Back | Partial Victim | Adaptive
Line size     | 128 B                | 128 B      | 128 B          | 128 B
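The access times in this table give a feel for average memory access time (AMAT). In the sketch below only the latencies come from the slide; the hit rates and the 100 ns memory latency are invented for illustration.

```python
# Rough AMAT estimate from the per-level access times on the slide.
# Hit-rate numbers and the DRAM latency are illustrative assumptions.

L1, L2, L3_LOCAL, L3_GLOBAL = 0.5, 2.0, 6.0, 30.0  # ns, from the table

def amat(h1, h2, h3_local, h3_global, mem_latency=100.0):
    """Weighted latency given the fraction of accesses served by each level."""
    miss_all = 1.0 - (h1 + h2 + h3_local + h3_global)
    return (h1 * L1 + h2 * L2 + h3_local * L3_LOCAL
            + h3_global * L3_GLOBAL + miss_all * mem_latency)

print(f"{amat(0.90, 0.06, 0.02, 0.015):.3f} ns")  # -> 1.640 ns
```

Even with a 30 ns global L3, serving almost everything from the 0.5 ns L1 keeps the average well under 2 ns, which is why the deep on-chip hierarchy pays off.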

27  On-chip cache is required to supply sufficient bandwidth to 8 cores
 The previous off-chip socket interface was unable to scale
 Supports dynamic cores
 Exploits ILP and increased SMT to overlap latency

28  I- and D-caches are split to reduce latency
 Way-prediction bits reduce hit latency
 Write-through:
 No L1 write-backs are required on line eviction
 The high-speed L2 is able to handle the write bandwidth
 B-tree LRU replacement
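The write-through point can be made concrete with a small model: every store updates the next level immediately, so evicting an L1 line never needs a write-back. The dict-based caches and the true-LRU policy below are simplified stand-ins (the slide's B-tree LRU is a cheaper hardware approximation of the same idea).

```python
# Sketch of a write-through L1 with LRU eviction. Because stores always
# propagate to the next level, an evicted line can simply be dropped.

from collections import OrderedDict

class WriteThroughL1:
    def __init__(self, capacity_lines, next_level):
        self.capacity = capacity_lines
        self.lines = OrderedDict()      # addr -> data, kept in LRU order
        self.next_level = next_level    # a dict standing in for the L2

    def store(self, addr, data):
        self.lines[addr] = data
        self.lines.move_to_end(addr)            # mark most recently used
        self.next_level[addr] = data            # write-through: L2 is current
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)      # evict LRU, no write-back

l2 = {}
l1 = WriteThroughL1(capacity_lines=2, next_level=l2)
l1.store(0x100, "a")
l1.store(0x200, "b")
l1.store(0x300, "c")                 # evicts 0x100 from L1 silently
print(0x100 in l1.lines, l2[0x100])  # the data still lives in L2
```

The cost is that every store consumes L2 write bandwidth, which is exactly why the slide notes the high-speed L2 must absorb it.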

29  The L2 is a superset of the L1 caches (inclusive)
 Reduced latency by decreasing capacity
 The L2 uses the local L3 region as a victim cache
 Increased associativity

30  32 MB "fluid" L3 cache
 4 MB of local L3 cache per core (8 cores)
▪ The local region is closer to its core, reducing latency
 L3 cache accesses are routed to the local L3 region first
 Cache lines are cloned when used by multiple cores
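The victim-cache and local-first routing ideas from these two slides can be sketched together. The model below is an illustrative assumption about the flow, not the real coherence protocol: L2 castouts land in the evicting core's local region, and lookups try that region before scanning the rest of the chip.

```python
# Toy model of the "fluid" L3: eight per-core local regions; L2 victims
# are installed locally, and lookups check the local region first.

class LocalL3Region:
    def __init__(self):
        self.lines = {}

class L3Fluid:
    def __init__(self, n_cores=8):
        self.regions = [LocalL3Region() for _ in range(n_cores)]

    def install_victim(self, core, addr, data):
        """An L2 castout from `core` lands in that core's local region."""
        self.regions[core].lines[addr] = data

    def lookup(self, core, addr):
        """Local region first, then the other cores' regions."""
        if addr in self.regions[core].lines:
            return self.regions[core].lines[addr], "local"
        for i, region in enumerate(self.regions):
            if i != core and addr in region.lines:
                return region.lines[addr], "remote"
        return None, "miss"

l3 = L3Fluid()
l3.install_victim(core=0, addr=0x800, data="victim line")
print(l3.lookup(0, 0x800))  # served from the fast local region
print(l3.lookup(3, 0x800))  # served, but from a remote region
```

This is the latency split the table above shows: a local hit costs about 6 ns, while a hit elsewhere in the 32 MB global L3 costs about 30 ns, which is also why hot shared lines are worth cloning into multiple local regions.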

31

32  eDRAM: embedded dynamic random-access memory
 Less area (1 transistor per cell vs. 6 for SRAM)
 Enables the on-chip L3 cache
▪ Reduces L3 latency
▪ Larger internal bus size, which increases bandwidth
 Compared to an off-chip SRAM cache:
▪ 1/6 the latency
▪ 1/5 the standby power
 Also used in game consoles (PS2, Wii, etc.)
References: [5], [6]

33

34

35 References:
 [1] http://en.wikipedia.org/wiki/POWER7
 [2] http://en.wikipedia.org/wiki/PERCS
 [3] Central PA PUG POWER7 review.ppt, http://www.ibm.com/developerworks/wikis/download/attachments/135430247/Central+PA+PUG+POWER7+review.ppt

36  [4] http://www.redbooks.ibm.com/redpapers/pdfs/redp4639.pdf
 [5] http://www.serc.iisc.ernet.in/~govind/243/Power7.pdf
 [6] http://en.wikipedia.org/wiki/EDRAM
 [7] http://www.spscicomp.org/ScicomP16/presentations/Power7_Performance_Overview.pdf
 [8] http://www-03.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf

