Features that you (most probably) didn’t know your Microprocessor had

1 Features that you (most probably) didn’t know your Microprocessor had
Joseph B. Manzano Spring 2009

2 Outline
The Powerful and the Fallen
The Mutualists
The Just Passing
The Olympic Sprinters
The Threads’ Commune
Breaking the Despotic Rule of the Lock

3 The Powerful and The Fallen
Multiple-issue architectures: increase your IPC and take advantage of ILP
Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II and III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Speculative out-of-order execution | Pentium 3 and 4
VLIW / LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by the compiler | Itanium
Supporting techniques: register renaming, Tomasulo algorithm, reorder buffer, scoreboarding

4 The Powerful and The Fallen
Scoreboarding
  Based on the CDC 6000 architecture
  Important feature: the scoreboard
  Issue: checks WAW; decode: checks RAW; execute and write results: checks WAR
Tomasulo Algorithm
  Implemented in the IBM 360/91’s floating-point unit
  Important features: reservation stations and the common data bus (CDB)
  Issue: tag the operands if they are not available, copy them if they are
  Execute: stall on RAW hazards while monitoring the CDB
  Write results: send results to the CDB and dump the store buffer contents
  Exception handling: no instructions can be issued until a branch is resolved
Reorder Buffer
Register Renaming
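The three hazard classes that the scoreboard and Tomasulo stages check for can be illustrated with a small sketch. This is a toy model rather than real issue logic: the `Instr` record, the `hazards` helper, and the register names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """A toy three-operand instruction: dest <- op(srcs)."""
    name: str
    dest: str
    srcs: tuple


def hazards(first, second):
    """Return the hazards that `second` has against an earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads a register that first writes
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites a register that first still reads
    if second.dest == first.dest:
        found.add("WAW")  # both write the same register
    return found


# Hypothetical floating-point sequence used only to exercise the helper.
div = Instr("div", dest="f0", srcs=("f2", "f4"))
add = Instr("add", dest="f6", srcs=("f0", "f8"))
sub = Instr("sub", dest="f8", srcs=("f10", "f14"))
mul = Instr("mul", dest="f6", srcs=("f10", "f8"))

raw_case = hazards(div, add)  # add reads f0, which div writes
war_case = hazards(add, sub)  # sub writes f8, which add reads
waw_case = hazards(add, mul)  # add and mul both write f6
```

Register renaming removes the WAR and WAW cases by giving each write a fresh destination, which leaves only the true RAW dependences to wait on.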

5 The Powerful and The Fallen
IBM PowerPC superscalar architecture: dual core, two-way SMT
Picture courtesy of IBM, from “Power5 Microarchitecture”

6 The Powerful and The Fallen
Intel Xeon out-of-order engine pipeline
Picture courtesy of Intel, from “Hyper-Threading Technology Architecture and Microarchitecture”

8 The Mutualists
Vector Processing
  The supercomputers of the past
  SIMD type of design: the elements of the data stream are all processed by a single type of instruction
  Simplifies hardware design
  Moving toward more “general”-purpose vector processing

9 The Mutualists
The Cell Broadband Engine
  Created by STI (Sony, Toshiba, IBM)
  Composed of nine computing elements
PPE
  The brain of the system: the organizer
  Runs Linux
  PowerPC dual-issue architecture
SPE
  A modified vector architecture
  Limited memory: 256 KiB
  All accesses are to and from this local memory
  Main memory accesses → DMA transfers
MFC
  Each SPE has an MFC unit
  Issues and receives DMA transfers to and from main memory
Bus
  Four rings
  Gatekeeper of the bus
  Has QoS in a limited fashion (RAM)
  Maintains coherency and consistency between all memory units (the MFC, main memory and PPE caches, but not across the local memories of the SPEs)
Diagram labels: BEI, Flex IO, memory interface, SPE, PPSS, PPE, MFC
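The "all accesses go through local memory, main memory is reached by DMA" rule leads to a tiled programming style. Below is a minimal sketch of that pattern in plain Python. The `dma_get` and `dma_put` helpers are hypothetical stand-ins for MFC transfers, not the real Cell SDK calls, and the tile size is an arbitrary assumption chosen to fit well inside the 256 KiB quoted on the slide.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local memory size quoted on the slide


def dma_get(main_memory, start, count):
    """Hypothetical 'get': copy a block of main memory into a local buffer."""
    return bytearray(main_memory[start:start + count])


def dma_put(main_memory, start, local_buffer):
    """Hypothetical 'put': copy a local buffer back into main memory."""
    main_memory[start:start + len(local_buffer)] = local_buffer


def scale_in_tiles(main_memory, factor, tile_bytes=16 * 1024):
    """Scale every byte (mod 256), touching main memory only through DMA."""
    # The tile must fit in the local store, which also has to hold the code.
    assert tile_bytes <= LOCAL_STORE_BYTES
    for start in range(0, len(main_memory), tile_bytes):
        tile = dma_get(main_memory, start, tile_bytes)      # main -> local
        for i, value in enumerate(tile):                    # compute locally
            tile[i] = (value * factor) % 256
        dma_put(main_memory, start, tile)                   # local -> main


data = bytearray(range(256)) * 200          # 51,200 bytes of sample input
expected = bytearray((b * 2) % 256 for b in data)
scale_in_tiles(data, 2)
```

On real hardware the transfers are asynchronous, so a common refinement is to fetch the next tile while the current one is being processed.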

11 The Just Passing
Cache: an “invisible” architecture component
  Not so much in recent years
  PowerPC and other architectures provide instructions to control it: dcbf[e], dcbst[e], dcbz[e], icbi[e], isync
  Instructions are available to touch, to zero out, to reserve, or to lock a line in place
  But for some interesting designs look no further than …

12 The Just Passing
XBOX 360 Xenon Architectures
Picture courtesy of IBM, from “XBOX 360 System Microarchitecture”

14 The Olympic Sprinters
The Hertz race is over; however, some processors are still at it …
  Power 6 and 7 running at 4 and 5 GHz
  Intel Polaris: 3.6 to 6 GHz
Many hardware redesigns are in order
  Make pipelines shorter and simpler
  Get rid of “extra” hardware features

15 The Olympic Sprinters
Power6: running at frequencies from 4 to 5 GHz
13 FO4 versus 23 FO4 pipeline
Pictures courtesy of IBM, from “IBM Power6 Microarchitecture”

17 The Threads’ Commune
Large shared memory systems are becoming scarce
  Scalability issues due to synchronization, contention, and coherency and consistency
Novel solutions have emerged
  Explicit memory hierarchies with very weak memory models
  Massive multithreading on chip
  Synchronization in memory

18 The Threads’ Commune
Cray XMT: an example of a very high SMT design
  128 hardware streams
  A stream is a set of […]-bit registers, 8 target registers, and a control register
  Three functional units: M, A and C
  500 MHz
  Full and empty bits per word (2 bits)

19 The Threads’ Commune
SMT / HT designs
Figure: issue slots over time for superscalar, coarse-grained MT, fine-grained MT, and SMT designs

20 The Threads’ Commune
Cray MTA2. Picture from John Feo’s “Can programmers and machines ever be friends?”

21 The Threads’ Commune
Data Race or Race Condition
  “There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write”
Synchronization
  The orchestration of two or more threads (or processes) to complete a task in a correct manner and to avoid any data races
Problems
  Separation of the lock and the guarded data
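The data race definition above can be made concrete by replaying two interleavings of the same read-modify-write by hand. This is a deterministic single-threaded sketch, not real concurrent code: the `run` helper and the schedule format are invented for illustration.

```python
def run(schedule):
    """Replay an interleaving of 'counter += 1' split into read and write.

    `schedule` is a list of (thread, step) pairs, step being "read" or "write".
    """
    shared = 0
    local = {}  # each thread's private copy of the value it last read
    for thread, step in schedule:
        if step == "read":
            local[thread] = shared
        else:
            shared = local[thread] + 1
    return shared


# Properly ordered: B reads only after A has written.
serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Racy: both threads read the old value, so one increment is lost.
racy = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

serial_result = run(serial)
racy_result = run(racy)
```

Synchronization exists to rule out schedules like `racy`, which is why it matters that the lock and the data it guards are only associated by convention.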

22 The Threads’ Commune
Coherency and Consistency
  Caching elements while making sure that everyone sees the latest copy
  If an element is written by processor A, how will processors B and C know that they have the latest copy?
  A very difficult problem!
  One of the scalability problems of shared memory

23 The Threads’ Commune
How does the Cray XMT solve these problems?
  For synchronization: join the lock with each data word and put the synchronization requirement on the memory instead of on the processor
  For coherence and consistency: DO NOT cache remote data (outside the local 8 GiB)
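Joining the lock with the data word can be sketched in software as a word that carries its own full/empty state. The model below uses a condition variable to imitate what the XMT does in memory hardware. The class and the method names `write_ef` and `read_fe` are illustrative choices, not a real Cray API binding.

```python
import threading


class FullEmptyWord:
    """A single word with a full/empty bit, modeled with a condition variable."""

    def __init__(self):
        self._value = None
        self._full = False
        self._cond = threading.Condition()

    def write_ef(self, value):
        """Wait until the word is empty, store the value, mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, load the value, mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value


def producer_consumer(n):
    """Hand n values from one thread to another through a single word."""
    word = FullEmptyWord()
    received = []

    def consume():
        for _ in range(n):
            received.append(word.read_fe())

    consumer = threading.Thread(target=consume)
    consumer.start()
    for i in range(n):
        word.write_ef(i)
    consumer.join()
    return received


handed_over = producer_consumer(5)
```

No separate lock object appears in `producer_consumer`: the ordering comes entirely from the state attached to the word itself.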

25 Breaking the Despotic Rule of the Lock
Synchronization
  Atomicity and serializability
  Locks and barriers
    Cost from hundreds to tens of thousands of cycles, growing linearly (in the best cases) or polynomially (in the worst cases) with the number of processors
  The lock: the most used synchronization primitive!
Alternatives: lock-free data structures

26 Breaking the Despotic Rule of the Lock
Lock-Free Data Structures
  Used to implement non-blocking and/or wait-free algorithms
  Prevent deadlocks, livelocks and priority inversions
  Potential problem: the ABA problem
    It tells us that no one is working on this now, but not whether someone has worked on it before
Transactional Memory
  Based on transactions (an atomic bundle of operations)
  If two transactions conflict, then one is bound to fail
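The ABA problem can be shown with a hand-sequenced compare-and-swap. This is a single-threaded sketch in which the interleaving is written out explicitly: `Cell`, `TaggedCell`, and the version counter are illustrative assumptions, and the version tag is one common mitigation, not something the slides prescribe.

```python
class Cell:
    """A memory word with a simulated compare-and-swap."""

    def __init__(self, value):
        self.value = value

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


class TaggedCell:
    """The same word paired with a version bumped on every successful CAS."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def load(self):
        return (self.value, self.version)

    def cas(self, expected, new):
        if (self.value, self.version) == expected:
            self.value = new
            self.version += 1
            return True
        return False


def aba_plain():
    cell = Cell("A")
    seen = cell.value           # thread 1 reads A, then is delayed
    cell.cas("A", "B")          # thread 2 changes A -> B
    cell.cas("B", "A")          # thread 2 changes B -> A
    return cell.cas(seen, "C")  # thread 1 succeeds, unaware of the changes


def aba_tagged():
    cell = TaggedCell("A")
    seen = cell.load()              # thread 1 reads (A, version 0)
    cell.cas(cell.load(), "B")      # thread 2: A -> B, version 1
    cell.cas(cell.load(), "A")      # thread 2: B -> A, version 2
    return cell.cas(seen, "C")      # thread 1 fails: the version moved on


plain_succeeds = aba_plain()
tagged_succeeds = aba_tagged()
```

A reservation-based LL/SC pair avoids the same trap differently, because any intervening store to the location invalidates the reservation regardless of the value written.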

27 Side Note
A Review of LL and SC
  Instructions in PowerPC and many other architectures
  Provide a way to optimistically execute a piece of code
  If a “violation” has taken place, discard your results
  Many implementations; PowerPC: lwarx and stwcx

28 Side Note
The LL and SC behavior
The lwarx instruction
  Loads from a word-aligned location
  Side effects: a reservation is created, and the storage coherence mechanism is notified that a reservation exists
The stwcx instruction
  Conditionally stores a value to a given memory location
  Conditionally → depends on the reservation
  On success, all changes are committed to memory
  Otherwise, the changes are discarded

29 Side Note
Reservations
  At most one per processor
  A reservation is lost when
    The processor holding the reservation executes
      A lwarx or ldarx
      A stwcx or stdcx (whether or not the reservation matches)
    Another processor executes
      A store or a dcbz to the granule
    Some other mechanism modifies a storage location in the same reservation granule
  Interrupts do not clear reservations, but interrupt handlers might
  Granularity: the length of the memory block kept under surveillance

30 Side Note Examples
One thread: LL a (a = ?); a *= 100; SC a; brnz (retry if the SC failed). Memory and the storage mechanism hold a = ?.

31 Side Note Examples
Two threads: thread 1 runs LL a; a *= 100; SC a; brnz. Thread 2 runs LL a; a += 100; SC a; brnz. Both hold a reservation on a. Memory holds a = ?.

32 Side Note Examples
A plain store a = 100 reaches memory. Both reservations are lost (X, X). Memory holds a = 100.

33 Side Note Examples
Both SC a fail, so both brnz branch back to retry. Memory still holds a = 100.

34 Side Note Examples
Thread 2 reloads with LL a = 100 and holds a reservation again; thread 1 has not reloaded yet.

35 Side Note Examples
Both threads have reloaded a = 100 and both hold a reservation on a.

36 Side Note Examples
Thread 2's SC succeeds (a += 100): memory holds a = 200. Thread 1's reservation is lost (X).

37 Side Note Examples
Thread 1 still works on the stale a = 100; its SC fails and brnz retries. Memory holds a = 200.

38 Side Note Examples
Thread 1 reloads with LL a = 200.

39 Side Note Examples
Thread 1's SC succeeds (a *= 100): memory holds a = 20000.
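The example frames can be replayed with a small software model of reservations. This is a sketch under simplifying assumptions: every address is its own reservation granule, the unknown initial value is set to 0, and the names `ReservedMemory`, `T1`, `T2`, and `T3` (the source of the plain store) are invented for illustration. It is not PowerPC code.

```python
class ReservedMemory:
    """Toy memory with at most one LL/SC reservation per processor."""

    def __init__(self):
        self.words = {}
        self.reservations = {}  # processor -> reserved address

    def store(self, cpu, addr, value):
        self.words[addr] = value
        # A store by one processor kills other processors' reservations.
        for other, held in list(self.reservations.items()):
            if other != cpu and held == addr:
                del self.reservations[other]

    def load_linked(self, cpu, addr):
        self.reservations[cpu] = addr  # replaces any earlier reservation
        return self.words[addr]

    def store_conditional(self, cpu, addr, value):
        # The reservation is consumed whether or not it matches.
        ok = self.reservations.pop(cpu, None) == addr
        if ok:
            self.store(cpu, addr, value)
        return ok


def replay():
    mem = ReservedMemory()
    mem.words["a"] = 0                      # assumed initial value
    log = []
    t1 = mem.load_linked("T1", "a")         # T1 wants a *= 100
    t2 = mem.load_linked("T2", "a")         # T2 wants a += 100
    mem.store("T3", "a", 100)               # plain store: both lose
    log.append(mem.store_conditional("T1", "a", t1 * 100))
    log.append(mem.store_conditional("T2", "a", t2 + 100))
    t1 = mem.load_linked("T1", "a")         # both retry and reload 100
    t2 = mem.load_linked("T2", "a")
    log.append(mem.store_conditional("T2", "a", t2 + 100))  # a = 200
    log.append(mem.store_conditional("T1", "a", t1 * 100))  # stale, fails
    t1 = mem.load_linked("T1", "a")         # T1 reloads 200
    log.append(mem.store_conditional("T1", "a", t1 * 100))  # a = 20000
    return log, mem.words["a"]


sc_results, final_a = replay()
```

In real code the reload and retry would sit in a loop around the LL/SC pair, which is the role of the brnz in the frames.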

40 Breaking the Despotic Rule of the Lock
Sun Rock Processor
  Execute Ahead
  Scouting Threads
  Simultaneous Multithreading
  Transactional Memory
  Checkpoint
  Cache memory with extra bits for tracking speculative execution
  32 logical threads and 16 physical cores
Pictures courtesy of “Rock: A SPARC CMT Processor”

41 Breaking the Despotic Rule of the Lock
Take a “RISC”-y approach
  Small transactions → hardware, best effort
  Use the checkpoint mechanism!
Transactions == software construct
  Checkpoint in case of failure
  Commit on successful transaction
  Executed speculatively by a strand
  Use the cache store buffers and lock cache lines until commit (tracking lines with the “s-bits”)
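The checkpoint, speculate, then commit-or-discard flow can be sketched in software. This is a toy single-threaded model, not Rock's hardware mechanism: the per-word version counters stand in for the tracking bits, the write buffer stands in for the store buffer, and all class and function names are invented for illustration.

```python
class SpeculativeMemory:
    """Shared words plus a version per word, used to detect conflicts."""

    def __init__(self, **words):
        self.words = dict(words)
        self.versions = {key: 0 for key in words}


class Transaction:
    """Buffers writes and remembers which versions it read."""

    def __init__(self, mem):
        self.mem = mem
        self.read_versions = {}   # word -> version seen at first read
        self.write_buffer = {}    # speculative stores, invisible until commit

    def read(self, key):
        if key in self.write_buffer:
            return self.write_buffer[key]
        self.read_versions.setdefault(key, self.mem.versions[key])
        return self.mem.words[key]

    def write(self, key, value):
        self.write_buffer[key] = value

    def commit(self):
        # Fail if any word we read has changed since we read it.
        for key, version in self.read_versions.items():
            if self.mem.versions[key] != version:
                return False      # discard: the buffer is simply dropped
        for key, value in self.write_buffer.items():
            self.mem.words[key] = value
            self.mem.versions[key] += 1
        return True


def transfer(tx):
    """Move 10 units from word 'a' to word 'b' inside a transaction."""
    tx.write("a", tx.read("a") - 10)
    tx.write("b", tx.read("b") + 10)


def demo():
    mem = SpeculativeMemory(a=100, b=0)
    tx = Transaction(mem)
    transfer(tx)
    mem.words["a"] = 50           # conflicting store from another strand
    mem.versions["a"] += 1
    first = tx.commit()           # conflict detected, nothing is written
    retry = Transaction(mem)
    transfer(retry)
    second = retry.commit()       # no conflict this time
    return first, second, dict(mem.words)


first_commit, second_commit, final_words = demo()
```

Because the approach is best effort, software still needs a fallback path, typically a retry limit followed by a conventional lock, for transactions that keep failing.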

42 Multi-core Trends in this Decade
Timeline figure, 2001 to 2011, grouped here by vendor:
IBM: Power 4 (64-bit PowerPC, 2 cores); Power5 (64-bit PowerPC, 2 cores with SMT); Xenon (64-bit PowerPC, 3 cores); CBE (PowerPC, 9 cores); Power 6 (64-bit PowerPC, 2 cores with SMT); Power7
Intel: Pentium D (IA32 x86, 2 cores); Xeon Dual Core (IA32 x86, 2 cores); Intel Core Duo (IA32 x86, dual core); Intel Core 2 Duo (IA32 x86, 2 cores); Intel Core 2, codenames Penryn and Wolfdale (IA32 x86, dual and quad core); codename Nehalem (1 to 8 cores); codename Sandy Bridge
AMD: Opteron, codename Denmark (IA32 x86, 2 cores); Turion64 X2 (IA32 x86, dual core); codename Barcelona (IA32 x86, native 4 cores)
Sun: UltraSparc T1, codename Niagara (8 cores, 32 logical threads); UltraSparc T2, codename Niagara (8 cores, 64 logical threads); codename Rock (16 cores, 32 logical threads)

43 Sources
The Powerful and the Fallen
  Sinharoy, B. et al., “Power5 System Microarchitecture”, IBM Journal of Research and Development, Vol 49, June/September 2005
  Marr, D. et al., “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Vol 6, Issue 1, 2002
The Mutualists
The Just Passing
  Andrews, Jeff and Baker, Nick, “XBOX 360 System Architecture”, IEEE Micro, Volume 26, Issue 2, March 2006
The Olympic Sprinters
  Le, H.Q. et al., “Power6 System Microarchitecture”, IBM Journal of Research and Development, Vol 51, November 2007
The Threads’ Commune
  Konecny, P., “Introducing the Cray XMT”, May 5th, 2007
  Feo, J., “Can programmers and machines ever be friends?”
Breaking the Despotic Rule of the Lock
  Chaudhry, S., “Rock: A SPARC CMT Processor”, August 26, 2008

