Joseph B. Manzano Spring 2009 Features that you (most probably) didn’t know your Microprocessor had.


1 Joseph B. Manzano Spring 2009 Features that you (most probably) didn’t know your Microprocessor had

2 Outline: The Powerful and the Fallen; The Mutualists; The Just Passing; The Olympic Sprinters; The Threads’ Commune; Breaking the Despotic Rule of the Lock

3 The Powerful and The Fallen
Multiple Issue Architectures: Increase your IPC / Take advantage of ILP
Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristics | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II and III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Speculative out-of-order execution | Pentium 3 and 4
VLIW / LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium
Techniques: Register Renaming, Tomasulo Algorithm, Reorder Buffer, Scoreboarding

4 The Powerful and The Fallen Register Renaming, Tomasulo Algorithm, Reorder Buffer, Scoreboarding. Scoreboarding: based on the CDC 6000 architecture. Important feature: the scoreboard. Hazards checked per stage: issue: WAW; decode: RAW; execute and write results: WAR. Tomasulo Algorithm: implemented in the IBM 360/91’s floating point unit. Important features: reservation stations and the common data bus (CDB). Issue: tag the operands that are not available, copy them if they are. Execute: stall on RAW hazards while monitoring the CDB. Write results: send results to the CDB and dump the store buffer contents. Exception handling: no instructions can be issued until a branch is resolved.
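As an illustration of the three data hazards that the scoreboard and Tomasulo stages check for, here is a minimal sketch that classifies the hazards between two instructions. The `Instr` record and the `hazards` function are hypothetical names introduced for this example; real hardware tracks this state per register and functional unit rather than per instruction pair.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, several sources."""
    dest: str
    srcs: Tuple[str, ...]

def hazards(first: Instr, second: Instr) -> Set[str]:
    """Classify the data hazards a later `second` has on an earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")   # second overwrites what first still has to read
    if second.dest == first.dest:
        found.add("WAW")   # both write the same register
    return found

mul = Instr(dest="f0", srcs=("f2", "f4"))
add = Instr(dest="f2", srcs=("f0", "f6"))
print(sorted(hazards(mul, add)))  # ['RAW', 'WAR']
```

Only RAW is a true dependence. Register renaming gives `add` a fresh destination register, which removes the WAR (and any WAW) hazard without changing the result.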

5 The Powerful and The Fallen Power5: dual core, two-way SMT. IBM PowerPC superscalar architecture. Picture courtesy of IBM from “Power5 System Microarchitecture”

6 The Powerful and The Fallen Intel Xeon Out of Order Engine Pipeline Picture Courtesy of Intel from “Hyper-Threading Technology Architecture and Microarchitecture”

7 Outline: The Powerful and the Fallen; The Mutualists; The Just Passing; The Olympic Sprinters; The Threads’ Commune; Breaking the Despotic Rule of the Lock

8 The Mutualists Vector Processing: the supercomputers of the past. A SIMD type of design: all elements of a data stream are processed by a single type of instruction. Simplifies hardware design. Moving toward more “general” purpose vector processing.

9 The Mutualists The Cell Broadband Engine: created by STI (Sony, Toshiba, IBM). Composed of nine computing elements (one PPE and eight SPEs). PPE: the brain of the system; the organizer; runs Linux; PowerPC dual-issue architecture. SPE: a modified vector architecture; limited memory: 256 KiB; all accesses are to and from this local memory; main memory accesses go through DMA transfers. MFC: each SPE has an MFC unit; it issues and receives DMA transfers to and from main memory. Interconnect: gatekeeper of the bus; four rings; has QoS in a limited fashion (RAM). Coherency and consistency are maintained between all memory units (the MFC, main memory and PPE caches, but not across the local memory of SPEs). Diagram labels: BEI, Flex IO, Memory Interface, SPE, PPSS, PPE, MFC.

10 Outline: The Powerful and the Fallen; The Mutualists; The Just Passing; The Olympic Sprinters; The Threads’ Commune; Breaking the Despotic Rule of the Lock

11 The Just Passing Cache: an “invisible” architecture component, although not so much in recent years. PowerPC and other architectures provide instructions to control it: dcbf[e], dcbst[e], dcbz[e], icbi[e], isync. Instructions are available to touch, zero out, reserve, or lock a line in place. But for some interesting designs look no further than …

12 The Just Passing XBOX 360 Xenon Architecture. Picture courtesy of IBM from “XBOX 360 System Architecture”

13 Outline: The Powerful and the Fallen; The Mutualists; The Just Passing; The Olympic Sprinters; The Threads’ Commune; Breaking the Despotic Rule of the Lock

14 The Olympic Sprinters The Hertz race is over; however, some processors are still at it: Power 6 and 7 running at 4 and 5 GHz; Intel Polaris: 3.6 to 6 GHz. Many hardware redesigns are in order: make pipelines shorter and simpler; get rid of “extra” hardware features.

15 The Olympic Sprinters 13 FO4 versus 23 FO4 pipeline. Power6 running at frequencies from 4 to 5 GHz. Pictures courtesy of IBM from “Power6 System Microarchitecture”

16 Outline: The Powerful and the Fallen; The Mutualists; The Just Passing; The Olympic Sprinters; The Threads’ Commune; Breaking the Despotic Rule of the Lock

17 The Threads’ Commune Large shared memory systems are becoming scarce. Scalability issues due to synchronization, contention, coherency and consistency. Novel solutions have emerged: explicit memory hierarchies with very weak memory models; massive multithreading on chip; synchronization in memory.

18 The Threads’ Commune Cray XMT: 128 hardware streams. A stream is bit registers, 8 target registers, and a control register. Three functional units: M, A and C. 500 MHz. Full and empty bits per word (2 bits). An example of a very high SMT design.

19 The Threads’ Commune SMT / HT designs

20 The Threads’ Commune Cray MTA2 picture from John Feo’s “Can programmers and machines ever be friends?”

21 The Threads’ Commune Data Race or Race Condition: “There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write.” Synchronization: the orchestration of two or more threads (or processes) to complete a task in a correct manner and to avoid any data races. Problems: separation of lock and guarded data.
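To make the data race definition concrete, here is a small deterministic sketch of a lost update. It does not use real threads: each "thread" is a generator that splits `a += 1` into its read half and its write half, so the harmful interleaving can be replayed exactly. The names `shared` and `racy_increment_steps` are made up for this illustration.

```python
shared = {"a": 0}

def racy_increment_steps(name):
    """Split `a += 1` into a read step and a write step."""
    local = shared["a"]              # read the shared word
    yield f"{name} read {local}"
    shared["a"] = local + 1          # write back a value computed from a stale read
    yield f"{name} wrote {local + 1}"

t1 = racy_increment_steps("T1")
t2 = racy_increment_steps("T2")

# Interleaving: both threads read before either one writes.
trace = [next(t1), next(t2), next(t1), next(t2)]
print(trace)
print(shared["a"])  # 1, although two increments ran
```

Both accesses touch the same location and one of them is a write, so without synchronization the second write silently overwrites the first.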

22 The Threads’ Commune Coherency and Consistency: caching elements while making sure that everyone sees the latest copy. If an element is written by processor A, how will processors B and C know that they have the latest copy? Very difficult problem! One of the scalability problems of shared memory.

23 The Threads’ Commune How does the Cray XMT solve these problems? For synchronization: join the lock with each data word and put the synchronization requirement on the memory instead of the processor. For coherence and consistency: DO NOT cache remote data (outside the local 8 GiB).
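The idea of joining the lock with each data word can be sketched in software as a word that carries a full/empty flag. This is only a model under stated assumptions: `FEWord`, `write_ef` and `read_fe` are illustrative names, and a `threading.Condition` stands in for what the XMT does in its memory system.

```python
import threading

class FEWord:
    """One memory word with a full/empty bit (hypothetical software model)."""
    def __init__(self):
        self._value = None
        self._full = False
        self._cond = threading.Condition()

    def write_ef(self, value):
        """Wait until the word is empty, write it, then mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, read it, then mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value

word = FEWord()
received = []

def consumer():
    for _ in range(3):
        received.append(word.read_fe())

t = threading.Thread(target=consumer)
t.start()
for item in (10, 20, 30):
    word.write_ef(item)   # blocks while the previous item is still unread
t.join()
print(received)  # [10, 20, 30]
```

The producer and consumer never name a separate lock: the synchronization state travels with the data word itself.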

24 Outline: The Powerful and the Fallen; The Mutualists; The Just Passing; The Olympic Sprinters; The Threads’ Commune; Breaking the Despotic Rule of the Lock

25 Synchronization Atomicity and Serializability. Locks and barriers: cost around hundreds to tens of thousands of cycles, growing linearly (in the best cases) or polynomially (in the worst cases) with the number of processors. The lock: the most used synch primitive! Alternatives: lock-free data structures.

26 Breaking the Despotic Rule of the Lock Lock-Free Data Structures: used to implement non-blocking and/or wait-free algorithms. Prevent deadlocks, livelocks and priority inversions. Potential problems: the ABA problem: the check tells us no one is working on this now, but not whether someone has worked on it before. Transactional Memory: based on transactions (an atomic bundle of operations). If two transactions conflict then one is bound to fail.
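The ABA problem can be replayed step by step on a lock-free stack. The sketch below is single-threaded and deterministic: `Cell.cas` only simulates a compare-and-swap (it is not atomic in any real sense), and `Cell` and `Node` are hypothetical names used for this illustration.

```python
class Cell:
    """A shared location with a simulated compare-and-swap."""
    def __init__(self, value):
        self.value = value

    def cas(self, expected, new):
        if self.value is expected:
            self.value = new
            return True
        return False

class Node:
    def __init__(self, name, next=None):
        self.name, self.next = name, next

# Stack: A -> B -> C
c = Node("C")
b = Node("B", c)
a = Node("A", b)
top = Cell(a)

# Thread 1 starts a pop: it reads top (A) and A.next (B), then is preempted.
seen_top, seen_next = top.value, top.value.next

# Thread 2 runs meanwhile: pops A, pops B, pushes A back. Stack: A -> C
top.value = a.next            # pop A
top.value = top.value.next    # pop B
a.next = top.value            # A now points at C
top.value = a                 # push A

# Thread 1 resumes: top is A again, so its CAS succeeds.
ok = top.cas(seen_top, seen_next)
print(ok, top.value.name)  # True B: the already-popped node B is back on the stack
```

A common remedy is to pair the pointer with a version counter so that the second "A" differs from the first. LL/SC sidesteps the issue because any intervening store loses the reservation, even if the old value is written back.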

27 Side Note A Review of LL and SC. PowerPC and many other architectures provide these instructions. They offer a way to optimistically execute a piece of code; in case a “violation” has taken place, discard your results. Many implementations. PowerPC: lwarx and stwcx

28 Side Note The LL and SC behavior. The lwarx instruction: loads from a word-aligned location. Side effects: a reservation is created; the storage coherence mechanism is notified that a reservation exists. The stwcx instruction: conditionally stores a value to a given memory location. “Conditionally” means it depends on the reservation. On success, all changes are committed to memory; if not, the changes are discarded.

29 Side Note Reservations. At most one per processor. A reservation is lost when: the processor holding the reservation executes a lwarx or ldarx, or a stwcx or stdcx (no matter whether the reservation matches or not); another processor executes a store or a dcbz to the granule; some other mechanism modifies a storage location in the same reservation granule. Interrupts do not clear reservations, but interrupt handlers might. Granularity: the length of the memory block kept under surveillance.
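The reservation rules above can be captured in a toy model. This is a sketch, not a PowerPC API: `ReservationMemory` and its methods are hypothetical, the 16-byte granule is an arbitrary choice, and the model assumes that a conditional store whose address does not match the reservation simply fails.

```python
class ReservationMemory:
    """Toy model of lwarx/stwcx reservations (names and sizes are assumptions)."""
    def __init__(self, granule=16):
        self.mem = {}
        self.granule = granule     # reservation granule size in bytes
        self.reservation = {}      # cpu -> reserved granule, at most one per cpu

    def _granule(self, addr):
        return addr // self.granule

    def lwarx(self, cpu, addr):
        """Load and reserve; replaces any earlier reservation of this cpu."""
        self.reservation[cpu] = self._granule(addr)
        return self.mem.get(addr, 0)

    def store(self, cpu, addr, value):
        """Plain store: other cpus lose their reservation on this granule."""
        g = self._granule(addr)
        for other, held in list(self.reservation.items()):
            if other != cpu and held == g:
                del self.reservation[other]
        self.mem[addr] = value

    def stwcx(self, cpu, addr, value):
        """Conditional store: the reservation is consumed, match or not."""
        held = self.reservation.pop(cpu, None)
        if held != self._granule(addr):
            return False
        self.store(cpu, addr, value)
        return True

m = ReservationMemory(granule=16)
m.lwarx(cpu=0, addr=0)
m.store(cpu=1, addr=8, value=7)            # different word, same granule
print(m.stwcx(cpu=0, addr=0, value=1))     # False: the reservation was lost
m.lwarx(cpu=0, addr=0)
print(m.stwcx(cpu=0, addr=0, value=1))     # True
print(m.stwcx(cpu=0, addr=0, value=2))     # False: no reservation left
```

The first failure shows why granularity matters: a store to a neighbouring word in the same granule is enough to cancel the reservation.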

30 Side Note Examples. Processor P1 runs the loop: LL a; a *= 100; SC a; brnz (retry on failure). P1 executes LL a (a = ?) and the storage mechanism records a reservation on a for P1. Memory: a = ?

31 Side Note Examples. Processor P2 runs the loop: LL a; a += 100; SC a; brnz. P2 also executes LL a (a = ?). The storage mechanism now holds a reservation on a for each processor. Memory: a = ?

32 Side Note Examples. A plain store a = 100 goes through the storage mechanism; both reservations are lost (X). Memory: a = 100

33 Side Note Examples. With both reservations lost (X), neither SC a can succeed and both processors must retry. Memory: a = 100

34 Side Note Examples. P2 retries: LL a returns 100 and P2 holds a reservation again; P1’s reservation is still lost (X). Memory: a = 100

35 Side Note Examples. P1 retries: LL a returns 100. Both processors hold a reservation on a. Memory: a = 100

36 Side Note Examples. P2’s SC a succeeds (a += 100) and P1’s reservation is lost (X). Memory: a = 200

37 Side Note Examples. P2 is done. P1’s SC a fails because its reservation was lost (X). Memory: a = 200

38 Side Note Examples. P1 retries: LL a returns 200 and a new reservation is recorded. Memory: a = 200

39 Side Note Examples. P1’s SC a succeeds (a *= 100). Memory: a = 20000
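The tail of the example sequence (a goes from 100 to 200 to 20000) can be replayed with a small LL/SC retry loop. This is a sketch under assumptions: `Word`, `atomic_update` and the `interfere` hook are hypothetical, and the interleaving is forced by hand rather than produced by real processors.

```python
class Word:
    """One shared word with LL/SC reservations (toy model)."""
    def __init__(self, value):
        self.value = value
        self.reserved = set()     # processors currently holding a reservation

    def ll(self, cpu):
        self.reserved.add(cpu)
        return self.value

    def sc(self, cpu, new):
        if cpu not in self.reserved:
            return False          # reservation lost: discard the result
        self.value = new
        self.reserved = set()     # a successful store cancels every reservation
        return True

def atomic_update(word, cpu, fn, interfere=None):
    """Retry LL / compute / SC until the SC succeeds; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        old = word.ll(cpu)
        if interfere and attempts == 1:
            interfere()           # another processor runs between our LL and SC
        if word.sc(cpu, fn(old)):
            return attempts

a = Word(100)
# P2 completes a += 100 while P1 sits between its LL and its SC.
p2 = lambda: atomic_update(a, "P2", lambda v: v + 100)
tries = atomic_update(a, "P1", lambda v: v * 100, interfere=p2)
print(a.value, tries)  # 20000 2
```

P1's first attempt computed 100 * 100 from a stale value, and the failed SC is what kept that result from reaching memory.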

40 Breaking the Despotic Rule of the Lock Sun Rock Processor: Execute Ahead; Scouting Threads; Simultaneous Multithreading; Transactional Memory; Checkpoint; cache memory with extra bits for tracking speculative execution; 32 logical threads and 16 physical cores. Pictures courtesy of “Rock: A SPARC CMT Processor”

41 Breaking the Despotic Rule of the Lock Take a “RISC”-y approach: small transactions go to hardware on a best-effort basis. Use the checkpoint mechanism! Transactions == software construct. Checkpoint in case of failure; commit on a successful transaction. Executed speculatively by a strand. Use the cache store buffers and lock cache lines until commit (tracking lines with the “s-bits”).
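The checkpoint, speculate, and commit-or-fail cycle can be sketched in software. This is not Rock's mechanism: the model below swaps in per-address version numbers where the hardware tracks cache lines with s-bits, and `Memory`, `Transaction`, `Abort` and `run` are all hypothetical names.

```python
class Abort(Exception):
    """Raised when a transaction cannot commit."""

class Memory:
    def __init__(self):
        self.data = {}
        self.version = {}

class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.reads = {}     # addr -> version seen when first read
        self.writes = {}    # buffered speculative stores

    def load(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        self.reads[addr] = self.mem.version.get(addr, 0)
        return self.mem.data.get(addr, 0)

    def store(self, addr, value):
        self.writes[addr] = value    # invisible to others until commit

    def commit(self):
        for addr, seen in self.reads.items():
            if self.mem.version.get(addr, 0) != seen:
                raise Abort(addr)    # someone committed a conflicting write
        for addr, value in self.writes.items():
            self.mem.data[addr] = value
            self.mem.version[addr] = self.mem.version.get(addr, 0) + 1

def run(mem, body, max_tries=3):
    """Best effort: retry a few times, then give up to a fallback path."""
    for attempt in range(1, max_tries + 1):
        tx = Transaction(mem)        # fresh buffers play the role of the checkpoint
        try:
            body(tx)
            tx.commit()
            return attempt
        except Abort:
            continue
    raise RuntimeError("fall back to a lock-based path")

mem = Memory()
mem.data.update({"x": 10, "y": 0})
conflict = {"pending": True}

def move_five(tx):
    x = tx.load("x")
    if conflict.pop("pending", False):
        other = Transaction(mem)     # a conflicting commit by another strand
        other.store("x", 50)
        other.commit()
    tx.store("x", x - 5)
    tx.store("y", tx.load("y") + 5)

print(run(mem, move_five), mem.data)  # 2 {'x': 45, 'y': 5}
```

Because stores are buffered, a failed transaction leaves memory untouched, which is the property the checkpoint provides in hardware. The bounded retry count reflects the best-effort nature: software must always have a fallback.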

42 Multi-core Trends in this Decade
IBM: Power 4 (64 bit PowerPC, 2 Core); Power5 (64 bit PowerPC, 2 Core with SMT); Power 6 (64 bit PowerPC, 2 Core with SMT); Power7; CBE (PowerPC, 9 Core chip); Xenon (64 bit PowerPC, 3 Core chip)
Intel: Pentium D (IA32 x86, 2 Core Chip); Xeon Dual Core (IA32 x86, 2 Core Chip); Core Duo (IA32 x86, Dual Core Chip); Core 2 Duo (IA32 x86, 2 Core Chip); Core 2, Codename: Penryn, Wolfdale (IA32 x86, Dual & Quad Core Chip); Codename: Nehalem (1 to 8 Core Chip); Codename: Sandy Bridge
AMD: Turion64 X2 (IA32 x86, Dual Core Chip); Opteron, Code Name: Denmark (IA32 x86, 2 Core Chip); Code Name: Barcelona (IA32 x86, Native 4 Core Chip)
SUN: UltraSparc T1, Codename: Niagara (8 Core Processor, 32 Logical Threads); UltraSparc T2, Codename: Niagara (8 Core Processor, 64 Logical Threads); Codename: Rock (16 Core Processor, 32 Logical Threads)

43 Sources
The Powerful and the Fallen: Sinharoy, B. et al., “Power5 System Microarchitecture”, IBM Journal of Research and Development, Vol 49, June/September 2005. Marr, D. et al., “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Vol 6, Issue 1, 2002.
The Mutualists
The Just Passing: Andrews, Jeff and Baker, Nick, “XBOX 360 System Architecture”, IEEE Micro, Volume 26, Issue 2, March 2006.
The Olympic Sprinters: Le, H.Q. et al., “Power6 System Microarchitecture”, IBM Journal of Research and Development, Vol 51, November 2007.
The Threads’ Commune: Konecny, P., “Introducing the Cray XMT”, May 5th, 2007. Feo, J., “Can programmers and machines ever be friends?”
Breaking the Despotic Rule of the Lock: Chaudhry, S., “Rock: A SPARC CMT Processor”, August 26, 2008.

