Features that you (most probably) didn’t know your Microprocessor had Joseph B. Manzano Spring 2009

Outline The Powerful and the Fallen The Mutualists The Just Passing The Olympic Sprinters The Threads’ Commune Breaking the Despotic Rule of the Lock

The Powerful and The Fallen
Multiple Issue Architectures: increase your IPC / take advantage of ILP
Superscalar (static): dynamic issue, hardware hazard detection, static scheduling; in-order execution; examples: Sun UltraSPARC II and III
Superscalar (dynamic): dynamic issue, hardware hazard detection, dynamic scheduling; some out-of-order execution; example: IBM Power2
Superscalar (speculative): dynamic issue, hardware hazard detection, dynamic scheduling with speculation; speculative out-of-order execution; examples: Pentium 3 and 4
VLIW / LIW: static issue, software hazard detection, static scheduling; no hazards between issue packets; examples: Trimedia, i860
EPIC: mostly static issue, mostly software hazard detection, mostly static scheduling; dependences marked explicitly by the compiler; example: Itanium
Supporting mechanisms: register renaming, Tomasulo algorithm, reorder buffer, scoreboarding

The Powerful and The Fallen
Scoreboarding: based on the CDC 6600 architecture
Important feature: the scoreboard
Hazards handled per stage: issue checks WAW, decode checks RAW, execute and write results check WAR
Tomasulo Algorithm: implemented in the IBM System/360 Model 91's floating-point unit
Important features: reservation stations and the CDB (common data bus)
Issue: copy operands that are already available; otherwise record the tag of the station that will produce them
Execute: stall on RAW hazards by monitoring the CDB
Write results: send the result to the CDB and dump the store buffer contents
Exception handling: no instructions can be issued until a branch is resolved
Related mechanisms: register renaming, reorder buffer
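To make the Tomasulo bookkeeping concrete, here is a minimal sketch (not from the original slides) of the state one reservation-station entry tracks, written in C; field names are illustrative.

    /* One reservation-station entry in a Tomasulo-style scheduler (sketch). */
    typedef struct {
        int  busy;        /* entry in use?                                     */
        int  op;          /* operation to perform (e.g., ADD, MUL)             */
        long Vj, Vk;      /* source operand values, once they are available    */
        int  Qj, Qk;      /* tags of the stations producing them;              */
                          /* 0 means the value is already in Vj/Vk             */
        int  tag;         /* tag broadcast on the CDB when the result is ready */
    } ReservationStation;

    /* Issue: if a source is still being computed, copy the producer's tag
       into Qj/Qk; otherwise copy the value into Vj/Vk.
       Execute: stall until Qj == 0 and Qk == 0, filling Vj/Vk by snooping
       the CDB.  Write result: broadcast (tag, value) on the CDB. */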

The Powerful and The Fallen
IBM Power5: dual core, two-way SMT, PowerPC superscalar architecture. Picture courtesy of IBM, from "Power5 System Microarchitecture".

The Powerful and The Fallen Intel Xeon Out of Order Engine Pipeline Picture Courtesy of Intel from “Hyper-Threading Technology Architecture and Microarchitecture”

Outline The Powerful and the Fallen The Mutualists The Just Passing The Olympic Sprinters The Threads’ Commune Breaking the Despotic Rule of the Lock

The Mutualists
Vector Processing: the supercomputer approach of the past
A SIMD type of design: the elements of the data stream are all worked on by a single instruction
Simplifies hardware design
Now moving toward more "general purpose" vector processing
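As a concrete illustration of the SIMD idea (one instruction operating on several elements of the data stream), here is a minimal sketch using x86 SSE intrinsics; it is not from the original slides and assumes an SSE-capable compiler and that n is a multiple of 4.

    #include <immintrin.h>   /* x86 SSE intrinsics, used purely as an illustration */

    /* Add two float arrays four elements at a time: each _mm_add_ps performs
       four additions with a single instruction. */
    void vec_add(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);              /* load 4 floats        */
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));   /* 4 adds in one op     */
        }
    }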

The Mutualists
The Cell Broadband Engine
Created by STI (Sony, Toshiba, IBM); composed of nine computing elements
PPE: the brain of the system and its organizer; runs Linux; a dual-issue PowerPC architecture
SPE: a modified vector architecture; limited memory (a 256 KiB local store); all accesses are to and from this local memory; main memory accesses become DMA transfers
MFC: each SPE has an MFC unit to issue and receive DMAs to and from main memory; the gatekeeper of the bus
The bus: four rings, with QoS in a limited fashion
Coherency and consistency are maintained between all memory units (the MFCs, main memory, and the PPE caches, but not across the local memories of the SPEs)
[Block diagram labels: BEI, FlexIO, memory interface (RAM), SPE, PPSS, PPE, MFC]
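A minimal sketch of what "main memory accesses become DMA transfers" looks like in SPE code, assuming the Cell SDK's spu_mfcio.h interface; the call names here are recalled from that SDK, so treat the details as illustrative rather than authoritative.

    #include <spu_mfcio.h>   /* SPU-side MFC interface from the Cell SDK */

    /* Local-store buffer; DMA targets should be aligned (128 bytes is ideal). */
    volatile char buf[16384] __attribute__((aligned(128)));

    void fetch(unsigned long long ea /* effective address in main memory */)
    {
        const unsigned int tag = 1;
        mfc_get(buf, ea, sizeof(buf), tag, 0, 0);  /* queue the DMA 'get' on the MFC */
        mfc_write_tag_mask(1 << tag);              /* select which tag to wait on    */
        mfc_read_tag_status_all();                 /* block until the DMA completes  */
    }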

Outline The Powerful and the Fallen The Mutualists The Just Passing The Olympic Sprinters The Threads’ Commune Breaking the Despotic Rule of the Lock

The Just Passing
The cache: an "invisible" architecture component, but not so much in recent years
PowerPC and other architectures provide instructions to control it: dcbf[e], dcbst[e], dcbz[e], icbi[e], isync
Instructions are available to touch a line, zero it out, reserve it, or lock it in place
But for some interesting designs, look no further than …
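A minimal sketch of driving a few of those cache-control instructions from C with GCC inline assembly; it only makes sense when compiled for a PowerPC target, and the wrapper functions are illustrative, not a standard API.

    /* Thin wrappers around PowerPC cache-control instructions (sketch). */
    static inline void cache_zero_line(void *p)
    {
        __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory");  /* zero a data cache line  */
    }

    static inline void cache_flush_line(void *p)
    {
        __asm__ volatile("dcbf 0,%0" : : "r"(p) : "memory");  /* flush the line to memory */
    }

    static inline void cache_touch_line(const void *p)
    {
        __asm__ volatile("dcbt 0,%0" : : "r"(p));             /* touch (prefetch) a line  */
    }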

The Just Passing
XBOX 360 Xenon Architecture. Picture courtesy of IBM, from "XBOX 360 System Architecture".

Outline The Powerful and the Fallen The Mutualists The Just Passing The Olympic Sprinters The Threads’ Commune Breaking the Despotic Rule of the Lock

The Olympic Sprinters
The Hertz race is over; however, some processors are still at it:
Power6 and Power7, running at 4 to 5 GHz
Intel Polaris: 3.6 to 6 GHz
Many hardware re-designs are in order: make pipelines shorter and simpler, and get rid of "extra" hardware features

The Olympic Sprinters
Power6: running at frequencies from 4 to 5 GHz; a 13 FO4 versus a 23 FO4 pipeline
Pictures courtesy of IBM, from "IBM Power6 Microarchitecture"

Outline The Powerful and the Fallen The Mutualists The Just Passing The Olympic Sprinters The Threads’ Commune Breaking the Despotic Rule of the Lock

The Threads’ Commune
Large shared-memory systems are becoming scarce
Scalability issues due to synchronization: contention, coherency and consistency
Novel solutions have emerged: explicit memory hierarchies with very weak memory models, massive multithreading on chip, synchronization in memory

The Threads’ Commune
Cray XMT: an example of a very high-degree SMT design
128 hardware streams; a stream is 31 64-bit registers, 8 target registers, and a control register
Three functional units: M, A, and C
500 MHz
Full and Empty bits per word (2 bits); see the sketch below
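A sketch of how the full/empty bits are used from code, in the spirit of the XMT C compiler's sync variables and its readfe/writeef generics; the names and signatures are given from memory and are not ISO C, so treat them as illustrative only.

    /* XMT-style synchronization on a single word (illustrative sketch). */
    sync long slot;            /* 'sync' qualified: full/empty bit checked on access */

    void producer(long v)
    {
        writeef(&slot, v);     /* wait until EMPTY, write the value, set FULL */
    }

    long consumer(void)
    {
        return readfe(&slot);  /* wait until FULL, read the value, set EMPTY  */
    }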

The Threads’ Commune
SMT / HT designs
[Diagram: issue slots over time for a superscalar, coarse-grained MT, fine-grained MT, and SMT processor]
http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm

The Threads’ Commune
Cray MTA-2. Picture from John Feo's "Can programmers and machines ever be friends?"

The Threads’ Commune
Data race (or race condition): "an anomaly of concurrent accesses by two or more threads to a shared memory location, where at least one of the accesses is a write"
Synchronization: the orchestration of two or more threads (or processes) to complete a task correctly and to avoid any data races
Problems: the separation of the lock from the data it guards
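A minimal C/pthreads sketch of the data race just defined (not from the original slides): two threads update the same shared counter with no synchronization, so increments get lost.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;            /* unprotected read-modify-write: the race */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* expected 2000000, usually less */
        return 0;
    }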

The Threads’ Commune
Coherency and consistency: caching data while making sure that everyone sees the latest copy
If a location is written by processor A, how will processors B and C know whether they have the latest copy?
A very difficult problem, and one of the scalability problems of shared memory

The Threads’ Commune
How does the Cray XMT solve these problems?
For synchronization: join the lock with each data word, putting the synchronization requirement on the memory instead of on the processor
For coherence and consistency: do NOT cache remote data (anything outside the local 8 GiB)

Outline The Powerful and the Fallen The Mutualists The Just Passing The Olympic Sprinters The Threads’ Commune Breaking the Despotic Rule of the Lock

Breaking the Despotic Rule of the Lock
Synchronization: atomicity and serializability
Locks and barriers: around hundreds to tens of thousands of cycles, growing linearly (in the best cases) or polynomially (in the worst cases) with the number of processors
The lock: the most used synchronization primitive! (a minimal sketch follows)
Alternatives: lock-free data structures
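To ground the cost argument, here is a minimal test-and-set spinlock sketch using C11 atomics (not the slides' own code): every failed acquisition attempt is another coherence transaction on the cache line holding the flag, which is one reason contended locks burn so many cycles.

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } spinlock_t;
    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static void spin_lock(spinlock_t *l)
    {
        /* Spin until the flag was previously clear; each failed attempt
           bounces the cache line between contending processors. */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }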

Breaking the Despotic Rule of the Lock
Lock-Free Data Structures
Used to implement non-blocking and/or wait-free algorithms
Prevent deadlocks, livelocks, and priority inversions
Potential problem: the ABA problem; a compare-and-swap tells us that no one is working on the location right now, but not whether someone modified it and restored it in the meantime
Transactional Memory
Based on transactions (an atomic bundle of operations)
If two transactions conflict, one is bound to fail
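A minimal sketch of a lock-free stack push built on compare-and-swap, using the GCC/Clang __atomic builtins (illustrative, not the slides' own code); the matching pop is where the ABA problem described above shows up, since the CAS only checks that the head pointer looks unchanged.

    #include <stdbool.h>

    struct node { struct node *next; int value; };

    static struct node *top;     /* stack head, only ever updated via CAS */

    static void push(struct node *n)
    {
        struct node *old;
        do {
            old = __atomic_load_n(&top, __ATOMIC_ACQUIRE);
            n->next = old;       /* link the new node above the current head */
        } while (!__atomic_compare_exchange_n(&top, &old, n,
                                              true,               /* weak CAS */
                                              __ATOMIC_RELEASE,
                                              __ATOMIC_RELAXED)); /* retry on failure */
    }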

Side Note
A Review of LL and SC
Instructions found on PowerPC and many other architectures
Provide a way to optimistically execute a piece of code; if a "violation" has taken place, discard the results
Many implementations; on PowerPC: lwarx and stwcx.

Side Note
The LL and SC behavior
The lwarx instruction loads a word-aligned location; side effects: a reservation is created, and the storage coherence mechanism is notified that a reservation exists
The stwcx. instruction conditionally stores a word to a given memory location; "conditionally" means it depends on the reservation: on success all changes are committed to memory, otherwise they are discarded
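Putting the two instructions together, here is a minimal atomic-add sketch in GCC extended inline assembly for a 32-bit PowerPC target; it is illustrative only and assumes a GCC-compatible compiler.

    /* Atomically add 'delta' to *addr using the lwarx/stwcx. pair (sketch). */
    static inline int atomic_add_llsc(volatile int *addr, int delta)
    {
        int old, tmp;
        __asm__ volatile(
            "1: lwarx   %0,0,%2\n"   /* LL: load and create a reservation        */
            "   add     %1,%0,%3\n"  /* compute the updated value                */
            "   stwcx.  %1,0,%2\n"   /* SC: store only if the reservation holds  */
            "   bne-    1b\n"        /* reservation lost, so retry               */
            : "=&r"(old), "=&r"(tmp)
            : "r"(addr), "r"(delta)
            : "cr0", "memory");
        return old;                  /* value observed before the update         */
    }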

Side Note
Reservations
At most one per processor
A reservation is lost when:
the processor holding the reservation executes a lwarx/ldarx, or a stwcx./stdcx. (whether or not the address matches the reservation)
another processor executes a store or a dcbz to the reservation granule
some other mechanism modifies a storage location in the same reservation granule
Interrupts do not clear reservations, but interrupt handlers might
Granularity: the length of the memory block kept under surveillance

Side Note
Examples
[Slide animation: a step-by-step LL/SC walk-through on a shared word a.]
A single thread performs LL a, computes a *= 100, then SC a, branching back to retry if the SC fails.
Two threads both LL the same word and compute different updates (a *= 100 and a += 100); whichever SC reaches memory first succeeds, while the other thread's reservation is lost, its SC fails, and it branches back to retry with the freshly loaded value.
A plain store to a (a = 100 in the animation) by another processor also destroys outstanding reservations, so both pending SCs fail and both threads retry.
Once the retries settle, the updates have been applied one at a time: memory goes from a = 100 to a = 200 and finally to a = 20000.

Breaking the Despotic Rule of the Lock
Sun Rock processor: 16 physical cores, 32 logical threads
Execute Ahead, scouting threads, simultaneous multithreading, transactional memory
Checkpoint mechanism; cache memory with extra bits for tracking speculative execution
Pictures courtesy of "Rock: A SPARC CMT Processor"

Breaking the Despotic Rule of the Lock
Take a "RISC"-y approach
Small transactions go to hardware, on a best-effort basis; use the checkpoint mechanism!
Transactions are a software construct: checkpoint in case of failure, commit on a successful transaction
Executed speculatively by a strand
Uses the cache store buffers and locks cache lines until commit (tracking lines with the "s-bits")
(a sketch of the pattern follows)
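A minimal sketch of the best-effort pattern the slide describes: attempt a small hardware transaction and fall back to a lock when it fails. The htm_begin/htm_commit/lock helpers are hypothetical stand-ins for Rock's checkpoint and commit machinery, not a real API.

    extern int  htm_begin(void);    /* hypothetical: checkpoint; returns 0 on failure */
    extern void htm_commit(void);   /* hypothetical: commit the speculative stores    */
    extern void lock(void);         /* hypothetical fallback lock                     */
    extern void unlock(void);

    void atomic_update(long *a, long *b)
    {
        if (htm_begin()) {          /* speculative path: stores held in the store   */
            *a += 1;                /* buffers, touched lines tracked by the s-bits */
            *b -= 1;
            htm_commit();           /* no conflict: make the stores visible at once */
        } else {                    /* transaction failed: take the slow, locked path */
            lock();
            *a += 1;
            *b -= 1;
            unlock();
        }
    }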

Multi-core Trends in this Decade (2001 to 2011)
IBM: Power4 (64-bit PowerPC, 2 cores), Power5 (64-bit PowerPC, 2 cores with SMT), Power6 (64-bit PowerPC, 2 cores with SMT), Power7, CBE (PowerPC, 9-core chip), Xenon (64-bit PowerPC, 3-core chip)
Intel: Pentium D (IA32 x86, 2-core chip), Xeon Dual Core (IA32 x86, 2-core chip), Core Duo (IA32 x86 dual-core chip), Core 2 / Core 2 Duo (codenames Penryn, Wolfdale; IA32 x86 dual- and quad-core chips), codename Nehalem (1- to 8-core chip), codename Sandy Bridge
AMD: Turion64 X2 (IA32 x86 dual-core chip), Opteron (codename Denmark; IA32 x86 2-core chip), codename Barcelona (IA32 x86 native 4-core chip)
Sun: UltraSparc T1 (codename Niagara; 8-core processor, 32 logical threads), UltraSparc T2 (codename Niagara 2; 8-core processor, 64 logical threads), codename Rock (16-core processor, 32 logical threads)

Sources
The Powerful and the Fallen:
Sinharoy, B. et al., "Power5 System Microarchitecture," IBM Journal of Research and Development, Vol. 49, June/September 2005
Marr, D. et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Vol. 6, Issue 1, 2002
The Just Passing:
Andrews, Jeff and Baker, Nick, "XBOX 360 System Architecture," IEEE Micro, Vol. 26, Issue 2, March 2006
The Olympic Sprinters:
Le, H.Q. et al., "IBM Power6 Microarchitecture," IBM Journal of Research and Development, Vol. 51, November 2007
The Threads' Commune:
Konecny, P., "Introducing the Cray XMT," May 5, 2007
Feo, J., "Can programmers and machines ever be friends?"
Breaking the Despotic Rule of the Lock:
Chaudhry, S., "Rock: A SPARC CMT Processor," August 26, 2008