SYNAR Systems Networking and Architecture Group
CMPT 886: Computer Architecture Primer
Dr. Alexandra Fedorova
School of Computing Science, SFU

Outline
– Caches
– Branch prediction
– Out-of-order execution
– Instruction-level parallelism

Caches
– Level 1 / Level 2 / Level 3
– Instruction/data or unified

Direct-Mapped Cache
– Line size = 32 bytes
– Each address maps to exactly one cache line, so two addresses with the same index evict each other (cache eviction)
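With 32-byte lines, the low five address bits select a byte within the line, the next bits pick the line (the index), and the remaining high bits form the tag that is compared on lookup. A minimal sketch in C; the cache size (1024 lines, i.e., 32 KB) and the 32-bit address are illustrative assumptions, not from the slide:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 32    /* bytes per line, from the slide       */
    #define NUM_LINES 1024  /* assumed: a 32 KB direct-mapped cache */

    int main(void) {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & (LINE_SIZE - 1);              /* bits 0-4   */
        uint32_t index  = (addr / LINE_SIZE) & (NUM_LINES - 1);/* bits 5-14  */
        uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);      /* bits 15-31 */
        printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
        return 0;
    }

Two addresses with the same index but different tags collide: loading one evicts the other, which is the eviction the slide refers to.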

Set-Associative Cache
– 4-way set-associative cache: the data can go into any of the four locations in its set
– When the entire set is full, which line should we replace?
– LRU: replace the least recently used line (LRU stack)
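A 4-way set keeps its lines ordered from most to least recently used; on an access the line moves to the front, and on a miss with a full set the line at the back is the victim. A sketch of that LRU stack for a single set (tag matching and data handling omitted):

    #include <string.h>

    #define WAYS 4

    /* lru[0] is the most recently used way; lru[WAYS-1] is the victim. */
    typedef struct { int lru[WAYS]; } lru_stack_t;

    /* After an access to 'way', move it to the front of the stack. */
    static void touch(lru_stack_t *s, int way) {
        int pos = 0;
        while (s->lru[pos] != way) pos++;
        memmove(&s->lru[1], &s->lru[0], pos * sizeof(int));
        s->lru[0] = way;
    }

    /* When the whole set is full, replace the least recently used way. */
    static int victim(const lru_stack_t *s) { return s->lru[WAYS - 1]; }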

Cache Hit/Miss
– Cache hit: the data is found in the cache
– Cache miss: the data is not in the cache
– Miss rate can be expressed as:
  – misses per instruction
  – misses per cycle
  – misses per access (also called miss ratio)
– Hit rate: the complement (1 − miss ratio)
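The three miss-rate variants differ only in the denominator. A tiny bookkeeping sketch; the counters would come from hardware performance counters or a simulator, and the names are illustrative:

    /* Event counts gathered during a run (names are hypothetical). */
    unsigned long misses, accesses, instructions, cycles;

    double misses_per_instruction(void) { return (double)misses / instructions; }
    double misses_per_cycle(void)       { return (double)misses / cycles; }
    double miss_ratio(void)             { return (double)misses / accesses; }
    double hit_ratio(void)              { return 1.0 - miss_ratio(); } /* the complement */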

Cache Miss Latency
– How long you have to wait if you miss in the cache
– Miss in L1 → pay the L2 latency (~20 cycles)
– Miss in L2 → pay the memory latency (~300 cycles, if there is no L3)
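These latencies combine into the average memory access time (AMAT). Using the slide's ~20-cycle L2 and ~300-cycle memory latencies, and assuming, purely for illustration, a 2-cycle L1 hit time, a 5% L1 miss rate, and a 20% L2 miss rate:

    \mathrm{AMAT} = t_{L1} + m_{L1}\,(t_{L2} + m_{L2}\,t_{\mathrm{mem}})
                  = 2 + 0.05 \times (20 + 0.20 \times 300)
                  = 2 + 0.05 \times 80 = 6 \text{ cycles}

Even a small miss rate matters: the 300-cycle memory term dominates as soon as misses leak past L2.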

Writing in Cache
– Write-through: write directly to memory on every store
– Write-back: write to memory later, when the line is evicted
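A sketch of the two policies for a single cache line; memory_write_byte and memory_write_line are hypothetical stand-ins for the actual memory interface:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t tag;
        bool     valid;
        bool     dirty;      /* meaningful only for write-back */
        uint8_t  data[32];
    } line_t;

    /* Stubs standing in for real memory traffic. */
    static void memory_write_byte(uint32_t t, int o, uint8_t v) { (void)t; (void)o; (void)v; }
    static void memory_write_line(uint32_t t, const uint8_t *d) { (void)t; (void)d; }

    /* Write-through: every store updates both the cache and memory. */
    static void store_write_through(line_t *l, int off, uint8_t v) {
        l->data[off] = v;
        memory_write_byte(l->tag, off, v);
    }

    /* Write-back: stores touch only the cache and set the dirty bit;
       memory is updated later, when the line is evicted. */
    static void store_write_back(line_t *l, int off, uint8_t v) {
        l->data[off] = v;
        l->dirty = true;
    }

    static void evict(line_t *l) {
        if (l->dirty)
            memory_write_line(l->tag, l->data);
        l->valid = l->dirty = false;
    }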

Caches on Multiprocessor Systems
[Figure: processors with private caches connected by a bus to shared memory. © Herlihy-Shavit 2007]

Processor Issues Load Request
[Figure: a processor requests data over the bus; memory supplies it and a copy stays in the processor's cache. © Herlihy-Shavit 2007]

Another Processor Issues Load Request
[Figure: a second processor broadcasts "I want data" on the bus; the cache that already holds the data answers "I got data", and both caches now hold a copy. © Herlihy-Shavit 2007]

Processor Modifies Data
[Figure: one processor writes its cached copy; the other copies of the data are now invalid. © Herlihy-Shavit 2007]

Send Invalidation Message to Others
[Figure: the writing processor broadcasts "Invalidate!" on the bus. © Herlihy-Shavit 2007]
– Other caches lose read permission on the line
– No need to update memory now: valid data can still be provided from a cache

Processor Asks for Data
[Figure: a processor whose copy was invalidated broadcasts "I want data"; the cache holding the modified line supplies it over the bus. © Herlihy-Shavit 2007]
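The figures walk through an invalidation-based coherence protocol: a writer first invalidates all other copies, and a later reader fetches the modified data from the owning cache. The slides don't name a specific protocol; below is a minimal sketch in the style of MESI (an assumption), tracking one line's state in each of two caches:

    #define NCACHES 2

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

    /* State of one cache line in every cache on the bus. */
    typedef struct { mesi_t state[NCACHES]; } line_states_t;

    /* Cache c writes the line: broadcast "Invalidate!" so every
       other copy is dropped, then hold the only (modified) copy. */
    static void on_write(line_states_t *l, int c) {
        for (int i = 0; i < NCACHES; i++)
            if (i != c) l->state[i] = INVALID;
        l->state[c] = MODIFIED;
    }

    /* Cache c reads the line: if a peer holds it MODIFIED, that
       peer supplies the data and both copies become SHARED. */
    static void on_read(line_states_t *l, int c) {
        for (int i = 0; i < NCACHES; i++)
            if (l->state[i] == MODIFIED) l->state[i] = SHARED;
        if (l->state[c] == INVALID) l->state[c] = SHARED;
    }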

Shared Caches
– Filled on demand
– No control over cache shares
– An aggressive thread can grab a large cache share and hurt others
[Figure: two threads sharing one cache; an aggressive thread's data crowds out the other's]

NUMA Systems
[Figure: a four-domain NUMA machine. Each NUMA domain (0–3) holds four cores, each with private L1 and L2 caches (e.g., cores 0/4/8/12 in domain 0, cores 1/5/9/13 in domain 1, and so on), a shared L3 cache, and a memory controller (MC) attached to its local memory node; HyperTransport (HT) links connect the domains. Threads run on every core, while the data may live in a single memory node.]
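On such a machine, where memory is placed matters: a core reaches its local memory node faster than a node behind one or more HT hops. A sketch using Linux's libnuma (compile with -lnuma); pinning the threads near the chosen node is left out for brevity:

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }
        size_t sz = 1 << 20;
        /* Allocate the buffer on memory node 0, local to NUMA domain 0. */
        void *buf = numa_alloc_onnode(sz, 0);
        if (buf == NULL) return 1;
        /* ... run the threads that use buf on the cores of domain 0 ... */
        numa_free(buf, sz);
        return 0;
    }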

Outline
– Caches
– Branch prediction
– Out-of-order execution
– Instruction-level parallelism

Branching and CPU Pipeline
[Figure: the stages of a CPU pipeline]

SYNAR Systems Networking and Architecture Group Branching Hurts Pipelining

SYNAR Systems Networking and Architecture Group Branch Prediction
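A branch predictor learns repeating patterns of taken/not-taken outcomes; a branch that depends on random data defeats it. The classic way to see this is to time the same loop over random and then sorted data (the effect and the timings are machine-dependent):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000
    static int data[N];

    static int cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = rand() % 256;
        /* qsort(data, N, sizeof(int), cmp); */  /* uncomment: the branch
                                                    below becomes almost
                                                    perfectly predictable */
        long long sum = 0;
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)   /* ~50% taken, at random, on unsorted data */
                sum += data[i];
        printf("sum=%lld time=%.2fs\n",
               sum, (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }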

Outline
– Caches
– Branch prediction
– Out-of-order execution
– Instruction-level parallelism

Out-of-Order Execution
– Modern CPUs are superscalar: they can issue more than one instruction per clock cycle
– If consecutive instructions depend on each other, instruction-level parallelism is limited
– To keep the processor going at full speed, issue instructions out of order
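The dependence constraint is easy to see in code. Both functions below compute the same sum, but the first is one serial dependency chain, while the second keeps four independent chains in flight, which an out-of-order superscalar core can execute in parallel:

    /* One long dependency chain: every add needs the previous
       result, so the adds cannot overlap. */
    double sum_chained(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];            /* s depends on the previous s */
        return s;
    }

    /* Four independent accumulators expose instruction-level
       parallelism; the core can keep several adds in flight. */
    double sum_ilp(const double *a, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];   s1 += a[i + 1];
            s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; i++) s0 += a[i];   /* leftover elements */
        return (s0 + s1) + (s2 + s3);
    }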

Speculative Execution
– Out-of-order execution is limited to basic blocks (straight-line code with a single entry and a single exit)
– To go beyond basic blocks, use speculative execution: predict the branch and keep executing past it before the outcome is known
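Block boundaries marked on a small, purely illustrative function; the hardware can reorder freely inside each block, but crossing the branch requires a prediction:

    int f(int x) {
        int a = x * 2;   /* block 1: no branches, freely reorderable */
        int b = a + 3;   /* still block 1                            */
        if (b > 10)      /* block 1 ends at this branch              */
            b -= 10;     /* block 2: executed speculatively if the
                            CPU predicts the branch taken            */
        return a + b;    /* block 3: where the two paths rejoin      */
    }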

Outline
– Caches
– Branch prediction
– Out-of-order execution
– Instruction-level parallelism

Instruction-Level Parallelism
– Many programs fail to keep the processor busy:
  – code with lots of loads
  – code with frequent and unpredictable branches
– CPU cycles are wasted: power is consumed, but no useful work is done
– Running multiple threads on the chip helps fill these wasted cycles