Hyperthreading Technology


Hyperthreading Technology
Aleksandar Milenkovic
Electrical and Computer Engineering Department
University of Alabama in Huntsville
milenka@ece.uah.edu
www.ece.uah.edu/~milenka/

Outline
What is hyperthreading?
Trends in microarchitecture
Exploiting thread-level parallelism
Hyperthreading architecture
Microarchitecture choices and trade-offs
9/20/2018  A. Milenkovic

What is hyperthreading?
SMT - Simultaneous multithreading: make one physical processor appear as multiple logical processors to the OS and software
Intel Xeon for the server market, early 2002
Pentium 4 for the consumer market, November 2002
Motivation: boost performance by up to 25% at the cost of roughly 5% additional die area
Hyperthreading brings the SMT concept to the Intel architecture. It was first introduced in the Intel Xeon processor (serving the server market), then in the Pentium 4 processor. The goal is to improve performance at minimal cost: one physical processor appears as multiple logical processors to the operating system and software.
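The OS-visible effect described above can be observed directly. A minimal sketch, assuming a modern OS where `os.cpu_count()` reports logical (not physical) processors; on a hyperthreaded machine this count is typically twice the number of physical cores:

```python
import os

# The OS enumerates logical processors; with hyperthreading enabled,
# each physical core contributes two entries to this count.
logical = os.cpu_count()
print(f"Logical processors visible to the OS: {logical}")
```

On Linux, `/proc/cpuinfo` distinguishes physical cores from their hyperthreaded siblings via the "core id" and "siblings" fields, which is one way to recover the physical-core count.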

Trends in microarchitecture
Higher clock speeds: to achieve high clock frequency, make the pipeline deeper (superpipelining). Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles.
ILP - Instruction Level Parallelism: extract parallelism within a single program. Superscalar processors have multiple execution units working in parallel. The challenge is to find enough instructions that can be executed concurrently. Out-of-order execution: instructions are sent to execution units based on instruction dependencies rather than program order.
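The dependence-driven issue order behind out-of-order execution can be sketched with a toy dataflow scheduler. The instruction names and dependencies below are invented for illustration; each "cycle", every instruction whose operands are ready issues, regardless of program order:

```python
# Toy dataflow scheduler: instructions issue when their inputs are
# ready (dependence order), not in program order.
program = [
    ("i0", []),            # no dependencies
    ("i1", ["i0"]),        # depends on i0
    ("i2", []),            # independent of i0/i1 -> issues alongside i0
    ("i3", ["i1", "i2"]),  # must wait for both
]

done, cycles = set(), []
while len(done) < len(program):
    issued = [name for name, deps in program
              if name not in done and all(d in done for d in deps)]
    cycles.append(issued)
    done.update(issued)

print(cycles)  # [['i0', 'i2'], ['i1'], ['i3']]
```

Note how i2, though later in program order, issues in the first cycle: this is exactly the parallelism a superscalar out-of-order core extracts.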

Trends in microarchitecture
Cache hierarchies: the processor-memory speed gap motivates caches to reduce memory latency; multiple levels of caches, smaller and faster closer to the processor core.
Thread-level parallelism: multiple programs execute concurrently. Web servers have an abundance of software threads. Users surf the web, listen to music, encode/decode video streams, etc.
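The software threads mentioned above are what the logical processors execute. A minimal sketch using Python's `threading` module (worker names and workload are illustrative; note that CPython's GIL limits true parallelism for pure-Python compute, but the OS still schedules these threads across logical processors):

```python
import threading

results = {}

def worker(name, n):
    # Each software thread performs independent work; the OS is free
    # to place the threads on separate logical processors.
    results[name] = sum(range(n))

threads = [threading.Thread(target=worker, args=(f"t{i}", 1000))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # four independent results, one per software thread
```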

Exploiting thread-level parallelism
CMP - Chip Multiprocessing: multiple processors, each with a full set of architectural resources, reside on the same die. Processors may share an on-chip cache, or each can have its own cache. Examples: HP Mako, IBM Power4. Challenges: power, die area (cost).
Time-slice multithreading: the processor switches between software threads after a predefined time slice. This can minimize the effects of long-lasting events, but some execution slots are still wasted.

Exploiting thread-level parallelism
Switch-on-event multithreading: the processor switches between software threads after an event (e.g., a cache miss). This works well, but the parallelism is still coarse-grained (e.g., data dependences and branch mispredictions still waste cycles).
SMT - Simultaneous multithreading: multiple software threads execute on a single processor without switching. SMT has the potential to maximize performance relative to transistor count and power.
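The switch-on-event policy can be sketched as a toy simulation. The instruction streams and the one-cycle cost model are invented for illustration; the point is only the control policy, i.e. run the current thread until it hits a long-latency event, then switch:

```python
from collections import deque

# 'op' completes normally; 'miss' models a long-latency cache miss
# that triggers a thread switch (the miss resolves in the background).
streams = {"A": deque(["op", "miss", "op"]), "B": deque(["op", "op"])}
order = deque(["A", "B"])   # round-robin order of runnable threads
trace = []

while any(streams.values()):
    tid = order[0]
    if not streams[tid]:        # thread finished: skip to the next one
        order.rotate(-1)
        continue
    instr = streams[tid].popleft()
    trace.append((tid, instr))
    if instr == "miss":         # switch-on-event: yield the pipeline
        order.rotate(-1)

print(trace)
```

The trace shows thread A yielding at its miss so B's independent work fills the cycles, which is exactly the latency-hiding the slide describes; what this coarse-grained scheme cannot do, unlike SMT, is mix both threads' instructions in the same cycle.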

Hyperthreading architecture
One physical processor appears as multiple logical processors. The HT implementation on the NetBurst microarchitecture has 2 logical processors, each with its own architectural state: general purpose registers, control registers, and the APIC (advanced programmable interrupt controller). The processor execution resources are shared between the logical processors.

Hyperthreading architecture
Main processor resources are shared: caches, branch predictors, execution units, buses, control logic.
Duplicated resources: register alias tables (map the architectural registers to physical rename registers), the next-instruction pointer and associated control logic, the return stack pointer, the instruction streaming buffer and trace cache fill buffers.
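The role of the duplicated register alias tables can be sketched as follows. This is a toy model, not the NetBurst implementation: the pool size and allocation policy are invented. Each logical processor has its own table, so the same architectural register name maps to a different physical rename register per thread:

```python
# Toy register alias table (RAT): per-thread map from architectural
# register names to physical rename registers drawn from a shared pool.
free_regs = list(range(8))   # shared physical register pool (size arbitrary)
rat = {0: {}, 1: {}}         # one RAT per logical processor

def rename(thread, arch_reg):
    """Allocate a free physical register and record the mapping."""
    phys = free_regs.pop(0)
    rat[thread][arch_reg] = phys
    return phys

p0 = rename(0, "eax")   # logical processor 0 writes EAX
p1 = rename(1, "eax")   # logical processor 1 writes EAX
print(p0, p1)           # same architectural name, distinct physical registers
```

Because each thread resolves "eax" through its own table, the two threads' register states never collide even though the physical register file itself is shared.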

Die Size and Complexity

Resource sharing schemes
Partition: dedicate equal resources to each logical processor. Good when high utilization is expected and demand is somewhat unpredictable.
Threshold: flexible resource sharing with a limit on maximum resource usage. Good for small resources with bursty utilization, where the micro-ops stay in the structure for short, predictable periods.
Full sharing: flexible sharing with no limits. Good for large structures with variable working-set sizes.
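The threshold scheme can be sketched as a shared queue with a per-thread occupancy cap. The class name, capacity, and limit below are illustrative, not taken from the hardware:

```python
from collections import deque

class ThresholdQueue:
    """Shared queue where each logical processor may occupy at most
    `limit` entries: the 'threshold' sharing scheme in miniature."""
    def __init__(self, capacity, limit):
        self.q, self.capacity, self.limit = deque(), capacity, limit

    def try_enqueue(self, thread_id, uop):
        in_use = sum(1 for t, _ in self.q if t == thread_id)
        if len(self.q) < self.capacity and in_use < self.limit:
            self.q.append((thread_id, uop))
            return True
        return False   # this thread hit its threshold; it cannot starve the other

q = ThresholdQueue(capacity=8, limit=5)
results = [q.try_enqueue(0, i) for i in range(6)]
print(results)   # the sixth enqueue from thread 0 is rejected
```

Even though three queue entries remain free, thread 0's sixth micro-op is refused, preserving space for the other logical processor.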

Shared vs. partitioned queues
(a) Cycle 0: 2 light shaded (fast) and 2 dark shaded (slow) micro-ops.
(b) Cycle 1: light shaded micro-op 0 is sent down the pipeline. In the shared queue, the previous pipeline stage sends dark shaded micro-op 2; in the partitioned queue the slow thread already occupies two entries, so the previous stage sends light shaded micro-op 2.
(c) Cycle 2: light shaded micro-op 1 is sent down the pipeline. The shared queue accepts light shaded micro-op 2, while the partitioned queue accepts the next light shaded micro-op 3.
(d) Cycle 3: light shaded micro-op 2 is sent down the pipeline. In the shared queue, dark shaded micro-op 3 enters the queue; the partitioned queue accepts another light shaded micro-op 4.
(e) Eventually, the shared queue holds only dark shaded (slow) micro-ops and blocks the other thread. The partitioned queue avoids this problem.
9/20/2018  A. Milenkovic
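The blocking scenario in the walkthrough can be condensed into a toy model. The capacity and the halving policy are illustrative assumptions; the point is that the partitioned queue caps the slow thread at half the entries, so the fast thread always retains room:

```python
from collections import deque

CAP = 4
shared = deque()
partitioned = {"fast": deque(), "slow": deque()}

# A slow thread's stalled micro-ops pile up. In the shared queue they
# consume every entry; in the partitioned queue the slow thread is
# capped at its half, leaving the fast thread's half untouched.
for i in range(CAP):
    shared.append(("slow", i))                 # fills the whole shared queue
    if len(partitioned["slow"]) < CAP // 2:    # slow thread capped at half
        partitioned["slow"].append(i)

shared_blocked = len(shared) >= CAP            # fast thread cannot enter
fast_ok = len(partitioned["fast"]) < CAP // 2  # fast thread still has room
print(shared_blocked, fast_ok)
```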

NetBurst Pipeline (figure indicating which pipeline resources use threshold sharing and which are partitioned)

Shared vs. partitioned resources
Partitioned resources: e.g., the major pipeline queues.
Threshold: puts a limit on the number of resource entries a logical processor can occupy; e.g., the scheduler.
Fully shared resources: e.g., caches. Interference is modest, and there is a benefit when the threads share code and/or data.

Scheduler occupancy

Shared vs. partitioned cache

Performance Improvements