CS 203A Computer Architecture Lecture 10: Multimedia and Multithreading Instructor: L.N. Bhuyan.


Approaches to Mediaprocessing
Approaches to multimedia processing:
–General-purpose processors with SIMD extensions
–Vector processors
–VLIW with SIMD extensions (aka mediaprocessors)
–DSPs
–ASICs/FPGAs

What is Multimedia Processing?
Desktop:
–3D graphics (games)
–Speech recognition (voice input)
–Video/audio decoding (MPEG/MP3 playback)
Servers:
–Video/audio encoding (video servers, IP telephony)
–Digital libraries and media mining (video servers)
–Computer animation, 3D modeling & rendering (movies)
Embedded:
–3D graphics (game consoles)
–Video/audio decoding & encoding (set-top boxes)
–Image processing (digital cameras)
–Signal processing (cellular phones)

Characteristics of Multimedia Apps (1)
Requirement for real-time response:
–An "incorrect" result is often preferable to a slow result
–Unpredictability can be bad (e.g., dynamic execution)
Narrow data types:
–Typical width of data in memory: 8 to 16 bits
–Typical width of data during computation: 16 to 32 bits
–64-bit data types are rarely needed
–Fixed-point arithmetic often replaces floating point
Fine-grain (data) parallelism:
–Identical operation applied to streams of input data
–Branches have high predictability
–High instruction locality in small loops or kernels

Characteristics of Multimedia Apps (2)
Coarse-grain parallelism:
–Most apps are organized as a pipeline of functions
–Multiple threads of execution can be used
Memory requirements:
–High bandwidth requirements, but high latency can be tolerated
–High spatial locality (predictable access pattern) but low temporal locality
–Cache bypassing and prefetching can be crucial

SIMD Extensions for GPPs
Motivation:
–Low media-processing performance of GPPs
–Cost and lack of flexibility of specialized ASICs for graphics/video
–Underutilized datapaths and registers
Basic idea: sub-word parallelism (a software sketch follows this slide)
–Treat a 64-bit register as a vector of two 32-bit, four 16-bit, or eight 8-bit values (short vectors)
–Partition 64-bit datapaths to handle multiple narrow operations in parallel
Initial constraints:
–No additional architectural state (registers)
–No additional exceptions
–Minimum area overhead
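To make sub-word parallelism concrete, here is a minimal C sketch (an illustration of the idea, not any vendor's actual instruction): it emulates one partitioned add of four 16-bit lanes packed into a 64-bit word, masking the lane boundaries so a carry cannot ripple from one lane into the next.

#include <stdint.h>

/* Emulate a 4x16-bit partitioned add inside one 64-bit "register".
   H marks the top bit of each 16-bit lane. */
uint64_t paddw_swar(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8000800080008000ULL;
    uint64_t low = (a & ~H) + (b & ~H);  /* add low 15 bits per lane; any carry stays inside the lane */
    return low ^ ((a ^ b) & H);          /* fold the lane top bits back in without cross-lane carries */
}

A real SIMD extension performs this in a single instruction on a partitioned datapath; the masking above is what the hardware lane boundaries provide for free.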

Overview of SIMD Extensions

Vendor    Extension    Year   # Instr       Registers
HP        MAX-1 and 2  94,95  9, 8 (int)    Int 32x64b
Sun       VIS          95     121 (int)     FP 32x64b
Intel     MMX          97     57 (int)      FP 8x64b
AMD       3DNow!       98     21 (fp)       FP 8x64b
Motorola  AltiVec      98     162 (int,fp)  32x128b (new)
Intel     SSE          98     70 (fp)       8x128b (new)
MIPS      MIPS-3D      ?      23 (fp)       FP 32x64b
AMD       E 3DNow!     99     24 (fp)       8x128b (new)
Intel     SSE-2        01     144 (int,fp)  8x128b (new)

Intel MMX Pipeline

Performance Improvement in MMX Architecture

SIMD Performance Limitations
–Memory bandwidth
–Overhead of handling alignment and data-width adjustments

Other Features for Multimedia
Support for fixed-point arithmetic:
–Saturation, rounding modes, etc.
Permutation instructions on vector registers:
–For reductions and FFTs
–Not general permutations (too expensive)
Example: permutation for reductions (a C sketch follows)
–Move the 2nd half of a vector register into another one
–Repeatedly use it with vadd to execute the reduction
–The vector length is halved after each step
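A minimal scalar C sketch of the reduction idiom above (plain arrays stand in for vector registers; the function name is my own): each step adds the second half onto the first half and halves the length, so an n-element sum takes log2(n) vector steps.

/* Sum-reduce v[0..n-1], with n a power of two. Each outer step is
   one "move 2nd half + vadd" on a real vector unit. */
int reduce_sum(int v[], int n) {
    for (int half = n / 2; half > 0; half /= 2)
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];
    return v[0];
}

For n = 8 this takes 3 steps instead of 7 serial adds.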

Multithreading
Consider the following sequence of instructions through a pipeline:
LW r1, 0(r2)
LW r5, 12(r1)
ADDI r5, r5, #12
SW 12(r1), r5
Each instruction depends on the result of the previous one, so a simple pipeline must stall between every pair.

Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
–One way is to interleave execution of instructions from different program threads on the same pipeline (micro context switching)
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)

Avoiding Memory Latency
General-purpose processors switch to another context on an I/O operation => multithreading, multiprogramming, etc. This is an OS function with large overhead. Why? Because each switch must save and restore the context's state in memory.
Why not context switch on a cache miss? => hardware multithreading. Can we afford that overhead now? => We need architectural changes that avoid the stack operations.
How to achieve it? Keep many contexts CPU-resident (not memory-resident) by giving each thread its own PC and registers; then nothing needs to be stored on a stack at a context switch (a sketch follows below).
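A minimal C sketch of the idea, using hypothetical names (hw_context, imem_read): every hardware thread keeps its own PC and register file on chip, and the fetch stage simply rotates among them each cycle, so no state is ever saved to or restored from a stack.

#include <stdint.h>

#define NTHREADS 4

typedef struct {
    uint32_t pc;
    uint32_t regs[32];   /* private register file: nothing to spill on a switch */
} hw_context;

static hw_context ctx[NTHREADS];         /* all contexts CPU-resident */

extern uint32_t imem_read(uint32_t pc);  /* assumed instruction-memory helper */

/* Fine-grained multithreading: fetch from a different thread every cycle. */
uint32_t fetch_stage(void) {
    static int tid = 0;
    uint32_t instr = imem_read(ctx[tid].pc);
    ctx[tid].pc += 4;
    tid = (tid + 1) % NTHREADS;          /* round-robin "micro context switch" */
    return instr;
}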

Simple Multithreaded Pipeline
Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage

Multithreading Costs
Appears to software (including the OS) as multiple slower CPUs (see the sketch below)
Each thread requires its own user state:
–GPRs
–PC
Each thread also needs its own OS control state:
–virtual memory page-table base register
–exception-handling registers
Other costs?
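Because the logical CPUs look like ordinary processors, the OS simply enumerates them. A small POSIX sketch (assumes Linux/glibc) showing what software sees:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* On a multithreaded part the OS counts each hardware thread
       as a CPU, so this total includes the "slower CPUs" above. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical processors: %ld\n", n);
    return 0;
}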

What “Grain” Multithreading?
So far we have assumed fine-grained multithreading:
–CPU switches to a different thread every cycle
–When does this make sense?
Coarse-grained multithreading (sketched below):
–CPU switches to a different thread every few cycles
–When does this make sense (e.g., on memory accesses? in network processors)?
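For contrast with the fine-grained sketch above, a coarse-grained sketch under the same assumed structures (hw_context, plus hypothetical cache_missed and issue_from helpers): stay with one thread until it takes a long-latency miss, then switch hardware contexts.

#include <stdbool.h>
#include <stdint.h>

#define NTHREADS 4
typedef struct { uint32_t pc; uint32_t regs[32]; } hw_context;
static hw_context ctx[NTHREADS];

extern bool cache_missed(int tid);      /* assumed: did this thread just miss? */
extern void issue_from(hw_context *c);  /* assumed: issue one instruction */

/* Coarse-grained multithreading: switch contexts only on long-latency events. */
void issue_cycle(void) {
    static int tid = 0;
    if (cache_missed(tid))
        tid = (tid + 1) % NTHREADS;     /* hardware context switch on a miss */
    issue_from(&ctx[tid]);
}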

Superscalar Machine Efficiency
Why horizontal waste? Issue slots left unused within a cycle because the thread lacks enough ILP.
Why vertical waste? Entire cycles in which the thread issues nothing, e.g., while waiting on a miss.

Vertical Multithreading
Cycle-by-cycle interleaving of a second thread removes vertical waste

Ideal Multithreading for Superscalar
Interleave multiple threads across multiple issue slots with no restrictions

Simultaneous Multithreading
Add multiple contexts and fetch engines to a wide out-of-order superscalar processor [Tullsen, Eggers, Levy, UW, 1995]
–The OOO instruction window already has most of the circuitry required to schedule from multiple threads
–Any single thread can utilize the whole machine

Comparison of Issue Capabilities (courtesy of Susan Eggers; used with permission)

From Superscalar to SMT
Small items:
–per-thread program counters
–per-thread return stacks
–per-thread bookkeeping for instruction retirement, trap & instruction-dispatch-queue flush
–thread identifiers, e.g., with BTB & TLB entries

Simultaneous Multithreaded Processor

Intel Pentium-4 Xeon Processor
Hyper-Threading == SMT
Dual physical processors, each 2-way SMT
Logical processors share nearly all resources of the physical processor:
–Caches, execution units, branch predictors
Die-area overhead of hyperthreading is ~5%
When one logical processor is stalled, the other can make progress
–No logical processor can use all entries in the queues when two threads are active
A processor running only one active software thread runs at nearly the same speed with or without hyperthreading

Intel Hyperthreading Implementation (see the attached paper)
Note the separate buffer space/registers for the second thread

Intel Xeon Performance