Opportunities for Hardware Multithreading in Microprocessors and Microcontrollers
Theo Ungerer
Systems and Networking, University of Augsburg
ungerer@informatik.uni-augsburg.de
http://www.informatik.uni-augsburg.de/sik/
Basic Principle of Multithreading
[Figure: four register sets (1-4), each with its own PC and PSR, holding the contexts of threads 1-4; a thread pointer selects the active register set.]
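The register-set organization on the slide can be sketched in software. This is a hedged illustration of the principle only; the class and field names are hypothetical and do not describe any particular processor's design:

```python
# Each hardware thread keeps its own register set, program counter (PC)
# and processor status register (PSR); a thread pointer selects the
# active set, so a context switch is just a pointer update.

class ThreadContext:
    def __init__(self, tid):
        self.tid = tid
        self.pc = 0                # per-thread program counter
        self.psr = 0               # per-thread processor status register
        self.registers = [0] * 32  # private register set

class MultithreadedCore:
    def __init__(self, num_threads=4):
        self.contexts = [ThreadContext(t) for t in range(num_threads)]
        self.thread_pointer = 0    # selects the active register set

    def switch_to(self, tid):
        # No register save/restore needed: just redirect the pointer.
        self.thread_pointer = tid

    @property
    def active(self):
        return self.contexts[self.thread_pointer]

core = MultithreadedCore()
core.active.pc = 100   # thread 0 runs
core.switch_to(2)      # zero-cost switch to thread 2
core.active.pc = 500
core.switch_to(0)
print(core.active.pc)  # thread 0's PC is preserved: 100
```

Because no state is copied on a switch, the switch cost in hardware can be zero cycles, which is the property the Komodo approach later exploits.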
Multithreading in High-Performance Processors
Hardware multithreading is the ability to pursue more than one thread within a processor pipeline.
Typical features: multiple register sets, fast context switching
Main objective: performance gain by latency hiding for multithreaded workloads
Multithreading in high-performance microprocessors:
- IBM RS64 IV (SStar)
- Sun UltraSPARC V
- Intel Xeon
Outline of the Presentation
- Motivation
- State-of-the-art Multithreading
- Multithreading for throughput increase
- Multithreading for power reduction
- Multithreading for embedded real-time systems
- Conclusions & Research Opportunities
Today's Multiple-issue Processors
Instruction-level parallelism is exploited by a long instruction pipeline and by the superscalar or the VLIW/EPIC technique.
Problem: Low Resource Utilization by Sequential Programs
[Figure: issue slots over processor cycles. Empty issue slots cause losses: a vertical loss of 4 when a whole cycle issues nothing, and horizontal losses of 1-3 when a cycle is only partly filled.]
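The vertical and horizontal losses from the figure can be quantified with a short sketch. The per-cycle issue counts below are illustrative, not measured data:

```python
# On a 4-wide issue processor, a cycle that issues nothing loses all
# 4 slots (vertical loss); a partly filled cycle loses the remainder
# (horizontal loss).

ISSUE_WIDTH = 4
# Hypothetical instructions issued per cycle for a single-threaded run:
issued_per_cycle = [2, 4, 0, 3, 1, 4, 0, 2]

vertical_loss = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
horizontal_loss = sum(ISSUE_WIDTH - n
                      for n in issued_per_cycle if 0 < n < ISSUE_WIDTH)
total_slots = ISSUE_WIDTH * len(issued_per_cycle)
utilization = sum(issued_per_cycle) / total_slots

print(vertical_loss, horizontal_loss, utilization)  # 8 8 0.5
```

Even this mild example wastes half of all issue slots, which is the resource that multithreading tries to reclaim.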
Multithreading
Two basic multithreading techniques:
- Interleaved multithreading
- Block multithreading
Simultaneous multithreading (SMT) combines a wide-issue superscalar with multithreading and issues instructions from several threads simultaneously.
Basic Multithreading Techniques
[Figure: issue-slot diagrams comparing single-threaded execution, interleaved MT and block MT.]
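The difference between the two basic techniques can be modeled as which thread owns each pipeline cycle. This is an illustrative sketch with hypothetical names; SMT, by contrast, would let several threads share the slots of one and the same cycle:

```python
def interleaved_mt(threads, cycles):
    # Interleaved MT: a different thread every cycle, round-robin.
    return [threads[c % len(threads)] for c in range(cycles)]

def block_mt(threads, stall_after, cycles):
    # Block MT: one thread runs until a long-latency event (modeled as
    # occurring every `stall_after` cycles, e.g. a cache miss), then the
    # processor switches to the next thread.
    schedule, t, run = [], 0, 0
    for _ in range(cycles):
        schedule.append(threads[t])
        run += 1
        if run == stall_after:
            t, run = (t + 1) % len(threads), 0
    return schedule

print(interleaved_mt(["A", "B"], 4))  # ['A', 'B', 'A', 'B']
print(block_mt(["A", "B"], 3, 6))     # ['A', 'A', 'A', 'B', 'B', 'B']
```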
SMT vs. CMP
[Figure: issue-slot diagrams comparing simultaneous multithreading (SMT) with a chip multiprocessor (CMP).]
Characteristics of Multithreading
- Latency utilization: the latencies that arise in the computation of a single instruction stream are filled by computations of another thread; the throughput of multithreaded workloads is increased.
- Power reduction: by using less speculation.
- Rapid context switching: appropriate for real-time applications.
Multithreading for Throughput Increase
Lots of research results with simulated SMT since 1995.
Some of our own research results:
- Performance estimation of SMT multimedia processor models
- Transistor count and chip-space estimation of the models
Relevant Attributes for Rating Microprocessors
- Performance
- Resource requirement
- Clock speed
- Power consumption
Two tools:
- Performance estimation tool
- Transistor count and chip-space estimation tool
Transistor Count and Chip-space Estimator
Vision: the resources of the baseline model should be adjusted such that the same chip space or the same transistor count is covered as in the new microarchitecture models.
We use an analytical method for memory-based structures like register files or internal queues, and an empirical method for logic blocks like control logic and functional units. The half-feature size serves as the measure of length of a basic cell.
The estimator tool is available (also for SimpleScalar) at:
http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/complexity/
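The analytical idea for memory-based structures can be sketched as follows. All constants here are illustrative placeholders, not the tool's real cell dimensions; the point is only that area is measured in units of the half-feature size and that ports dominate the cell size:

```python
# Area of a memory-based structure (register file, queue) grows with its
# entries, bit width and port count; each extra port adds a word line and
# a bit line to every cell, enlarging both cell dimensions.

def ram_cell_area(read_ports, write_ports, lam=1.0):
    # `lam` is the half-feature size; 6 and 3 are hypothetical constants.
    edge = 6 * lam + 3 * lam * (read_ports + write_ports)
    return edge * edge

def structure_area(entries, bits, read_ports, write_ports):
    return entries * bits * ram_cell_area(read_ports, write_ports)

# A 32-entry, 32-bit register file with 8 read / 4 write ports vs. a
# single-ported file of the same capacity:
multi = structure_area(32, 32, 8, 4)
single = structure_area(32, 32, 1, 1)
print(multi / single)  # 12.25: port count, not capacity, dominates area
```

This is why adding register sets for more threads is comparatively cheap, while widening the issue bandwidth (which adds ports everywhere) is expensive, consistent with the small overheads reported later in the talk.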
Execution-based Simulator: Baseline SMT Multimedia Processor Model
Results of Performance and Hardware Cost Estimation
Demonstrated by two sets of models:
1. "Maximum" processor models with an abundance of resources
2. Small processor models
The workload is an MPEG-2 decoder made multithreaded.
Simulation Parameters
Fixed parameters:
- 1024-entry BTAC; gshare branch predictor (2 K 2-bit counters, 8-bit history, misprediction penalty 5 cycles)
- 4-way set-associative D- and I-caches with 32-byte cache lines
- 32 KB local on-chip RAM
- 64-bit system bus, 4 MB main memory
Varied parameters:
- 8 to 12 execution units
- 256- and 32-entry reservation stations
- 10 to 4 result buses
- Different D-cache sizes; D- and I-caches of 4 MB and 64 KB
Parameters varied with the number of threads:
- 32 32-bit general-purpose registers and 40 rename registers (per thread)
- 32- and 16-entry issue and retirement buffers (per thread)
- Fetch and decode bandwidth scaled with issue bandwidth and number of threads: 1x1 - 8x8
Performance vs. Hardware Cost Estimation: Maximum Processor Models
4 MB I- and D-caches, 6 integer/multimedia units, 2 local load/store units
Transistor Count and Chip Space Estimation of Maximum Processor Models
Small Processor Models
64 KB I- and D-caches, 3 integer/multimedia units, 1 local load/store unit
32-entry reservation stations, 16-entry issue and retirement buffers
4 result buses, 2x4 fetch and decode bandwidth fixed
Transistor Count and Chip Space Estimation of Small Processor Models
Results
4-threaded 8-issue SMT over a single-threaded 8-issue:
                Speedup   Transistor Increase   Chip Space Increase
Maximum model:  3         2%                    9%
Small model:    1.5       9%                    27%
Commercial multithreaded processors:
- Tera, MAJC, Alpha 21464, IBM Blue Gene, Sun UltraSPARC V
- Network processors (Intel IXP, IBM PowerNP, Vitesse IQ2x00, Lextra, ...)
- IBM RS64 IV: two-threaded block MT, reported 5% overhead
- Intel Xeon (hyper-threading): two-threaded SMT, reported 5% overhead
SMT for Reduction of Power Consumption
Observation: mispredictions cost energy. In today's superscalars, ~60% of the fetched and ~30% of the executed instructions are squashed.
Idea: fill issue slots with less speculative instructions of other threads.
Simulations of Seng et al. 2000 show that ~22% less energy is consumed by using a power-aware scheduler.
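One way such a power-aware scheduler can be sketched (a hedged illustration of the idea, not the scheduler of Seng et al.): each cycle, prefer the thread with the fewest unresolved branches in flight, since its instructions are least likely to be squashed and thus least likely to waste fetch and execution energy.

```python
def pick_thread(unresolved_branches):
    # unresolved_branches: {thread_id: count of in-flight branches}.
    # The least speculative thread wins the fetch slot this cycle.
    return min(unresolved_branches, key=unresolved_branches.get)

# Thread 1 has no unresolved branches, so its instructions are certain
# to be on the correct path:
print(pick_thread({0: 3, 1: 0, 2: 2}))  # 1
```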
Multithreading in Embedded Real-time Systems - The Komodo Approach
Observation: multithreading allows a context-switching overhead of zero cycles.
Idea: harness multithreading for embedded real-time systems.
Komodo project: real-time Java based on a multithreaded Java microcontroller
http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/komodo/indexEng.html
Real-time Requirements
- Run-time predictability
- Isolation of the threads
- Programmability
- Real-time scheduling support
- Fast context switching
Hard real-time: a deadline may never be missed.
Soft real-time: a deadline may occasionally be missed.
Komodo Solutions
- Extremely fast context switching by hardware multithreading
- Real-time scheduling in hardware
- Based on a Java processor core
- Predictability of all instruction executions by a careful hardware design
Komodo Microcontroller Pipeline
Komodo Microcontroller Design
Hardware Real-time Scheduling
The real-time scheduler is realized in hardware (by the priority manager); a scheduling decision is taken every clock cycle.
Four different scheduling algorithms are implemented:
- Fixed Priority Preemptive (FPP)
- Earliest Deadline First (EDF)
- Least Laxity First (LLF)
- Guaranteed Percentage (GP)
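Two of the listed schemes can be illustrated with a minimal software model (in Komodo this decision is taken by the priority manager every clock cycle; here it is just a function call, and the thread parameters are invented for the example):

```python
# threads: {tid: (deadline, remaining_cycles_of_work)}

def edf(threads, now):
    # Earliest Deadline First: pick the nearest deadline.
    # (`now` is unused here; kept for symmetry with llf.)
    return min(threads, key=lambda t: threads[t][0])

def llf(threads, now):
    # Least Laxity First: laxity = time to deadline minus remaining work.
    return min(threads, key=lambda t: (threads[t][0] - now) - threads[t][1])

threads = {0: (100, 30), 1: (80, 10), 2: (90, 85)}
print(edf(threads, now=0))  # thread 1: deadline 80 is nearest
print(llf(threads, now=0))  # thread 2: laxity 5 vs. 70 and 70
```

The example is chosen so the two schemes disagree: EDF ignores how much work remains, while LLF favors the thread closest to missing its deadline.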
Guaranteed Percentage Scheme
[Figure: GP scheduling on a conventional vs. a multithreaded processor - context-switch overhead, surplus cycles and deadline violations.]
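The Guaranteed Percentage idea can be sketched as follows (a hedged software model; the interval length, shares and the fixed tie-breaking order are illustrative, not Komodo's actual hardware behavior): over a fixed interval, every thread is granted a configured share of the pipeline cycles, and the scheduler issues only from threads still below their quota.

```python
def gp_schedule(shares, interval=100):
    # shares: {tid: guaranteed percentage of cycles per interval}
    quota = {t: interval * p // 100 for t, p in shares.items()}
    used = {t: 0 for t in shares}
    for _ in range(interval):
        eligible = [t for t in shares if used[t] < quota[t]]
        if not eligible:
            break                 # all quotas met before the interval ends
        used[eligible[0]] += 1    # simple fixed order among eligible threads
    return used

# Every thread receives exactly its guaranteed share of the interval:
print(gp_schedule({0: 50, 1: 30, 2: 20}))  # {0: 50, 1: 30, 2: 20}
```

Because the shares are enforced per interval in hardware, one thread cannot starve another, which gives the isolation property claimed for GP scheduling later in the talk.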
Simulation Results
A thread mix (IC, PID, and FFT) was applied.
Technical Data of the Komodo Prototype
Implementation of the Komodo core pipeline on a Xilinx XCV800 with 800k gates.
ASIC synthesis of the whole microcontroller (0.18 µm technology): 340 MHz, 3 mm² chip.
- Data bit width: 32 bit
- Address space: 19 bit
- Number of threads: 4
- Instruction window size: 8 bytes
- Stack size: 128 entries
- External frequency: 33 MHz
- Internal frequency: 8.25 MHz
- CLBs: 9,200
- Number of gates: 133,000
Chip-Space of Komodo Core Pipeline
Reducing Power Consumption Using Real-time Scheduling in Hardware
Current work. Idea: use the information about thread states and configurations available within the priority manager for a "fine-grained" adaptation of power consumption and performance. Frequency and voltage adjustments in short time intervals are done by hardware.
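A hedged sketch of what such a fine-grained adaptation could look like (the scaling rule and numbers are purely illustrative assumptions, not the project's mechanism): since the priority manager knows how much guaranteed work remains in the current interval, the hardware can run just fast enough to meet the quotas.

```python
def required_frequency(remaining_work_cycles, cycles_left_in_interval, f_max):
    # Scale frequency to the fraction of the remaining interval that is
    # actually needed; voltage could then be lowered accordingly.
    if cycles_left_in_interval == 0:
        return f_max
    load = remaining_work_cycles / cycles_left_in_interval
    return min(f_max, f_max * load)

# Quarter load: 25 guaranteed cycles of work, 100 cycles of interval left.
print(required_frequency(25, 100, 340e6))  # 85000000.0, i.e. 85 MHz
```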
State of the Komodo Project
- Software simulator
- FPGA prototype
- Real-time Java system
- ASIC
- Middleware for distributed embedded systems
Conclusions on Multithreading in Real-time Environments
Multithreaded processor cores:
- Performance gain due to fast context switching (for hard real-time) and latency hiding (for soft and non-real-time)
- More efficient event handling by interrupt service threads (ISTs)
- Helper threads possible (garbage collection, debugging)
Real-time scheduling in hardware:
- Software overhead for real-time scheduling removed
- More efficient power-saving mechanisms possible
- Better predictability by isolation of threads (GP scheduling)
Conclusions & Research Opportunities
Multithreading proves advantageous:
- Latency hiding: speed-ups of 2-3 for SMT; lots of research done; next generation of microprocessors
- Power reduction: 22% savings reported; not much research up to now
- Fast context switching utilized by microcontrollers for real-time systems; not much research up to now
Research opportunities:
- Scheduling in SMT, network processors and multithreaded real-time systems
- Thread speculation: how to speed up single-threaded programs?
- Multithreading and power consumption
- Multithreading in other communities: microcontrollers, SoCs
- System software based on helper threads
Acknowledgements
SMT multimedia research group: Uli Sigmund and Heiko Oehring
Complexity estimation group: Marc Steinhaus, Reiner Kolla, Josep L. Larriba-Pey, Mateo Valero
Komodo project group: Jochen Kreuzinger, Matthias Pfeffer, Sascha Uhrig, Uwe Brinkschulte, Florentin Picioroaga, Etienne Schneider
Microprocessors: Technology Prognosis up to 2012
SIA (Semiconductor Industry Association) forecast of 1997:
Research Directions?
Increase performance of a single thread of control by more instruction-level speculation:
- Better branch prediction
- Trace cache and next-trace prediction
- Data dependence and value prediction
Increase throughput of a workload of multiple threads; utilize thread-level and instruction-level parallelism:
- Chip multiprocessors
- Multithreading (hardware thread = thread or process)
- Thread speculation