Opportunities for Hardware Multithreading in Microprocessors and Microcontrollers
Theo Ungerer
Systems and Networking, University of Augsburg
ungerer@informatik.uni-augsburg.de
http://www.informatik.uni-augsburg.de/sik/
Basic Principle of Multithreading
[Figure: four register sets (1-4), each with its own PC and PSR, holding the contexts of threads 1-4; a thread pointer selects the active register set.]
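The register-set organization on the slide can be sketched in software. This is a hedged illustration of the principle only; the class and field names are hypothetical and do not describe any particular processor's design:

```python
# Each hardware thread keeps its own register set, program counter (PC)
# and processor status register (PSR); a thread pointer selects the
# active set, so a context switch is just a pointer update.

class ThreadContext:
    def __init__(self, tid):
        self.tid = tid
        self.pc = 0                # per-thread program counter
        self.psr = 0               # per-thread processor status register
        self.registers = [0] * 32  # private register set

class MultithreadedCore:
    def __init__(self, num_threads=4):
        self.contexts = [ThreadContext(t) for t in range(num_threads)]
        self.thread_pointer = 0    # selects the active register set

    def switch_to(self, tid):
        # No register save/restore needed: just redirect the pointer.
        self.thread_pointer = tid

    @property
    def active(self):
        return self.contexts[self.thread_pointer]

core = MultithreadedCore()
core.active.pc = 100   # thread 0 runs
core.switch_to(2)      # zero-cost switch to thread 2
core.active.pc = 500
core.switch_to(0)
print(core.active.pc)  # thread 0's PC is preserved: 100
```

Because no state is copied on a switch, the switch cost in hardware can be zero cycles, which is the property the Komodo approach later exploits.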
Multithreading in High-Performance Processors
Hardware multithreading is the ability to pursue more than one thread within a processor pipeline.
Typical features: multiple register sets, fast context switching
Main objective: performance gain by latency hiding for multithreaded workloads
Multithreading in high-performance microprocessors:
- IBM RS64 IV (SStar)
- Sun UltraSPARC V
- Intel Xeon
Outline of the Presentation
- Motivation
- State-of-the-art Multithreading
- Multithreading for throughput increase
- Multithreading for power reduction
- Multithreading for embedded real-time systems
- Conclusions & Research Opportunities
Today's Multiple-issue Processors
Instruction-level parallelism is exploited by a long instruction pipeline and by the superscalar or the VLIW/EPIC technique.
Problem: Low Resource Utilization by Sequential Programs
[Figure: issue slots over processor cycles. Empty issue slots cause losses: a vertical loss of 4 when a whole cycle issues nothing, and horizontal losses of 1-3 when a cycle is only partly filled.]
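The vertical and horizontal losses from the figure can be quantified with a short sketch. The per-cycle issue counts below are illustrative, not measured data:

```python
# On a 4-wide issue processor, a cycle that issues nothing loses all
# 4 slots (vertical loss); a partly filled cycle loses the remainder
# (horizontal loss).

ISSUE_WIDTH = 4
# Hypothetical instructions issued per cycle for a single-threaded run:
issued_per_cycle = [2, 4, 0, 3, 1, 4, 0, 2]

vertical_loss = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
horizontal_loss = sum(ISSUE_WIDTH - n
                      for n in issued_per_cycle if 0 < n < ISSUE_WIDTH)
total_slots = ISSUE_WIDTH * len(issued_per_cycle)
utilization = sum(issued_per_cycle) / total_slots

print(vertical_loss, horizontal_loss, utilization)  # 8 8 0.5
```

Even this mild example wastes half of all issue slots, which is the resource that multithreading tries to reclaim.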
Multithreading
Two basic multithreading techniques:
- Interleaved multithreading
- Block multithreading
Simultaneous multithreading (SMT) combines a wide-issue superscalar with multithreading and issues instructions from several threads simultaneously.
Basic Multithreading Techniques
[Figure: issue-slot diagrams comparing single-threaded execution, interleaved MT and block MT.]
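The difference between the two basic techniques can be modeled as which thread owns each pipeline cycle. This is an illustrative sketch with hypothetical names; SMT, by contrast, would let several threads share the slots of one and the same cycle:

```python
def interleaved_mt(threads, cycles):
    # Interleaved MT: a different thread every cycle, round-robin.
    return [threads[c % len(threads)] for c in range(cycles)]

def block_mt(threads, stall_after, cycles):
    # Block MT: one thread runs until a long-latency event (modeled as
    # occurring every `stall_after` cycles, e.g. a cache miss), then the
    # processor switches to the next thread.
    schedule, t, run = [], 0, 0
    for _ in range(cycles):
        schedule.append(threads[t])
        run += 1
        if run == stall_after:
            t, run = (t + 1) % len(threads), 0
    return schedule

print(interleaved_mt(["A", "B"], 4))  # ['A', 'B', 'A', 'B']
print(block_mt(["A", "B"], 3, 6))     # ['A', 'A', 'A', 'B', 'B', 'B']
```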
SMT vs. CMP
[Figure: issue-slot diagrams comparing simultaneous multithreading (SMT) with a chip multiprocessor (CMP).]
Characteristics of Multithreading
- Latency utilization: the latencies that arise in the computation of a single instruction stream are filled by computations of another thread; the throughput of multithreaded workloads is increased.
- Power reduction: by using less speculation.
- Rapid context switching: appropriate for real-time applications.
Multithreading for Throughput Increase
Lots of research results with simulated SMT since 1995.
Some of our own research results:
- Performance estimation of SMT multimedia processor models
- Transistor count and chip-space estimation of the models
Relevant Attributes for Rating Microprocessors
- Performance
- Resource requirement
- Clock speed
- Power consumption
Two tools:
- Performance estimation tool
- Transistor count and chip-space estimation tool
Transistor Count and Chip-space Estimator
Vision: the resources of the baseline model should be adjusted such that the same chip space or the same transistor count is covered as in the new microarchitecture models.
We use an analytical method for memory-based structures like register files or internal queues, and an empirical method for logic blocks like control logic and functional units. The half-feature size serves as the measure of length of a basic cell.
The estimator tool is available (also for SimpleScalar) at:
http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/complexity/
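The analytical idea for memory-based structures can be sketched as follows. All constants here are illustrative placeholders, not the tool's real cell dimensions; the point is only that area is measured in units of the half-feature size and that ports dominate the cell size:

```python
# Area of a memory-based structure (register file, queue) grows with its
# entries, bit width and port count; each extra port adds a word line and
# a bit line to every cell, enlarging both cell dimensions.

def ram_cell_area(read_ports, write_ports, lam=1.0):
    # `lam` is the half-feature size; 6 and 3 are hypothetical constants.
    edge = 6 * lam + 3 * lam * (read_ports + write_ports)
    return edge * edge

def structure_area(entries, bits, read_ports, write_ports):
    return entries * bits * ram_cell_area(read_ports, write_ports)

# A 32-entry, 32-bit register file with 8 read / 4 write ports vs. a
# single-ported file of the same capacity:
multi = structure_area(32, 32, 8, 4)
single = structure_area(32, 32, 1, 1)
print(multi / single)  # 12.25: port count, not capacity, dominates area
```

This is why adding register sets for more threads is comparatively cheap, while widening the issue bandwidth (which adds ports everywhere) is expensive, consistent with the small overheads reported later in the talk.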
Execution-based Simulator: Baseline SMT Multimedia Processor Model
Results of Performance and Hardware Cost Estimation
Demonstrated by two sets of models:
1. "Maximum" processor models with an abundance of resources
2. Small processor models
The workload is an MPEG-2 decoder made multithreaded.
Simulation Parameters
Fixed parameters:
- 1024-entry BTAC; gshare branch predictor (2 K 2-bit counters, 8-bit history, misprediction penalty 5 cycles)
- 4-way set-associative D- and I-caches with 32-byte cache lines
- 32 KB local on-chip RAM
- 64-bit system bus, 4 MB main memory
Varied parameters:
- 8 to 12 execution units
- 256- and 32-entry reservation stations
- 10 to 4 result buses
- Different D-cache sizes; D- and I-caches of 4 MB and 64 KB
Parameters varied with the number of threads:
- 32 32-bit general-purpose registers and 40 rename registers (per thread)
- 32- and 16-entry issue and retirement buffers (per thread)
- Fetch and decode bandwidth scaled with issue bandwidth and number of threads: 1x1 - 8x8
Performance vs. Hardware Cost Estimation: Maximum Processor Models
4 MB I- and D-caches, 6 integer/multimedia units, 2 local load/store units
Transistor Count and Chip Space Estimation of Maximum Processor Models
Small Processor Models
64 KB I- and D-caches, 3 integer/multimedia units, 1 local load/store unit
32-entry reservation stations, 16-entry issue and retirement buffers
4 result buses, 2x4 fetch and decode bandwidth fixed
Transistor Count and Chip Space Estimation of Small Processor Models
Results
4-threaded 8-issue SMT over a single-threaded 8-issue:
                Speedup   Transistor Increase   Chip Space Increase
Maximum model:  3         2%                    9%
Small model:    1.5       9%                    27%
Commercial multithreaded processors:
- Tera, MAJC, Alpha 21464, IBM Blue Gene, Sun UltraSPARC V
- Network processors (Intel IXP, IBM PowerNP, Vitesse IQ2x00, Lextra, ...)
- IBM RS64 IV: two-threaded block MT, reported 5% overhead
- Intel Xeon (hyper-threading): two-threaded SMT, reported 5% overhead
SMT for Reduction of Power Consumption
Observation: mispredictions cost energy. In today's superscalars, ~60% of the fetched and ~30% of the executed instructions are squashed.
Idea: fill issue slots with less speculative instructions of other threads.
Simulations of Seng et al. 2000 show that ~22% less energy is consumed by using a power-aware scheduler.
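One way such a power-aware scheduler can be sketched (a hedged illustration of the idea, not the scheduler of Seng et al.): each cycle, prefer the thread with the fewest unresolved branches in flight, since its instructions are least likely to be squashed and thus least likely to waste fetch and execution energy.

```python
def pick_thread(unresolved_branches):
    # unresolved_branches: {thread_id: count of in-flight branches}.
    # The least speculative thread wins the fetch slot this cycle.
    return min(unresolved_branches, key=unresolved_branches.get)

# Thread 1 has no unresolved branches, so its instructions are certain
# to be on the correct path:
print(pick_thread({0: 3, 1: 0, 2: 2}))  # 1
```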
Multithreading in Embedded Real-time Systems - The Komodo Approach
Observation: multithreading allows a context-switching overhead of zero cycles.
Idea: harness multithreading for embedded real-time systems.
Komodo project: real-time Java based on a multithreaded Java microcontroller
http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/komodo/indexEng.html
Real-time Requirements
- Run-time predictability
- Isolation of the threads
- Programmability
- Real-time scheduling support
- Fast context switching
Hard real-time: a deadline may never be missed.
Soft real-time: a deadline may occasionally be missed.
Komodo Solutions
- Extremely fast context switching by hardware multithreading
- Real-time scheduling in hardware
- Based on a Java processor core
- Predictability of all instruction executions by a careful hardware design
Komodo Microcontroller Pipeline
Komodo Microcontroller Design
Hardware Real-time Scheduling
The real-time scheduler is realized in hardware (by the priority manager); a scheduling decision is taken every clock cycle.
Four different scheduling algorithms are implemented:
- Fixed Priority Preemptive (FPP)
- Earliest Deadline First (EDF)
- Least Laxity First (LLF)
- Guaranteed Percentage (GP)
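Two of the listed schemes can be illustrated with a minimal software model (in Komodo this decision is taken by the priority manager every clock cycle; here it is just a function call, and the thread parameters are invented for the example):

```python
# threads: {tid: (deadline, remaining_cycles_of_work)}

def edf(threads, now):
    # Earliest Deadline First: pick the nearest deadline.
    # (`now` is unused here; kept for symmetry with llf.)
    return min(threads, key=lambda t: threads[t][0])

def llf(threads, now):
    # Least Laxity First: laxity = time to deadline minus remaining work.
    return min(threads, key=lambda t: (threads[t][0] - now) - threads[t][1])

threads = {0: (100, 30), 1: (80, 10), 2: (90, 85)}
print(edf(threads, now=0))  # thread 1: deadline 80 is nearest
print(llf(threads, now=0))  # thread 2: laxity 5 vs. 70 and 70
```

The example is chosen so the two schemes disagree: EDF ignores how much work remains, while LLF favors the thread closest to missing its deadline.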
Guaranteed Percentage Scheme
[Figure: GP scheduling on a conventional vs. a multithreaded processor - context-switch overhead, surplus cycles and deadline violations.]
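The Guaranteed Percentage idea can be sketched as follows (a hedged software model; the interval length, shares and the fixed tie-breaking order are illustrative, not Komodo's actual hardware behavior): over a fixed interval, every thread is granted a configured share of the pipeline cycles, and the scheduler issues only from threads still below their quota.

```python
def gp_schedule(shares, interval=100):
    # shares: {tid: guaranteed percentage of cycles per interval}
    quota = {t: interval * p // 100 for t, p in shares.items()}
    used = {t: 0 for t in shares}
    for _ in range(interval):
        eligible = [t for t in shares if used[t] < quota[t]]
        if not eligible:
            break                 # all quotas met before the interval ends
        used[eligible[0]] += 1    # simple fixed order among eligible threads
    return used

# Every thread receives exactly its guaranteed share of the interval:
print(gp_schedule({0: 50, 1: 30, 2: 20}))  # {0: 50, 1: 30, 2: 20}
```

Because the shares are enforced per interval in hardware, one thread cannot starve another, which gives the isolation property claimed for GP scheduling later in the talk.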
Simulation Results
A thread mix (IC, PID, and FFT) was applied.
Technical Data of the Komodo Prototype
Implementation of the Komodo core pipeline on a Xilinx XCV800 with 800k gates.
ASIC synthesis of the whole microcontroller (0.18 µm technology): 340 MHz, 3 mm² chip.
- Data bit width: 32 bit
- Address space: 19 bit
- Number of threads: 4
- Instruction window size: 8 bytes
- Stack size: 128 entries
- External frequency: 33 MHz
- Internal frequency: 8.25 MHz
- CLBs: 9,200
- Number of gates: 133,000
Chip-Space of Komodo Core Pipeline
Reducing Power Consumption Using Real-time Scheduling in Hardware
Current work. Idea: use the information about thread states and configurations available within the priority manager for a "fine-grained" adaptation of power consumption and performance. Frequency and voltage adjustments in short time intervals are done by hardware.
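A hedged sketch of what such a fine-grained adaptation could look like (the scaling rule and numbers are purely illustrative assumptions, not the project's mechanism): since the priority manager knows how much guaranteed work remains in the current interval, the hardware can run just fast enough to meet the quotas.

```python
def required_frequency(remaining_work_cycles, cycles_left_in_interval, f_max):
    # Scale frequency to the fraction of the remaining interval that is
    # actually needed; voltage could then be lowered accordingly.
    if cycles_left_in_interval == 0:
        return f_max
    load = remaining_work_cycles / cycles_left_in_interval
    return min(f_max, f_max * load)

# Quarter load: 25 guaranteed cycles of work, 100 cycles of interval left.
print(required_frequency(25, 100, 340e6))  # 85000000.0, i.e. 85 MHz
```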
State of the Komodo Project
- Software simulator
- FPGA prototype
- Real-time Java system
- ASIC
- Middleware for distributed embedded systems
Conclusions on Multithreading in Real-time Environments
Multithreaded processor cores:
- Performance gain due to fast context switching (for hard real-time) and latency hiding (for soft and non-real-time)
- More efficient event handling by interrupt service threads (ISTs)
- Helper threads possible (garbage collection, debugging)
Real-time scheduling in hardware:
- Software overhead for real-time scheduling removed
- More efficient power-saving mechanisms possible
- Better predictability by isolation of threads (GP scheduling)
Conclusions & Research Opportunities
Multithreading proves advantageous:
- Latency hiding: speed-ups of 2-3 for SMT; lots of research done; next generation of microprocessors
- Power reduction: 22% savings reported; not much research up to now
- Fast context switching utilized by microcontrollers for real-time systems; not much research up to now
Research opportunities:
- Scheduling in SMT, network processors and multithreaded real-time systems
- Thread speculation: how to speed up single-threaded programs?
- Multithreading and power consumption
- Multithreading in other communities: microcontrollers, SoCs
- System software based on helper threads
Acknowledgements
SMT multimedia research group: Uli Sigmund and Heiko Oehring
Complexity estimation group: Marc Steinhaus, Reiner Kolla, Josep L. Larriba-Pey, Mateo Valero
Komodo project group: Jochen Kreuzinger, Matthias Pfeffer, Sascha Uhrig, Uwe Brinkschulte, Florentin Picioroaga, Etienne Schneider
Microprocessors: Technology Prognosis up to 2012
SIA (Semiconductor Industry Association) forecast of 1997:
Research Directions?
Increase performance of a single thread of control by more instruction-level speculation:
- Better branch prediction
- Trace cache and next-trace prediction
- Data dependence and value prediction
Increase throughput of a workload of multiple threads; utilize thread-level and instruction-level parallelism:
- Chip multiprocessors
- Multithreading (hardware thread = thread or process)
- Thread speculation