Applying Control Theory to the Caches of Multiprocessors
Kai Ma, Department of EECS, University of Tennessee, Knoxville



2 Applying Control Theory to the Caches of Multiprocessors  The shared L2 cache is one of the most important on-chip shared resources.  Largest area and leakage power consumer  One of the dominant factors in overall performance  Two Papers:  Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors  SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors

Relative Cache Latency Control for Performance Differentiations in Power- Constrained Chip Multiprocessors Department of EECS University of Tennessee, Knoxville Xiaorui Wang, Kai Ma, Yefu Wang

4 Background NUCA (Non-Uniform Cache Architecture) Key idea: Different cache banks have different access latencies.

5 Introduction  The power consumption of the cache needs to be constrained.  Under a power constraint, the performance of the caches also needs to be guaranteed.  Why control relative latency (the ratio between the average cache access latencies of two threads)?  1. Accelerate critical threads 2. Reduce priority inversion

6 System Design (block diagram) Threads 0-3 run on cores 0-3. A Latency Monitor and a Relative Latency Controller per thread form the Relative Latency Control Loop; a Power Monitor and a Power Controller form the Power Control Loop. Both loops actuate the shared L2 cache through the Cache Resizing and Partitioning Modulator, which assigns active cache banks to threads and leaves the remaining banks inactive.

7 Relative Latency Controller (RLC) A PI (Proportional-Integral) controller compares the relative latency set point against the measured relative latency (RL) of the shared L2 caches and outputs a new cache size ratio.  System modeling  Controller design  Control analysis (Diagram example: set point 1.5, measured RL 1.2, error 0.3, cache ratio increased by 0.2.) The controller handles workload variation and total cache size variation.
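The PI computation this slide illustrates can be sketched in a few lines of Python. The gains kp and ki below are illustrative placeholders, not the values designed in the paper, which derives its gains via control-theoretic analysis:

```python
class PIController:
    """Minimal discrete-time PI controller, sketched after the RLC.

    The gains kp/ki here are hypothetical; the paper designs its
    gains analytically (Root Locus), not by ad hoc choice.
    """

    def __init__(self, kp, ki, set_point):
        self.kp = kp
        self.ki = ki
        self.set_point = set_point
        self.integral = 0.0  # running sum of past errors

    def update(self, measured_rl):
        """Return the adjustment to the cache size ratio."""
        error = self.set_point - measured_rl
        self.integral += error
        return self.kp * error + self.ki * self.integral


# Example mirroring the slide's numbers: set point 1.5, measured RL 1.2
rlc = PIController(kp=0.5, ki=0.1, set_point=1.5)
delta_ratio = rlc.update(1.2)  # error is 0.3; a positive output raises the ratio
```

With these placeholder gains the first output is 0.5 * 0.3 + 0.1 * 0.3 = 0.18; the integral term keeps accumulating error so the ratio continues to move until the set point is reached.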

8 Relative Latency Model  RL is the relative latency between a pair of cores  r is the cache size ratio between the same pair of cores  RL model  System identification  Model orders  Parameters (Table: model orders and identification error)
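The system identification step can be illustrated with ordinary least squares. The first-order model structure and all symbols below are assumptions for illustration; the paper selects its actual model orders from the identification error:

```python
import numpy as np


def identify_first_order(rl, r):
    """Fit rl(k) = a1*rl(k-1) + b1*r(k-1) by least squares.

    rl: measured relative latency sequence
    r:  applied cache size ratio sequence
    The first-order structure is an assumption for illustration.
    """
    X = np.column_stack([rl[:-1], r[:-1]])  # regressors at step k-1
    y = rl[1:]                              # outputs at step k
    (a1, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
    return a1, b1


# Synthetic data generated from a known model to sanity-check the fit
rng = np.random.default_rng(0)
r = rng.uniform(0.5, 2.0, 100)
rl = np.empty(100)
rl[0] = 1.0
for k in range(1, 100):
    rl[k] = 0.6 * rl[k - 1] + 0.4 * r[k - 1]

a1, b1 = identify_first_order(rl, r)  # recovers 0.6 and 0.4 on noiseless data
```

In practice the measured sequences are noisy, and the model order is chosen by comparing identification error across candidate orders, as the slide's table suggests.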

9 Controller Design  PI controller  Proportional term  Integral term  Design method: Root Locus (Block diagram: the error between the relative latency set point and the measured relative latency drives the controller, which issues a new cache ratio to the shared L2 caches.)

10 Control Analysis  Derive the transfer function of the controller  Derive the transfer function of the system under model variations  Derive the transfer function of the closed-loop system and compute its poles The control period of the power control loop is selected to be longer than the settling time of the relative latency control loop. Stability range:
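As a sketch of what such an analysis looks like (assuming a generic first-order plant and a standard discrete PI law; the symbols are illustrative, not the paper's), the closed-loop poles come from:

```latex
% Assumed first-order plant (relative latency vs. cache ratio) and PI law
G(z) = \frac{b_1}{z - a_1}, \qquad
C(z) = K_p + \frac{K_i\, z}{z - 1} = \frac{(K_p + K_i)\, z - K_p}{z - 1}

% Closed-loop characteristic equation from 1 + C(z)\,G(z) = 0:
(z - a_1)(z - 1) + b_1\!\left[(K_p + K_i)\, z - K_p\right] = 0
```

Stability requires both roots to lie strictly inside the unit circle; sweeping the model parameters through their variation range yields a stability range for the gains, which is the kind of guarantee the slide refers to.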

11 Power Controller  System Model: the cache power in a power control period is a linear function of the total cache size in that period, with application-dependent parameters  Leakage power is proportional to the cache size.  Leakage power accounts for the largest portion of cache power.  PI Controller  Controller analysis
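Under the slide's linear power model (cache power roughly proportional to cache size), one period of the outer power loop can be sketched as below. The gains, the slope `a`, and the bank limits are hypothetical values for illustration:

```python
def power_control_step(power_budget, measured_power, cache_size, state,
                       kp=0.5, ki=0.1, a=0.2,
                       min_size=1, max_size=32):
    """One period of the outer power control loop (sketch).

    Assumes the linear model P = a*C + b from the slide, so the PI
    output in watts is converted to a cache-size change by dividing
    by the slope a. All numeric parameters here are illustrative.
    """
    error = power_budget - measured_power            # watts
    state['integral'] = state.get('integral', 0.0) + error
    delta_watts = kp * error + ki * state['integral']
    new_size = cache_size + delta_watts / a          # banks
    # Saturate to the physically available cache banks
    return max(min_size, min(max_size, round(new_size)))
```

For example, if measured power exceeds the budget by 2 W, the controller shrinks the cache; the integral term then keeps nudging the size until the budget is met. The resulting total size is what the relative latency loop partitions among threads.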

12 Simulation  Simulator  SimpleScalar with a NUCA cache (Alpha-like core)  Power readings  Dynamic part: Wattch (with CACTI)  Leakage part: HotLeakage  Workload  Selected workloads from SPEC2000  Actuator  Cache bank resizing and partitioning

13 Single Control Evaluation (Plots: RLC set point change; power controller set point change; workload switch, marked where the workloads are switched; total cache bank count change.)

14 Relative Latency & IPC

15 Coordination Cache access latencies and IPC values of the four threads on the four cores of the CMP. Cache access latencies and IPC values of the two threads on Core 0 and Core 1 for different benchmarks.

16 Conclusions  Relative Cache Latency Control for Performance Differentiations in Power-Constrained Chip Multiprocessors  Simultaneously control power and relative latency  Achieve desired performance differentiation  Theoretically analyze single-loop control and coordinated-system stability

SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors Shekhar Srikantaiah, Mahmut Kandemir, *Qian Wang Department of CSE *Department of MNE The Pennsylvania State University

18 Introduction  Lack of control over shared on-chip resources  Degraded performance isolation  Lack of Quality of Service (QoS) guarantees  It is challenging to achieve high utilization while guaranteeing QoS.  Static/dynamic resource reservations may lead to low resource utilization.  Existing heuristic adjustments cannot provide theoretical guarantees such as a settling time or a stability range.

19 Contribution  Two-layer, control-theory-based SHARP (SHAred Resource Partitioning) architecture  Propose an empirical model  Design a customized application controller (Reinforced Oscillation Resistant controller)  Study two policies that can be used in SHARP  SD (Service Differentiation)  FSI (Fair Speedup Improvement)

20 System Design

21 Why not PID?  Disadvantages of a PID (Proportional-Integral-Derivative) controller  Painstaking to tune the parameters  Hard to integrate into a hierarchical architecture  Sensitive to model variation at run time  Static parameters  Generic controller (not problem-specific)  Based on a linear model

22 Application Controller

23 Pre-Actuation Negotiator (PAN)  Map an over-demanded cache partition to a feasible partition  Policies:  SD (Service Differentiation)  FSI (Fair Speedup Improvement)
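A minimal way to picture the negotiation is proportional scaling, as sketched below. This is an illustrative stand-in only; the paper's actual SD and FSI mapping rules are richer than this:

```python
def negotiate_partition(requested, total_ways):
    """Map an over-demanded per-application cache-way request to a
    feasible partition by proportional scaling (illustrative stand-in
    for PAN; not the paper's SD/FSI rules).
    """
    demand = sum(requested)
    if demand <= total_ways:
        return list(requested)          # already feasible
    # Floor of each proportional share; the sum never exceeds total_ways
    scaled = [w * total_ways // demand for w in requested]
    # Hand back the ways lost to rounding, largest request first
    leftover = total_ways - sum(scaled)
    for i in sorted(range(len(requested)), key=lambda i: -requested[i]):
        if leftover == 0:
            break
        scaled[i] += 1
        leftover -= 1
    return scaled
```

For instance, three applications requesting 8, 4, and 4 ways of an 8-way cache would be mapped to 4, 2, and 2 ways; a policy like SD would instead weight the scaling by each application's service class.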

24 SHARP Controller  Increase IPC set points when cache ways are underutilized  FSI & SD policies  Proof of guaranteed optimal utilization
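The set-point adjustment can be pictured as below. The uniform 5% step is a hypothetical choice for illustration; the paper's SHARP controller follows its FSI/SD policies when raising targets:

```python
def adapt_set_points(ipc_targets, ways_allocated, total_ways, step=0.05):
    """Sketch of the SHARP-controller idea: when the current partitions
    leave cache ways unused, raise every IPC set point so that the
    per-application controllers demand more cache. The 5% step is an
    illustrative choice, not the paper's rule.
    """
    if sum(ways_allocated) < total_ways:
        return [t * (1 + step) for t in ipc_targets]
    return list(ipc_targets)
```

Raising the set points makes the application controllers request larger partitions in the next period, driving allocated ways toward the total and thus toward full utilization.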

25 Experimental Setup  Simulator: Simics (full-system simulator)  Operating System: Solaris 10  Configuration: 2 and 8 cores  Workload:  6 mixes of applications selected from SPEC2000

26 Evaluation (Application Controller) Long-run results of the PID controller and the ROR controller

27 Evaluation (FSI) SHARP vs Baselines

28 Evaluation (SD) Adaptation of IPC with the SD policy using the ROR controllers.

29 Sensitivity & Scalability (Plots: sensitivity analysis for different reference points; scalability with 8 cores.)

30 Conclusion  SHARP Control: Controlled Shared Cache Management in Chip Multiprocessors  Propose and design the SHARP control architecture for shared L2 caches  Validate SHARP with different management policies (FSI or SD)  Achieve the desired FS and SD specifications

31 Critiques (1)  How should the relative latency set point be decided?  For the purpose of accelerating critical threads, parallel workloads may be more applicable.

32 Critiques (2)  No stability proof  Insufficient description of how the parameters of the application controllers are updated

33 Comparison (relative latency control with the power constraint vs. the SHARP control architecture)  Goal: guarantee NUCA L2 cache relative latency under different power budgets vs. improve normal L2 cache utilization while guaranteeing the QoS metrics  Design: two-layer hierarchical design (both)  Controller: PI vs. ROR  Coordination & stability analysis: yes vs. no  Actuator: cache bank resizing and partitioning vs. cache way resizing and partitioning  Evaluation: SimpleScalar vs. Simics

34 Q & A  Thank you

35 Backup Slides Start

36 Relative Controller Evaluation (2)

37 Application Controller Evaluation (2)

38 Guaranteed Optimal Utilization Proof  The model coefficients are time-varying and application-dependent

39 System Design