Exploiting Dynamic Phase Distance Mapping for Phase-based Tuning of Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.

Slides:



Advertisements
Similar presentations
1 Fast Configurable-Cache Tuning with a Unified Second-Level Cache Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering University.
Advertisements

Tuning of Loop Cache Architectures to Programs in Embedded System Design Susan Cotterell and Frank Vahid Department of Computer Science and Engineering.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.
Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.
1 A Self-Tuning Configurable Cache Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering University of California, Riverside.
Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter.
Project Proposal Presented by Michael Kazecki. Outline Background –Algorithms Goals Ideas Proposal –Introduction –Motivation –Implementation.
Dynamic Scheduling for Reduced Energy in Configuration-Subsetted Heterogeneous Multicore Systems + Also Affiliated with NSF Center for High- Performance.
Software Architecture of High Efficiency Video Coding for Many-Core Systems with Power- Efficient Workload Balancing Muhammad Usman Karim Khan, Muhammad.
Embedded Software Optimization for MP3 Decoder Implemented on RISC Core Yingbiao Yao, Qingdong Yao, Peng Liu, Zhibin Xiao Zhejiang University Information.
Green Governors: A Framework for Continuously Adaptive DVFS Vasileios Spiliopoulos, Stefanos Kaxiras Uppsala University, Sweden.
Chuanjun Zhang, UC Riverside 1 Low Static-Power Frequent-Value Data Caches Chuanjun Zhang*, Jun Yang, and Frank Vahid** *Dept. of Electrical Engineering.
Power Savings in Embedded Processors through Decode Filter Cache Weiyu Tang, Rajesh Gupta, Alex Nicolau.
A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang, Vahid F., Lysecky R. Proceedings of Design, Automation and Test in Europe Conference.
Performance and Energy Bounds for Multimedia Applications on Dual-processor Power-aware SoC Platforms Weng-Fai WONG 黄荣辉 Dept. of Computer Science National.
Improving the Efficiency of Memory Partitioning by Address Clustering Alberto MaciiEnrico MaciiMassimo Poncino Proceedings of the Design,Automation and.
Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.
A Highly Configurable Cache Architecture for Embedded Systems Chuanjun Zhang, Frank Vahid and Walid Najjar University of California, Riverside ISCA 2003.
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
1 Energy-efficiency potential of a phase-based cache resizing scheme for embedded systems G. Pokam and F. Bodin.
A One-Shot Configurable- Cache Tuner for Improved Energy and Performance Ann Gordon-Ross 1, Pablo Viana 2, Frank Vahid 1, Walid Najjar 1, and Edna Barros.
Automatic Tuning of Two-Level Caches to Embedded Applications Ann Gordon-Ross and Frank Vahid* Department of Computer Science and Engineering University.
A Characterization of Processor Performance in the VAX-11/780 From the ISCA Proceedings 1984 Emer & Clark.
1 Hot Caches, Cool Techniques: Online Tuning of Highly Configurable Caches for Reduced Energy Consumption Ann Gordon-Ross Department of Computer Science.
Exploring the Tradeoffs of Configurability and Heterogeneity in Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.
Managing Multi-Configuration Hardware via Dynamic Working Set Analysis By Ashutosh S.Dhodapkar and James E.Smith Presented by Kyriakos Yioutanis.
CPACT – The Conditional Parameter Adjustment Cache Tuner for Dual-Core Architectures + Also Affiliated with NSF Center for High- Performance Reconfigurable.
Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.
HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.
Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.
1 of 20 Phase-based Cache Reconfiguration for a Highly-Configurable Two-Level Cache Hierarchy This work was supported by the U.S. National Science Foundation.
ACSAC’04 Choice Predictor for Free Mongkol Ekpanyapong Pinar Korkmaz Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Institute.
1 Distributed Energy-Efficient Scheduling for Data-Intensive Applications with Deadline Constraints on Data Grids Cong Liu and Xiao Qin Auburn University.
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
Design Space Exploration for Application Specific FPGAs in System-on-a-Chip Designs Mark Hammerquist, Roman Lysecky Department of Electrical and Computer.
A Single-Pass Cache Simulation Methodology for Two-level Unified Caches + Also affiliated with NSF Center for High-Performance Reconfigurable Computing.
A S ELF -T UNING C ACHE ARCHITECTURE FOR E MBEDDED S YSTEMS Chuanjun Zhang, Frank Vahid and Roman Lysecky Presented by: Wei Zang Mar. 29, 2010.
Houman Homayoun, Sudeep Pasricha, Mohammad Makhzan, Alex Veidenbaum Center for Embedded Computer Systems, University of California, Irvine,
Investigating Adaptive Compilation using the MIPSpro Compiler Keith D. Cooper Todd Waterman Department of Computer Science Rice University Houston, TX.
Dynamic Voltage Frequency Scaling for Multi-tasking Systems Using Online Learning Gaurav DhimanTajana Simunic Rosing Department of Computer Science and.
A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki.
Dynamic Phase-based Tuning for Embedded Systems Using Phase Distance Mapping + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing.
Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.
1 of 21 Online Algorithms for Wireless Sensor Networks Dynamic Optimization Arslan Munir 1, Ann Gordon-Ross 2+, Susan Lysecky 3, and Roman Lysecky 3 1.
Analysis of Cache Tuner Architectural Layouts for Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing.
Thermal-aware Phase-based Tuning of Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing This work was supported.
Minimum Effort Design Space Subsetting for Configurable Caches + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing This work.
BarrierWatch: Characterizing Multithreaded Workloads across and within Program-Defined Epochs Socrates Demetriades and Sangyeun Cho Computer Frontiers.
Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.
1 of 20 Low Power and Dynamic Optimization Techniques for Power-Constrained Domains Ann Gordon-Ross Department of Electrical and Computer Engineering University.
Lightweight Runtime Control Flow Analysis for Adaptive Loop Caching + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing Marisha.
Cache Miss-Aware Dynamic Stack Allocation Authors: S. Jang. et al. Conference: International Symposium on Circuits and Systems (ISCAS), 2007 Presenter:
The Effect of Data-Reuse Transformations on Multimedia Applications for Application Specific Processors N. Vassiliadis, A. Chormoviti, N. Kavvadias, S.
Department of Electrical & Computer Engineering
Anne Pratoomtong ECE734, Spring2002
Fine-Grain CAM-Tag Cache Resizing Using Miss Tags
Tosiron Adegbija and Ann Gordon-Ross+
Ann Gordon-Ross and Frank Vahid*
Phase Capture and Prediction with Applications
Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle
Tosiron Adegbija and Ann Gordon-Ross+
A Self-Tuning Configurable Cache
Realizing Closed-loop, Online Tuning and Control for Configurable-Cache Embedded Systems: Progress and Challenges Islam S. Badreldin*, Ann Gordon-Ross*,
Automatic Tuning of Two-Level Caches to Embedded Applications
Phase based adaptive Branch predictor: Seeing the forest for the trees
Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project
Presentation transcript:

Exploiting Dynamic Phase Distance Mapping for Phase-based Tuning of Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing This work was supported by National Science Foundation (NSF) grant CNS Tosiron Adegbija and Ann Gordon-Ross + Department of Electrical and Computer Engineering University of Florida, Gainesville, Florida, USA

Introduction and Motivation Embedded systems are pervasive and have stringent design constraints –Constraints: Energy, size, real time, cost, etc. System optimization is challenging due to numerous tunable parameters –Tunable parameters: parameters that can be changed E.g., cache size, associativity, line size, clock frequency, etc. –Many combinations  large design space –Multicore systems result in exponential increase in design space –Tradeoff design constraints Design constraints (e.g., energy/performance) Many Pareto optimal designs 2 Energy Execution time Pareto optimal designs What is the best design?

Parameter Tuning Parameter tuning determines appropriate parameter values –Appropriate parameter values meet optimization goals and satisfy design constraints –Different applications have different parameter value requirements Parameter values must match the application’s requirements –Wastes up to 62% average energy (Gordon-Ross ‘05) if inappropriate –Parameter tuning specializes tunable parameters to the changing application requirements to meet optimization goals (e.g., lowest energy, best performance) –Configuration: specific combination of parameter values Best configuration: configuration that most closely meets optimization goals Parameter tuning is challenging –Large design space –Tuning overhead: energy/time consumed during tuning Our contribution –Simplify parameter tuning, reduce tuning overhead 3

4 Parameter Tuning – Cache Tuning Caches account for up to 50% of embedded system’s power. –Caches are a good candidate for energy optimization Requires configurable cache 2KB 8 KB, 4-way base cache 2KB 8 KB, 2-way 2KB 8 KB, direct- mapped Way concatenation 2KB 4 KB, 2-way 2KB 2 KB, direct- mapped Way shutdown Configurable Line size 16 byte physical line size A Highly Configurable Cache (Zhang ‘03) Tunable Asociativity Tunable Size Tunable Line Size

5 Dynamic Tuning Energy Executing in base configuration Tunable cache Tuning hardware TC Cache Tuning TC Download application Microprocessor Cache energy savings of 62% on average! (Gordon-Ross ‘05) Dynamic tuning changes system parameter values at runtime –Dynamically determines best configuration with respect to optimization goals (e.g., lowest energy/best performance) –Requires tunable hardware and tuning hardware (tuner) Tuner evaluates different configurations to determine the best configuration Lowest energy Execution time

6 Dynamic Tuning Advantages –Adapts to the runtime operating environment and input stimuli –Specializes configurations to executing applications Disadvantages –Large design space Challenging to determine the best configuration –Runtime tuning overhead (e.g., energy, power, performance) Energy consumed and additional execution time incurred during tuning Tunabl e cache Tuning hardware TC Download application Microprocessor

7 Phase-based Tuning Time varying behavior for IPC, level one data cache hits, branch predictor hits, and power consumption for SPEC2000 gcc (using the integrate input set) Base energy Application-tuned Time Energy Consumption Phase-tuned Change configuration Greater savings when tuning is phase- based, rather than application-based Applications have dynamic requirements during execution –Different phases of execution Tune when the phase changes, rather than when the application changes –Phase = a length of execution where application characteristics are relatively stable Characteristics: cache misses, branch mispredictions, etc

Phase-based Tuning Phase classification 8 Application x Conf1Conf2 Conf3 Conf4Conf5 Profiler How do we determine the best configuration for the different phases? Fixed length intervalsVariable length intervals

Previous Tuning Methods 9 Possible Cache Configurations Energy Heuristic method Execution time Energy Execution time Energy Possible Cache Configurations Energy Analytical method Execution time Energy Possible Cache Configurations Energy Exhaustive method A lot of tuning overhead Less tuning overhead No tuning overhead but computationally complex/not dynamic Previous work proposed Phase Distance Mapping (PDM): a low-overhead and dynamic analytical method.

10 Base phasePhase P i Cache characteristics Cache configuration PbPb CbCb Cache characteristics Cache configuration PiPi New phase d (P b, P i ) Phase distance Configuration distance CiCi ?? Phase Distance Mapping (PDM) Previously characterized phase Distance windows Phase distance ranges Configuration change from C b

PDM’s limitations 11 Phase Distance Mapping (PDM) Distance windows Statically defined Design time overhead!!! Needed to know/analyze applications a priori Unknown applications/ general purpose systems? Base phase Most prominent application domain

Phase Tuning Architecture 12 Main Memory Processor core 1 Data cache Instruction cache L1 Processor core 2 Data cache Instruction cache L1 Phase history table PDM module Distance window table Phase classification module Tuner Phase characterization hardware On-chip components Groups similar intervals into phases Changes the tunable hardware and evaluates each configuration Stores phase characteristics and configurations Determines a new phase’s best configuration Phase tuning architecture consists of phase characterization hardware –Phase characterization determines phase’s best configuration –Phase characterization hardware orchestrates PDM

Contributions We present DynaPDM: dynamic phase distance mapping –Alleviates design time effort Shorter time-to-market –Maximizes energy delay product (EDP) savings Defines distance windows during runtime Dynamically designates base phase and calculates configuration distances Specializes distance windows to dynamic system and application behavior –Low overhead, dynamic method for determining phase’s best configuration DynaPDM evaluation –DynaPDM effectively determines best configuration with minimal designer effort –Applicable to general purpose embedded systems with disparate unknown applications (e.g., smartphones, tablets, etc.) DynaPDM compared to PDM –Quantify DynaPDM’s EDP improvement over PDM Achieves higher EDP savings than PDM without extensive a priori application analysis 13

DynaPDM: Dynamic Phase Distance Mapping 14

Phase Characterization 15 Phase classification Phases/ phase characteristics Phase P 1 executed PhaseConfiguration Phase history table C P1 ? x x DynaPDM New phase, P 1 P 1 configuration, C P1 P1P1 P2P2 P3P3 P4P4 x x Base phase, P b PbPb P1P1 P2P2 P3P3 P4P4 C Pb x P1P1 PbPb DynaPDM is part of phase characterization –Determines a phase, P i ’s best configuration C P1 Used for comparison. Best configuration determined a priori or at runtime.

16 DynaPDM New phase, P 1 Distance window (DW)Configuration distance (CD) DW1CD1 DW2CD2 d (P b, P 1 ) Calculate ConfigP i Distance window table PhaseConfiguration Phase history table P1P1 ConfigP 1 Execute P 1 in ConfigP 1 Create new DW Initialize DW DW3 CD3

DynaPDM Creating new distance windows –Windows defined by upper bound Win U and lower bound Win L Phase distance D maps to a distance window if: Win L < D < Win U Distance window size S d = Win U – Win L –Empirically determined S d = 0.5 as generally suitable for embedded processor applications If D < S d, Win L = 0, and Win U = S d –WinU max determines maximum number of new distance windows If D > WinU max, D maps to WinU max < D < ∞ Maximum number of distance windows = WinU max /S d –If S d < D < WinU max, Win L = x | x ≤ D, x mod S d = 0, x + S d > D 17

DynaPDM Initializing distance windows (DW) 18 New DW? Set most recently used configuration as first phase’s initial configuration Tune cache parameters Execution j ≤ n? Tune cache size Tune associativity Tune line size Continue tuning Stop tuning; Store lowest EDP configuration No Yes Power-of-two increments until EDP increases Defines DW’s configuration distance  n allows limited number of phase executions to hone configurations closer to the optimal  trades off improved configuration efficacy for tuning overhead  depends on system’s application/phase persistence

Experimental Results 19

Experimental Setup Experiments modeled closely with PDM Design space –Level 1 (L1) instruction and data caches: cache size (2kB  8kB); line size (16B  64B); associativity (direct-mapped  4-way) –Base cache configuration: L1 instruction and data caches: cache size (8kB); line size (64B); associativity (4-way) 16 workloads from EEMBC Multibench Suite –Image processing, MD5 checksum, networking, Huffman decoding –Each workload represented a phase Simulations –GEM5 generated cache miss rates –McPAT calculated power consumption Energy delay product (EDP) as evaluation metric = system_power * (total_phase_cycles/system_frequency) 2 20

Workloads Optimal configurations determined by exhaustive search –Used to compare DynaPDM and PDM results Perl scripts implemented DynaPDM and PDM Experimental Setup Application TypePhasesNumber of phases Image processing1, 5, 13, 14, 15, 166 Networking2, 3, 4, 8, 95 MD5 Checksum10, 11, 123 Huffman Decoding71 Empty61 21 –Distribution of application phases used in experimental setup

Results -Used rotate-16x4Ms32w8 as base phase -EDP savings calculated with respect to base configuration -DynaPDM achieved 28% average EDP savings overall -Savings as high as 47% for 64M-rotatew2 -On average, within 1% of optimal -EDP improved over PDM by 8% 22 Base phase 28% DynaPDM improved over PDM and eliminated design-time effort! 47% DynaPDM’s configurations 1% of the optimal!

Results -Effects of n = 3 on EDP normalized to optimal cache configurations -n = maximum number of phase executions 23 DynaPDM determined optimal configurations Number of phase executions Determined optimal configurations in ≤ 3 executions!

Results 24 -Tradeoffs of S d with percentage of phases tuned at runtime -Tradeoffs of S d with percentage EDP savings Larger S d Number of distance windows/ tuned phases/ tuning overhead Smaller S d Number of distance windows/ tuned phases/ tuning overhead Larger S d EDP savings S d = 0.5 Balance between tuning overhead and EDP savings

Conclusions DynaPDM: Dynamic Phase Distance Mapping –Correlates known phase’s characteristics and best configuration with a new phase’s characteristics to determine new phase’s best configuration Reduces tuning overhead –Compared DynaPDM with most relevant prior work: PDM PDM required a priori knowledge of applications and extensive design-time effort –Not suitable for large or general purpose systems DynaPDM eliminated design-time effort –DynaPDM achieved EDP savings of 28% and determined configurations within 1% of the optimal: 8% improvement over PDM Future work –Evaluate DynaPDM’s scalability to many-core systems –Explore more complex systems with additional tunable parameters 25

Questions? 26