Warm-Up Methodology for HW/SW Co-Designed Processors A. Brankovic, K. Stavrou, E. Gibert, A. Gonzalez.

Slides:



Advertisements
Similar presentations
TWO STEP EQUATIONS 1. SOLVE FOR X 2. DO THE ADDITION STEP FIRST
Advertisements

You have been given a mission and a code. Use the code to complete the mission and you will save the world from obliteration…
Set Up Instructions Place a question in each spot indicated Place an answer in each spot indicated Remove this slide Save as a powerpoint slide show.
Greening Backbone Networks Shutting Off Cables in Bundled Links Will Fisher, Martin Suchara, and Jennifer Rexford Princeton University.
Dynamic Power Redistribution in Failure-Prone CMPs Paula Petrica, Jonathan A. Winter * and David H. Albonesi Cornell University *Google, Inc.
Cognitive Radio Communications and Networks: Principles and Practice By A. M. Wyglinski, M. Nekovee, Y. T. Hou (Elsevier, December 2009) 1 Chapter 12 Cross-Layer.
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 3 CPUs.
By D. Fisher Geometric Transformations. Reflection, Rotation, or Translation 1.
ARIN Public Policy Meeting
1 Building a Fast, Virtualized Data Plane with Programmable Hardware Bilal Anwer Nick Feamster.
Towards Automating the Configuration of a Distributed Storage System Lauro B. Costa Matei Ripeanu {lauroc, NetSysLab University of British.
Business Transaction Management Software for Application Coordination 1 Business Processes and Coordination.
and 6.855J Cycle Canceling Algorithm. 2 A minimum cost flow problem , $4 20, $1 20, $2 25, $2 25, $5 20, $6 30, $
Mihai Budiu Microsoft Research – Silicon Valley joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University.
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
0 - 0.
DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
MULTIPLYING MONOMIALS TIMES POLYNOMIALS (DISTRIBUTIVE PROPERTY)
ADDING INTEGERS 1. POS. + POS. = POS. 2. NEG. + NEG. = NEG. 3. POS. + NEG. OR NEG. + POS. SUBTRACT TAKE SIGN OF BIGGER ABSOLUTE VALUE.
SUBTRACTING INTEGERS 1. CHANGE THE SUBTRACTION SIGN TO ADDITION
MULT. INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
Addition Facts
Fast optimal instruction scheduling for single-issue processors with arbitrary latencies Peter van Beek, University of Waterloo Kent Wilken, University.
Who Wants To Be A Millionaire? Decimal Edition Question 1.
ZMQS ZMQS
TM 1 ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors Jeffrey Dean Jamey Hicks Carl Waldspurger William Weihl George.
Re-examining Instruction Reuse in Pre-execution Approaches By Sonya R. Wolff Prof. Ronald D. Barnes June 5, 2011.
1 Lecture 2: Metrics to Evaluate Performance Topics: Benchmark suites, Performance equation, Summarizing performance with AM, GM, HM Video 1: Using AM.
From A to E: Analyzing TPCs OLTP Benchmarks Pınar Tözün Ippokratis Pandis* Cansu Kaynak Djordje Jevdjic Anastasia Ailamaki École Polytechnique Fédérale.
SE 292 (3:0) High Performance Computing L2: Basic Computer Organization R. Govindarajan
Chapter 4 Memory Management Basic memory management Swapping
Compressing Forwarding Tables Ori Rottenstreich (Technion, Israel) Joint work with Marat Radan, Yuval Cassuto, Isaac Keslassy (Technion, Israel) Carmi.
1 Overview Assignment 4: hints Memory management Assignment 3: solution.
Cache and Virtual Memory Replacement Algorithms
Project : Phase 1 Grading Default Statistics (40 points) Values and Charts (30 points) Analyses (10 points) Branch Predictor Statistics (30 points) Values.
A Preliminary Attempt ECEn 670 Semester Project Wei Dang Jacob Frogget Poisson Processes and Maximum Likelihood Estimator for Cache Replacement.
CRUISE: Cache Replacement and Utility-Aware Scheduling
Learning Cache Models by Measurements Jan Reineke joint work with Andreas Abel Uppsala University December 20, 2012.
Bypass and Insertion Algorithms for Exclusive Last-level Caches
1 ICCD 2010 Amsterdam, the Netherlands Rami Sheikh North Carolina State University Mazen Kharbutli Jordan Univ. of Science and Technology Improving Cache.
Making Time-stepped Applications Tick in the Cloud Tao Zou, Guozhang Wang, Marcos Vaz Salles*, David Bindel, Alan Demers, Johannes Gehrke, Walker White.
Squares and Square Root WALK. Solve each problem REVIEW:
Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy Anthony J. YuGuy G.F. Lemieux September 15, 2005.
Increasing the Energy Efficiency of TLS Systems Using Intermediate Checkpointing Salman Khan 1, Nikolas Ioannou 2, Polychronis Xekalakis 3 and Marcelo.
This, that, these, those Number your paper from 1-10.
Addition 1’s to 20.
25 seconds left…...
Test B, 100 Subtraction Facts
Week 1.
Number bonds to 10,
We will resume in: 25 Minutes.
1 Unit 1 Kinematics Chapter 1 Day
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Oct. 23, 2002 Topic: Memory Hierarchy Design (HP3 Ch. 5) (Caches, Main Memory and.
12-Apr-15 Analysis of Algorithms. 2 Time and space To analyze an algorithm means: developing a formula for predicting how fast an algorithm is, based.
Scalable Rule Management for Data Centers Masoud Moshref, Minlan Yu, Abhishek Sharma, Ramesh Govindan 4/3/2013.
New Opportunities for Load Balancing in Network-Wide Intrusion Detection Systems Victor Heorhiadi, Michael K. Reiter, Vyas Sekar UNC Chapel Hill UNC Chapel.
T-SPaCS – A Two-Level Single-Pass Cache Simulation Methodology + Also Affiliated with NSF Center for High- Performance Reconfigurable Computing Wei Zang.
Combining Statistical and Symbolic Simulation Mark Oskin Fred Chong and Matthew Farrens Dept. of Computer Science University of California at Davis.
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
JOP: A Java Optimized Processor for Embedded Real-Time Systems Martin Schöberl.
Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.
Relyzer: Exploiting Application-level Fault Equivalence to Analyze Application Resiliency to Transient Faults Siva Hari 1, Sarita Adve 1, Helia Naeimi.
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects C. Long, L. J. Simonson, W. Liao and L. He EDA Lab, EE Dept.
Sunpyo Hong, Hyesoon Kim
??? ple r B Amulya Sai EDM14b005 What is simple scalar?? Simple scalar is an open source computer architecture simulator developed by Todd.
Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle
rePLay: A Hardware Framework for Dynamic Optimization
Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project
Presentation transcript:

Warm-Up Methodology for HW/SW Co-Designed Processors A. Brankovic, K. Stavrou, E. Gibert, A. Gonzalez

HW/SW co-designed processors (Transmeta-like processors) – Staged Compilation Interpretation, Basic Block Translation, Super-Block OptimizationScope 2 Transparent Optimization Layer Transparent Optimization Layer CODE CACHE I$ HW Guest ISA (CISC - x86) Host ISA (RISC) TOL

Q: How to warm-up the µA state of a HW/SW co-designed processors? 3Scope Requirements (1)LOW ERROR (2)LOW SIMULATION COST (3) GENERALY APPLICABLE (tolerant to different TOL and HW setups) Requirements (1)LOW ERROR (2)LOW SIMULATION COST (3) GENERALY APPLICABLE (tolerant to different TOL and HW setups) WARM-UP collect Warm-Up TOL state + Warm-Up HW state AUTHORITATIVE EXECUTION collect Correct TOL state + Correct HW state Restore Architectural Point L2$ miss ~ 100 cycles Translation & Optimization Overhead TOL Translation & Optimization Overhead TOL L2$ miss TOL ~ 10K cycles TOL ~ 10K cycles TOL overhead creates 2-3 orders of magnitude bigger simulation error TOL overhead creates 2-3 orders of magnitude bigger simulation error time

Contributions – We solved the problem of the warm-up in HW/SW co-designed processors – Our solution is based on Downscaling Promotion Thresholds (TH) Simulation Error: 0.75%, Simulation Cost Reduction: 65X Agenda – Proposed Solution: Downscaling Promotion Thresholds – Simulation methodology – Experimental Results – ConclusionsScope 4

Authoritative simulation collect Beginning of application HW warm-up (wu)TOL authoritative collect HW wuTOL wu collect (AS) Authoritative Simulation - very costly (FASS) Fast Authoritative Simulation - costly NAÏVE – bad error/cost trade-off Short warm-up -> 100% error Existing solutions HW wuTOL wu collect (CP) Compilation Plan - Not generally applicable Depends on Transparent Optimization Layer -TOL Correct TOL state (Transparent Optimization Layer) + Correct HW state X X X X Warm-Up HW state Warm-Up TOL state 5 Restore x86 State Restore x86 State + profiling info & compilation plan time More details in the paper

Downscaling Promotion Thresholds (TH) During the warm-up, the promotion thresholds are downscaled – in order to promote faster the code regions to the right compilation stage HW wuTOL wu collect Restore x86 State Low Error Low Simulation Cost Generally Applicable Original Threshold Warm-Up Length (L WU ) 6 time collect lower threshold(TH WU1 ) L1L2L3 TH WU2 (TH WU2 > TH WU1 ) L WU1 L WU2 (L WU2 > L WU1 ) >Threshold1>Threshold2 InterpretationSB optimizationBB translation Which pair is better ? (TH WU1,L WU1 ) or (TH WU2,L WU2 ) Which pair is better ? (TH WU1,L WU1 ) or (TH WU2,L WU2 )

Contributions – We solved the problem of the warm-up in HW/SW co-designed processors – Our solution is based on Downscaling Promotion Thresholds (TH) Simulation Error: 0.75%, Simulation Cost Reduction: 65X Agenda – Proposed Solution: Downscaling Promotion Thresholds – Simulation methodology – Experimental Results – ConclusionsOutline 7

Simulation Methodology 8 DARCO infrastructure 1 – Transparent Optimization Level - TOL (x86 -> PowerPC) Compilation stages: Interpretation, BB translation, SB optimization Threshold INT->BB =5, Threshold BB->SB =10K – Hardware (PowerPC-like) 2-way issue, in-order, 3 levels memory hierarchy Suites: SPECint2006 and SPECfp2006 Simulate first 5B x86 instructions – 5 samples per application (at 1B, 2B, 3B, 4B, 5B x86 instructions) HW collect TOL 10M 1M [1M,10M, 100M, 500M,1B] 1 DARCO: Infrastructure for Research on HW/SW co-designed Virtual Machines. In AMAS-BT'11,, San Jose, June 4, 2011.

gCPI (guest CPI) error is not enough to be measured Other statistics are important: Number of host instructions, SuperBlock coverage, etc. – In order to guide the researchers error = max(gCPI error, SuperBlock coverage error, #host instruction error) 9 Warm-Up Error Definition There are examples where: gCPI error - accurate SB coverage – inaccurate #host instructions – inaccurate There are examples where: gCPI error - accurate SB coverage – inaccurate #host instructions – inaccurate

Contributions – We solved the problem of the warm-up in HW/SW co-designed processors – Our solution is based on Downscaling Promotion Thresholds (TH) Simulation Error: 0.75%, Simulation Cost Reduction: 65X Agenda – Proposed Solution: Downscaling Promotion Thresholds – Simulation methodology – Experimental Results – ConclusionsOutline 10

Exploring 50 configurations (10 thresholds x 5 warm-up lengths) – TH WU : [10, 20, 50, 100, 200, 500,1000, 2000, 5000,10K] – L WU : [1M,10M, 100M, 500M,1B] TH ORACLE chooses off-line the best pair 11 Downscaling Promotion Thresholds (TH) - ORACLE LOW ERROR, LOW SIMULATION COST TH - ORACLE avg 0.4% Max 4.8% Cost Red.: 90X TH - ORACLE avg 0.4% Max 4.8% Cost Red.: 90X IDEAL SCENARIO (0% error) ( cost reduction) Relative error wrt authoritative simulation [%] Cost reduction wrt authoritative simulation [X] collect Original Threshold Warm-Up Length (L WU ) collect Lower Threshold (TH WU1 ) L WU1 TH WU

Downscaling Promotion Thresholds (TH) - Prediction Model Predict warm-up threshold and warm-up length based on high-level statistics Scaled warmed-up execution - similar behavior like authoritative execution – Based on execution distribution of PCs Algorithm: – record execution distribution exec(PCx) during WU1 - for each PC in collect – find scaling factor - scales the best warm-up to authoritative distribution scaling factor = TH DEF /TH WU – find the best warm-up period (WU1 or WU2) - scaling error is minimal WU1 is better than WU2 collect exec(PCx)=N1 exec(PCx)=N2 exec(PCx)=N 0 WU2 WU1 AUTHORITHATIVE PCs Exec. Counter PCx N N1 N2 AUTH. WU1 WU2 TH 12

Exploring 50 configurations (10 thresholds x 5 warm-up lengths) – TH WU : [10, 20, 50, 100, 200, 500,1000, 2000, 5000,10K] – L WU : [1M,10M, 100M, 500M,1B] TH ORACLE chose off-line the best pair TH MODEL when the algorithm is applied 13 Downscaling Promotion Thresholds (TH) – MODEL vs. ORACLE TH MODEL similar to TH ORACLE TH - MODEL avg 0.75% Max 16% Cost Red.: 65X TH - MODEL avg 0.75% Max 16% Cost Red.: 65X IDEAL SCENARIO (0% error) ( cost reduction) Relative error wrt authoritative simulation [%] Cost reduction wrt authoritative simulation [X]

Different Transparent Optimization Layers (TOL) configurations: – without Optimizations – without Linking Similar for different HW parameters: – L1D$: 2X smaller, 2X bigger – L2U$: 2X smaller, 2X bigger 14 Downscaling Promotion Thresholds (TH) - Different Configurations Model behaves similar for other TOLs

Contributions – We solved the problem of the warm-up in HW/SW co-designed processors – Our solution is based on Downscaling Promotion Thresholds (TH) Simulation Error: 0.75%, Simulation Cost Reduction: 65X Agenda – Proposed Solution: Downscaling Promotion Thresholds – Simulation methodology – Experimental Results – ConclusionsOutline 15

16Conclusions Conclusions – We proved that traditional warm-up cannot be applied for HW/SW co-designed processors More details in the paper – We showed that the error of the warm-up cannot be based on gCPI – We proposed the novel warm-up approach – Downscaling Promotion Thresholds – We proposed the model to predict the optimal warm-up setup for Downscaling Promotion Thresholds technique Average Simulation Error : 0.75% Average Simulation Cost Reduction: 65X

17Questions Thanks! ? Aleksandar Brankovic ? ? ? ? ? ? ? ? ?

18 Backup Slides

Compilation plan: save the plan of the region formation Issues: cases when the region formation depends on µA execution – Control speculation Transparent Optimization Layer (TOL): coverts biased braches into asserts Converting depends on how other regions are built in the code cache – Software controlled power gating – Software prefetcher 19 Compilation Plan - Issues BB1 SuperBlock BB2 BB3 Biased Branch? YES NO Code Cache contents 1 Code Cache contents 2

20 Results: CP (CP vs NAIVE) Reduction in error x!! Limitations. Depends on SW layers NAIVE. CP Average error CP Average error CP. Maximum error 20% CP. Maximum error 20% Average Relative Errors, collect period 10M

21 Results: FASS vs NAIVE NAIVE. SW error 0 HW error 0 FEASIBLE Maximum error high NAIVE. SW error 0 HW error 0 FEASIBLE Maximum error high FASS SW error = 0 HW error 0 NOT FEASIBLE FASS SW error = 0 HW error 0 NOT FEASIBLE New technique needed! HWTOLcoll HWTOLcoll FASS NAIVE 0end