Amphisbaena: Modeling Two Orthogonal Ways to Hunt on Heterogeneous Many-cores
An analytical performance model for boosting performance
Jun Ma, Guihai Yan, Yinhe Han and Xiaowei Li
State Key Laboratory of Computer Architecture, Institute of Computing Technology, C.A.S.
Univ. of Chinese Academy of Sciences

Trends in Cloud Computing
• The increasing computing demands:
  • more massive
  • more diverse
  • high service-level agreements (response time, throughput)
• The computing platform evolving to meet these demands:
  • multicore to manycore
  • homogeneous to heterogeneous

Two Orthogonal Ways to Boost Performance
• Scale-out speedup: exploit many cores for higher thread-level parallelism
• Scale-up speedup: exploit heterogeneous cores for an optimal application-core mapping

Quantifying Scale-out and Scale-up Speedup
• The overall performance speedup of an application factors into the two components: Φ = α (scale-out) × β (scale-up)
• This decomposition indicates how to improve the overall performance of each application
• The question: how to figure out the application-specific scale-out and scale-up speedup?

Amphisbaena: an Analytical Approach to Model Performance
• Amphisbaena, or shortly, Φ
• Models the overall performance speedup coming from the two orthogonal ways: Φ = α × β
• α (scale-out speedup): the ratio of performance on the target multithreading configuration to the current configuration, on the same type of cores
• β (scale-up speedup): the ratio of performance on the target cores to the current cores, under the same multithreading configuration
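A minimal sketch of these two ratios in Python, assuming a hypothetical measurement hook perf(core_type, n_threads) that returns an application's performance; all names here are illustrative, not from the paper.

```python
def alpha(perf, core, cur_threads, tgt_threads):
    """Scale-out speedup: target multithreading configuration vs. the
    current one, on the same type of cores."""
    return perf(core, tgt_threads) / perf(core, cur_threads)

def beta(perf, cur_core, tgt_core, n_threads):
    """Scale-up speedup: target cores vs. current cores, under the
    same multithreading configuration."""
    return perf(tgt_core, n_threads) / perf(cur_core, n_threads)

def phi(perf, cur_core, tgt_core, cur_threads, tgt_threads):
    """Overall speedup; orthogonality lets it factor as alpha * beta."""
    return (alpha(perf, cur_core, cur_threads, tgt_threads)
            * beta(perf, cur_core, tgt_core, tgt_threads))
```

Note that the product telescopes: alpha on the current cores times beta at the target thread count equals perf(tgt_core, tgt_threads) / perf(cur_core, cur_threads), the overall speedup.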

Experimental Setup
• cluster-based layout
• distributed, banked LLC
• directory-based MOESI protocol

Scale-out Speedup
• The model expresses the scale-out speedup in terms of three components:
  • the serial part
  • the parallelizable part
  • the multithreading penalty

Observation
• The multithreading penalty P(n) fits: P(n) = c1 · SPKI · n + c2 · MPKI · n²
  • c1, c2: modulating constants
  • SPKI: synchronization waiting cycles per kilo-instructions
  • MPKI: miss waiting cycles per kilo-instructions
  • n: thread number (the miss term scales with the thread number squared)
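A sketch of the resulting scale-out model in Python. The penalty term follows the slide above; folding it into an Amdahl's-Law-style denominator is an assumption (the paper's exact composition is not recoverable from the transcript), and f_serial, f_parallel, c1, c2 stand for offline-fitted parameters.

```python
def multithreading_penalty(n, spki, mpki, c1, c2):
    """Penalty grows linearly in thread count for synchronization (SPKI)
    and quadratically for misses (MPKI); c1 and c2 are fitted offline."""
    return c1 * spki * n + c2 * mpki * n ** 2

def scale_out_speedup(n, f_serial, f_parallel, spki, mpki, c1, c2):
    """Amdahl-style combination of the three components (assumed form)."""
    return 1.0 / (f_serial + f_parallel / n
                  + multithreading_penalty(n, spki, mpki, c1, c2))
```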

The Details of Multithreading Penalty
[Figure: penalty-model workflow, split into an offline stage and an online stage.]

Alpha Model Accuracy
The error is under 5% on average, compared with 11.4% for Amdahl's Law.

Scale-up Speedup
• The frontend: issue width W ∈ {Big, Small}
• The backend: ROB size R ∈ {Big, Small}
• How to predict the CPI on various types of cores?
[Figure: four cluster configurations C0-C3 mixing Big (B) and Small (S) cores.]

Observation
• The CPI trend over one of the two core parameters is well approximated by a power law
• The trend over the other parameter fits an exponential function well
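A sketch of how the two trends could be fitted offline, assuming a handful of measured (parameter, CPI) samples; the sample values and the scipy-based fitting are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical offline samples: one core parameter swept against measured CPI.
xs = np.array([16.0, 32.0, 64.0, 128.0])   # e.g., ROB sizes
cpi = np.array([1.9, 1.5, 1.2, 1.0])

def power_law(x, a, k):
    return a * np.power(x, k)

def exponential(x, a, k):
    return a * np.exp(k * x)

(pa, pk), _ = curve_fit(power_law, xs, cpi, p0=(2.0, -0.3))
(ea, ek), _ = curve_fit(exponential, xs, cpi, p0=(2.0, -0.01))
print(f"power law:   CPI ~ {pa:.3f} * x^{pk:.3f}")
print(f"exponential: CPI ~ {ea:.3f} * exp({ek:.4f} * x)")
```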

The Details of CPI Model
• The model combines three terms:
  • memory intensity
  • computing intensity
  • bias
[Figure: CPI-model workflow, split into an offline stage and an online stage.]

Beta Model Accuracy
The error stays below 8% on average, compared with 12.2% for PIE.

Phi Model Accuracy
The prediction error of the overall performance stays below 12% on average.

Orthogonality Validation
• Compare the measured overall speedup Φ against the product of the measured α and β (three measured values)
• For most applications, the orthogonality error is below 5% on average
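A one-function sketch of this check, assuming the three measured values are the overall speedup and the two component speedups:

```python
def orthogonality_error(phi_meas, alpha_meas, beta_meas):
    """Relative gap between the measured overall speedup and the product
    of the two measured component speedups."""
    return abs(phi_meas - alpha_meas * beta_meas) / phi_meas
```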

Application of Phi Model
• Using Phi for runtime management (sketched below):
  1. Predict online the performance speedup coming from scale-out and scale-up for any target configuration
  2. Invoke the scheduling algorithm to figure out the configuration that maximizes performance
  3. The operating system enables the specified multithreading and application-core mapping
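A sketch of one runtime-management epoch following these three steps; predict_phi, schedule, and os_apply are hypothetical hooks standing in for the Φ model, the heuristic of the next slide, and the OS interface respectively.

```python
def runtime_management_step(apps, configs, predict_phi, schedule, os_apply):
    """One management epoch (all names illustrative)."""
    # 1. Predict the Phi speedup of every app under every candidate config.
    predictions = {app: {cfg: predict_phi(app, cfg) for cfg in configs}
                   for app in apps}
    # 2. Let the scheduler pick the configuration maximizing performance.
    best = schedule(predictions)
    # 3. Ask the OS to enable the chosen multithreading and core mapping.
    os_apply(best)
```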

Phi Scheduling
• D_out
  • function: decide the thread number to spawn for each application
  • policy: "an application with a higher scale-out speedup should spawn more threads"
• D_up
  • function: decide the cores to map for each application
  • policy: "the application with the largest scale-up speedup is allocated the fastest type of cores"
• Phi
  • algorithm: Phi scheduling uses the heuristic algorithm built on these two policies to maximize performance
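A greedy sketch of the two policies, assuming per-application speedup estimates d_out and d_up; the paper's actual heuristic is not shown in the transcript, so this is only one plausible realization.

```python
def phi_schedule(apps, total_threads, core_groups, d_out, d_up):
    """Greedy sketch: threads proportional to scale-out speedup, fastest
    core types to the apps with the largest scale-up speedup."""
    # Policy 1 (D_out): apps with higher scale-out speedup spawn more threads.
    total = sum(d_out[a] for a in apps)
    threads = {a: max(1, round(total_threads * d_out[a] / total))
               for a in apps}
    # Policy 2 (D_up): the app with the largest scale-up speedup gets the
    # fastest core type, the next app the next-fastest, and so on.
    ranked = sorted(apps, key=lambda a: d_up[a], reverse=True)
    mapping = dict(zip(ranked, core_groups))
    return threads, mapping
```

Proportional rounding may leave a few threads unassigned; a full implementation would redistribute the remainder and iterate over candidate mappings rather than taking a single greedy pass.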

Performance Comparison
Phi outperforms the three baselines by 12.2% (Static), 13.3% (Bias) and 12.9% (PIE) on average.

Related Works
• Prior schemes also predict performance and optimize periodically, but each controls only one dimension:
• Only decide the number of threads/active cores:
  • CPR: Composable Performance Regression for Scalable Multiprocessor Models [Benjamin C. Lee et al., MICRO 2008]
  • FDT: Feedback-Driven Threading: Power-Efficient and High-Performance Execution of Multi-threaded Workloads on CMPs [M. Aater Suleman et al., ASPLOS 2008]
• Only decide the type of heterogeneous cores:
  • Single-ISA Heterogeneous Multi-core Architectures for Multithreaded Workload Performance [Rakesh Kumar et al., ISCA 2004]
  • Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE) [Kenzo Van Craeynest et al., ISCA 2012]

Conclusion
• An analytical model for performance prediction:
  • scale-out speedup (α)
  • scale-up speedup (β)
  • overall performance (Φ)
• Phi scheduling:
  • applied to runtime management
  • returns the performance-optimal configuration

Thanks for Your Attention