Warehouse-Scale Computing
Mu Li, Kiryong Ha
10/17/2012
15-740 Computer Architecture

Overview
Motivation: explore architectural issues as computing moves toward the cloud
1. Impact of sharing memory-subsystem resources (LLC, memory bandwidth, etc.)
2. Maximizing resource utilization by co-locating applications without hurting QoS
3. Inefficiencies of traditional processors running scale-out workloads

Overview
Paper | Problem | Approach
The Impact of Memory Subsystem Resource Sharing on Datacenter Applications | Sharing in the memory subsystem | Software
Bubble-Up | Resource utilization | Software
Clearing the Clouds | Inefficiencies for scale-out workloads | Software
Scale-Out Processors | Improving scale-out workload performance | Hardware

Impact of memory subsystem sharing

Motivation & Problem definition – Machines have multi-core, multi-socket – For better utilization, applications should share Last Level Cache(LLC) / Front Side Bus (FSB)  It is important to understand the memory sharing interaction between (datacenter) applications

Impact of thread-to-core mapping
Three TTC mappings are compared (a minimal affinity sketch follows):
- Shared cache, separate FSBs (XX..XX..)
- Shared cache, shared FSB (XXXX....)
- Separate caches, separate FSBs (X.X.X.X.)
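As a rough illustration, the sketch below pins a process to the cores marked 'X' in each pattern, assuming a hypothetical 8-core, two-socket Linux machine where cores 0-3 share one LLC/FSB and cores 4-7 share the other; the core numbering and dictionary names are illustrative, not taken from the paper.

```python
# Minimal sketch (Linux-only): pinning a process to the cores marked 'X' in a
# TTC pattern. Assumed layout: cores 0-3 share one LLC/FSB, cores 4-7 the other.
import os

MAPPINGS = {
    "shared_cache_separate_fsbs": "XX..XX..",
    "shared_cache_shared_fsb": "XXXX....",
    "separate_caches_separate_fsbs": "X.X.X.X.",
}

def cores_for(pattern: str) -> set[int]:
    """Translate an 'XX..XX..' pattern into the set of core IDs to run on."""
    return {i for i, c in enumerate(pattern) if c == "X"}

# Pin this process (pid 0 = self) and all of its threads to the chosen cores.
os.sched_setaffinity(0, cores_for(MAPPINGS["shared_cache_shared_fsb"]))
print("running on cores:", sorted(os.sched_getaffinity(0)))
```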

Impact of thread-to-core mapping: results
- Performance varies by up to 20%
- Each application shows a different trend
- TTC behavior changes depending on the co-located application

Observations
1. Performance can swing significantly based simply on how application threads are mapped to cores.
2. The best TTC mapping changes depending on the co-located program.
3. Application characteristics that impact performance: memory bus usage, cache-line sharing, cache footprint.
- Example: CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC and FSB
- STITCH uses more bus bandwidth, so a co-located CONTENT ANALYZER will contend with it on the FSB

Increasing Utilization in Warehouse-Scale Computers via Co-location

Increasing utilization via co-location
Motivation
- Cloud providers want higher resource utilization.
- However, over-provisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization.
→ Precise prediction of shared-resource interference is needed to improve utilization without violating QoS.

Bubble-Up methodology (a measurement sketch follows)
1. QoS sensitivity curve: measure an application's sensitivity by iteratively increasing the amount of pressure applied to the memory subsystem
2. Bubble score: measure the amount of pressure the application causes on a reporter
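A minimal, self-contained sketch of the curve-measurement loop; the Bubble class and measure_qos() below are trivial stubs standing in for the paper's real bubble generator and QoS monitor, and the 5 MB step size is an illustrative assumption.

```python
# Minimal sketch of measuring a QoS sensitivity curve. Bubble and measure_qos()
# are stubs; a real bubble would allocate and stream over `mb` MB of memory.
import random

class Bubble:
    def __init__(self, mb: int):
        self.mb = mb
    def stop(self):
        pass  # a real bubble would terminate its pressure loop here

def measure_qos(bubble: Bubble) -> float:
    """Stub QoS monitor: pretend QoS degrades smoothly with bubble size."""
    return max(0.0, 1.0 - 0.01 * bubble.mb + random.uniform(-0.01, 0.01))

def qos_sensitivity_curve(step_mb: int = 5, max_mb: int = 30) -> dict[int, float]:
    """Sample normalized QoS while stepping up memory-subsystem pressure."""
    curve = {}
    for size in range(step_mb, max_mb + 1, step_mb):
        bubble = Bubble(size)              # apply `size` MB of memory pressure
        curve[size] = measure_qos(bubble)  # record QoS under that pressure
        bubble.stop()
    return curve

print(qos_sensitivity_curve())
```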

Better utilization
Now we know:
1) how QoS changes with bubble size (the QoS sensitivity curve)
2) how much pressure an application puts on others (the bubble score)
→ We can co-locate applications while estimating the resulting change in QoS (a prediction sketch follows).
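To make the combination concrete, here is a minimal sketch of the prediction step, assuming a sensitivity curve sampled as {bubble_size_mb: qos} and a bubble score expressed in the same units; the curve values and the 90% QoS threshold are illustrative, not from the paper.

```python
# Minimal sketch of Bubble-Up's prediction: look up the latency-sensitive
# application's QoS at the co-runner's bubble score. Values are illustrative.
import bisect

def predict_qos(curve: dict[int, float], bubble_score: int) -> float:
    """Return the sampled QoS at the smallest measured bubble size >= score."""
    sizes = sorted(curve)
    idx = min(bisect.bisect_left(sizes, bubble_score), len(sizes) - 1)
    return curve[sizes[idx]]

# Example: a curve measured in 5 MB steps; a batch app with bubble score 15 MB.
curve = {5: 1.00, 10: 0.97, 15: 0.92, 20: 0.85, 25: 0.74}
if predict_qos(curve, 15) >= 0.90:  # hypothetical 90%-of-baseline QoS policy
    print("safe to co-locate")
else:
    print("co-location would violate QoS")
```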

Scale-out workloads

Examples:
- Data Serving
- MapReduce
- Media Streaming
- SAT Solver
- Web Frontend
- Web Search

Execution-time breakdown
A major part of execution time is spent waiting on cache misses
→ a clear micro-architectural mismatch

Frontend inefficiencies
- Cores idle due to high instruction-cache miss rates
- L2 caches increase average instruction-fetch latency
- Excessive LLC capacity leads to long instruction-fetch latency
How to improve? Bring instructions closer to the cores

Core inefficiencies
- Low instruction-level parallelism precludes effective use of the full core width
- Low memory-level parallelism underutilizes reorder buffers and load-store queues
How to improve? Run many threads at once: a multi-threaded, multi-core architecture

Data-access inefficiencies
- A large LLC consumes area but does not improve performance
- Simple data prefetchers are ineffective
How to improve? Shrink the LLC and use the freed area for more cores

Bandwidth inefficiencies
- Lack of data sharing makes rich coherence and connectivity largely unnecessary
- Off-chip bandwidth exceeds needs by an order of magnitude
How to improve? Scale back the on-chip interconnect and off-chip memory bus to make room for more cores

Scale-out processors
In short: the LLC, on-chip interconnect, and memory bus are all oversized, while there are too few cores.
Scale-out processor designs rebalance the chip accordingly, improving throughput by 5x-6.5x!

Q&A or Discussion

Supplementary slides

Datacenter applications (Google's production applications)
Application | Metric | Type
content analyzer | throughput | latency-sensitive
bigtable | average latency | latency-sensitive
websearch | queries per second | latency-sensitive
stitcher | - | batch
protobuf | - | batch

Key takeaways
TTC behavior is mostly determined by:
- Memory bus usage (for FSB sharing)
- Data sharing: cache-line sharing
- Cache footprint: use last-level cache misses to estimate the footprint size (a measurement sketch follows)
Example
- CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC and FSB
- STITCH actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share an FSB with STITCH
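One rough way to get at the footprint proxy on Linux is to count LLC misses with perf; the sketch below shells out to `perf stat` (LLC-load-misses is a standard perf generic event, though it is not supported on every machine, and the example command is arbitrary).

```python
# Rough sketch: count a command's LLC load misses as a cache-footprint proxy.
# Requires Linux with `perf` installed and an LLC-load-misses event.
import subprocess

def llc_load_misses(cmd: list[str]) -> int:
    """Run `cmd` under `perf stat` and return the LLC-load-misses count."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "LLC-load-misses"] + cmd,
        capture_output=True, text=True,
    )
    for line in result.stderr.splitlines():  # perf writes counters to stderr
        if "LLC-load-misses" in line:
            return int(line.split(",")[0])   # first CSV field is the raw count
    raise RuntimeError("LLC-load-misses counter not reported")

# Example: a larger miss count on the same input suggests a larger footprint.
print(llc_load_misses(["ls", "-R", "/usr/share"]))
```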

Prediction accuracy for pairwise co-locations of Google applications: 1% prediction error on average