Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks
Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu


Goals

We study widely used Google consumer workloads (the Chrome browser, TensorFlow Mobile, and the VP9 video codec for playback and capture) to identify the major sources of energy consumption:

1. Understand the data-movement bottlenecks in modern consumer workloads. First key observation: 62.7% of total system energy is spent on data movement between the compute units (CPU and its L1/L2 caches on the SoC) and DRAM.
2. Analyze the benefits of processing-in-memory (PIM) in mitigating data-movement cost. Second key observation: a significant fraction of data movement often comes from simple functions.
3. Investigate the PIM logic that can maximize energy efficiency given the limited area and energy budget of consumer devices.

Chrome Browser

Chrome, Google's web browser, employs a multi-process architecture: one main Chrome process plus a separate render process for each open tab. Each render process runs the rendering pipeline: the HTML and CSS parsers build the render tree, which is then laid out, rasterized, and composited.

Browser scrolling. During page scrolling, 41.9% of total system energy is spent on texture tiling and color blitting. These two functions generate 49.1% of total data movement, and 77% of total energy consumption goes to data movement. Both functions can benefit from PIM execution and can be implemented in PIM logic.

Texture tiling: after rasterization, Chrome converts the linear (row-major) bitmap into tiled textures before compositing. On a CPU-only system, the CPU must read the bitmap from memory and write the tiles back, so texture tiling is dominated by data movement; executing it in PIM eliminates this off-chip traffic. Texture tiling is a good candidate for PIM execution because it requires only simple primitives: memcopy, bitwise operations, and simple arithmetic. A PIM core needs 9.4% of the area available for PIM logic, and a PIM accelerator needs 7.1%, so both are feasible for in-memory texture tiling.

Color blitting: generates a large amount of data movement and accounts for 19.1% of total system energy during scrolling. It is also a good candidate for PIM execution because it requires only low-cost operations (memset, simple arithmetic, and shifts), and it is feasible to implement in PIM logic.

Tab switching. The main operations during tab switching are the context switch and loading the new page. Chrome uses compression (ZRAM, an in-memory compressed swap space) to reduce each inactive tab's memory footprint. To study data movement during tab switching, we perform an experiment: a user opens 50 tabs (the most-accessed websites, including Google Docs, Gmail, Google Calendar, WordPress, Twitter, and an animation-heavy page), scrolls through each for a few seconds, and switches to the next tab. Two findings:

1. Compression and decompression contribute 18.1% of total system energy.
2. 19.6 GB of data moves between the CPU and ZRAM.

On a CPU-only system, swapping out N pages means reading the uncompressed pages into the CPU, compressing them, and writing them back to ZRAM. With PIM, pages are compressed and decompressed inside memory, so there is no off-chip data movement.

TensorFlow Mobile

In TensorFlow Mobile, Google's machine-learning inference framework, 57.3% of inference (prediction) energy is spent on data movement, and 54.4% of that data-movement energy comes from packing/unpacking and quantization.

Packing: reorders the elements of matrices to minimize cache misses. It consumes up to 40% of inference energy and 31% of inference execution time, and packing's data movement alone accounts for up to 35.3% of inference energy. Packing is a simple data-reorganization process that requires only simple arithmetic.

Quantization: converts 32-bit floating-point values into 8-bit integers. It consumes up to 16.8% of inference energy and 16.1% of inference execution time, and the majority of quantization energy comes from data movement. Quantization is a simple data-conversion operation that requires only shift, addition, and multiplication.

Video Playback

During video playback, compressed video is fed to the VP9 decoder and the decoded frames are sent to the display. Here, 63.5% of system energy is spent on data movement, and 80.4% of the data-movement energy comes from two kernels:

Sub-pixel interpolation: interpolates the value of pixels at non-integer locations.
Deblocking filter: a simple low-pass filter that attempts to remove pixel discontinuities at block boundaries.

Offloading these kernels to PIM reduces energy consumption by 49.1% on average with a PIM core and by 55.4% with a PIM accelerator.

Video Capture

During video capture, captured video is fed to the VP9 encoder, which produces compressed video. Here, 59.1% of system energy is spent on data movement, and the majority of the data-movement energy comes from motion estimation, which compresses frames by exploiting the temporal redundancy between them. Offloading to PIM reduces energy consumption by 44.6% on average with a PIM core and by 54.2% with a PIM accelerator.
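The texture tiling step described above is, at its core, a data-reorganization loop. The sketch below illustrates the idea; the function name and the tile sizes are hypothetical (this is not Chrome's implementation, and real GPUs use larger tiles such as 16x16):

```python
def tile_texture(linear, width, height, tile_w=4, tile_h=4):
    """Reorder a row-major (linear) bitmap into tile-major order.

    `linear` is a flat list with one value per pixel. Every pixel is
    read once and written once, which is why a CPU-only implementation
    is dominated by data movement.
    """
    assert width % tile_w == 0 and height % tile_h == 0
    tiled = []
    for ty in range(0, height, tile_h):        # walk tiles top to bottom
        for tx in range(0, width, tile_w):     # and left to right
            for y in range(ty, ty + tile_h):   # copy one tile's rows
                row_start = y * width + tx
                tiled.extend(linear[row_start:row_start + tile_w])
    return tiled
```

Because the loop body is only address arithmetic plus a copy, the same logic maps directly onto simple in-memory logic, which is what makes it a good PIM candidate.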
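Color blitting is similarly simple. A minimal sketch of its dominant inner loop, a memset-style rectangle fill into a flat framebuffer (function name and layout are illustrative assumptions, not Chrome's Skia code):

```python
def blit_fill(framebuffer, fb_width, x, y, w, h, color):
    """Fill a w-by-h rectangle of a flat, row-major framebuffer with
    `color`. Each row is a contiguous memset-like write, which is the
    low-cost operation that makes blitting amenable to PIM."""
    for row in range(y, y + h):
        start = row * fb_width + x
        framebuffer[start:start + w] = [color] * w
    return framebuffer
```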
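The ZRAM swap path used during tab switching can be sketched with Python's standard `zlib` module. This is only a stand-in for illustration: Linux zram actually uses fast compressors such as LZO or LZ4, and the function names here are hypothetical:

```python
import zlib

def swap_out(page_bytes):
    """Compress an uncompressed page before storing it in the
    in-memory ZRAM pool. On a CPU-only system the page must first be
    moved to the CPU; PIM performs this step inside memory."""
    return zlib.compress(page_bytes)

def swap_in(compressed):
    """Decompress a page when its tab becomes active again."""
    return zlib.decompress(compressed)
```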
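Packing in TensorFlow Mobile reorders matrix elements so the GEMM inner loop streams through memory sequentially. The sketch below shows one plausible scheme, panelizing rows and storing each panel column-major; the exact layout and block size in the real gemmlowp-style kernels differ:

```python
def pack_matrix(m, rows, cols, block=2):
    """Reorder a flat row-major matrix into row panels of `block` rows,
    stored column-major inside each panel, so consecutive multiply-
    accumulate steps read consecutive memory locations."""
    packed = []
    for r0 in range(0, rows, block):      # one panel of `block` rows
        for c in range(cols):             # column-major within the panel
            for r in range(r0, r0 + block):
                packed.append(m[r * cols + c])
    return packed
```

Every element is touched exactly once with trivial index arithmetic, which is why packing's cost is almost entirely data movement.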
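Quantization is a per-element affine conversion. The sketch below follows the common scale/zero-point scheme (similar in spirit to TensorFlow Lite's 8-bit quantization, though the real kernels use fixed-point shifts and multiplies rather than Python floats):

```python
def quantize(values, scale, zero_point):
    """Map float values to the uint8 range: q = round(v / scale) + zero_point,
    clamped to [0, 255]. Only rounding, addition, and a clamp per element."""
    out = []
    for v in values:
        q = int(round(v / scale)) + zero_point
        out.append(max(0, min(255, q)))
    return out

def dequantize(qvalues, scale, zero_point):
    """Inverse mapping back to floats: v = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]
```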
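The two VP9 playback kernels can be sketched in toy form. These are deliberate simplifications under stated assumptions: real VP9 sub-pixel interpolation uses longer filter taps at many fractional positions, and the real deblocking filter is adaptive, but the arithmetic character (averages, shifts) is the same:

```python
def half_pixel(frame, width, x, y):
    """Interpolate the pixel at (x + 0.5, y) in a flat row-major frame
    as the rounded average of its two horizontal neighbours."""
    a = frame[y * width + x]
    b = frame[y * width + x + 1]
    return (a + b + 1) >> 1            # rounded average via a shift

def deblock_edge(left, right):
    """Toy low-pass filter across a block boundary: replace the two
    boundary pixels with their average to smooth the discontinuity."""
    avg = (left[-1] + right[0]) // 2
    return left[:-1] + [avg], [avg] + right[1:]
```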
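Motion estimation during capture searches the previous frame for the block most similar to the current one, usually by minimizing the sum of absolute differences (SAD). A minimal full-search sketch (the function names are illustrative; real encoders search a 2-D window with fast early-exit heuristics):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def best_match(ref_blocks, cur_block):
    """Exhaustively compare the current block against candidate blocks
    from the reference frame; return the index of the closest match."""
    return min(range(len(ref_blocks)),
               key=lambda i: sad(ref_blocks[i], cur_block))
```

Every candidate comparison re-reads reference pixels from memory, which is why motion estimation dominates the encoder's data-movement energy.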