Andreas Sandberg ARM Research

Slides:



Advertisements
Similar presentations
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,
Advertisements

JUST-IN-TIME COMPILATION
© 2007 Eaton Corporation. All rights reserved. LabVIEW State Machine Architectures Presented By Scott Sirrine Eaton Corporation.
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Chapter 10- Instruction set architectures
Department of Computer Science iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf Karthikeyan Sankaralingam Vertical.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Architectures of Digital Information Systems Part 1: Interrupts and DMA dr.ir.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Processor support devices Part 1:Interrupts and shared memory dr.ir. A.C. Verschueren.
Khaled A. Al-Utaibi  Computers are Every Where  What is Computer Engineering?  Design Levels  Computer Engineering Fields  What.
GPGPU Introduction Alan Gray EPCC The University of Edinburgh.
Basic Input Output System
COMP3221: Microprocessors and Embedded Systems Lecture 15: Interrupts I Lecturer: Hui Wu Session 1, 2005.
G Robert Grimm New York University Disco.
Disco Running Commodity Operating Systems on Scalable Multiprocessors.
CS 300 – Lecture 23 Intro to Computer Architecture / Assembly Language Virtual Memory Pipelining.
Chapter 1 and 2 Computer System and Operating System Overview
Virtual Memory and Paging J. Nelson Amaral. Large Data Sets Size of address space: – 32-bit machines: 2 32 = 4 GB – 64-bit machines: 2 64 = a huge number.
Contemporary Languages in Parallel Computing Raymond Hummel.
Sort-Last Parallel Rendering for Viewing Extremely Large Data Sets on Tile Displays Paper by Kenneth Moreland, Brian Wylie, and Constantine Pavlakos Presented.
Digital Graphics and Computers. Hardware and Software Working with graphic images requires suitable hardware and software to produce the best results.
Introduction to Symmetric Multiprocessors Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı
Input / Output CS 537 – Introduction to Operating Systems.
Introduction by Dr. Amin Danial Asham. References Operating System Concepts ABRAHAM SILBERSCHATZ, PETER BAER GALVIN, and GREG GAGNE.
Computer Architecture ECE 4801 Berk Sunar Erkay Savas.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Overview: Using Hardware.
GPU Programming Robert Hero Quick Overview (The Old Way) Graphics cards process Triangles Graphics cards process Triangles Quads.
Enhancing GPU for Scientific Computing Some thoughts.
Graphics on Key by Eyal Sarfati and Eran Gilat Supervised by Prof. Shmuel Wimer, Amnon Stanislavsky and Mike Sumszyk 1.
CPU Cache Prefetching Timing Evaluations of Hardware Implementation Ravikiran Channagire & Ramandeep Buttar ECE7995 : Presentation.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Codeplay CEO © Copyright 2012 Codeplay Software Ltd 45 York Place Edinburgh EH1 3HP United Kingdom Visit us at The unique challenges of.
Introduction and Overview Questions answered in this lecture: What is an operating system? How have operating systems evolved? Why study operating systems?
1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.
GPUs and Accelerators Jonathan Coens Lawrence Tan Yanlin Li.
Architecture Support for OS CSCI 444/544 Operating Systems Fall 2008.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Overview Part 2: History (continued)
10/27: Lecture Topics Survey results Current Architectural Trends Operating Systems Intro –What is an OS? –Issues in operating systems.
Interrupts By Ryan Morris. Overview ● I/O Paradigm ● Synchronization ● Polling ● Control and Status Registers ● Interrupt Driven I/O ● Importance of Interrupts.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
Srihari Makineni & Ravi Iyer Communications Technology Lab
THE FUNCTIONS OF OPERATING SYSTEMS TOPIC 1 THE FUNCTIONS OF OPERATING SYSTEMS CONTENT: 1. Features of operating systems 2. Scheduling 3. Interrupt handling.
A Closer Look At GPUs By Kayvon Fatahalian and Mike Houston Presented by Richard Stocker.
2003 Dominic Swayne1 Microsoft Disk Operating System and PC DOS CS-550-1: Operating Systems Fall 2003 Dominic Swayne.
Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 5: MIPS Instructions I
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
OPERATING SYSTEMS CS 3530 Summer 2014 Systems and Models Chapter 03.
Lecture on Central Process Unit (CPU)
INSE lecture 18 – Embedded systems  what they are  hardware for embedded systems  kernels for embedded systems  building embedded systems  testing.
Purpose of Operating System Monil Adhikari. Agenda Introduction Responsibilities of Operating System User Interfaces Command Line Interface Graphical.
CISC. What is it?  CISC - Complex Instruction Set Computer  CISC is a design philosophy that:  1) uses microcode instruction sets  2) uses larger.
Layers Architecture Pattern Source: Pattern-Oriented Software Architecture, Vol. 1, Buschmann, et al.
Processor Level Parallelism 2. How We Got Here Developments in PC CPUs.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Overview: Using Hardware.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Embedded Real-Time Systems Processing interrupts Lecturer Department University.
Lecture 01: Computer Architecture overview. Our Goals ● Have a better understanding of computer architecture – Write better (more efficient) programs.
Multi-Core CPUs Matt Kuehn. Roadmap ► Intel vs AMD ► Early multi-core processors ► Threads vs Physical Cores ► Multithreading and Multi-core processing.
Virtualization.
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
Chapter 1: A Tour of Computer Systems
Structural Simulation Toolkit / Gem5 Integration
KERNEL ARCHITECTURE.
Mobile Development Workshop
Introduction to Heterogeneous Parallel Computing
Chapter 1 Introduction.
The Main Features of Operating Systems
Computer Architecture
In Today’s Class.. General Kernel Responsibilities Kernel Organization
Chapter 13: I/O Systems “The two main jobs of a computer are I/O and [CPU] processing. In many cases, the main job is I/O, and the [CPU] processing is.
Presentation transcript:

Andreas Sandberg ARM Research NoMali: Understanding the Impact of Software Rendering Using a Stub GPU Andreas Sandberg ARM Research

What can be modeled in gem5? Subsystems that can be modeled in gem5 No GPU model! Simple models exist Note: gem5 models the subsystems above, not the actual products.

What a real system does Modern mobile systems contain a GPU CPU L1D L1I LPDDR3 GPU Android Workload L2 Display Controller GPU drivers Modern mobile systems contain a GPU Even watches have GPUs nowadays! The GPU is obviously used for 3D … but also used for 2D: Composition & alpha blending Rotation & scaling Driver stack is complex: Easily 100k+ lines of code Contains an optimizing compiler Can contribute to around 10M instructions/frame for complex workloads

What we normally model Software renderer instead of a real GPU CPU L1D L1I LPDDR3 Android Workload L2 Display Controller SW Renderer Software renderer instead of a real GPU Optimization friendly code Can be vectorized Easy-to-predict branches Large memory foot print Workload and software renderer compete for resources Can significantly skew core behavior Affects 2D applications and 3D applications

What about simulating the GPU? CPU L1D L1I LPDDR3 GPU Android Workload L2 Display Controller GPU drivers Pros: Golden reference – everything looks like a real system! Captures memory system interactions Graphics output Cons: GPU models add a lot of simulation overhead Realistic models not available to the research community Solution: Don’t simulate the GPU!

Introducing NoMali Looks like a GPU Runs the full driver stack CPU L1D L1I LPDDR3 NoMali Android Workload L2 Display Controller GPU drivers Looks like a GPU Provides the same register interface Simulates interrupts Runs the full driver stack Pretends to run rendering jobs Doesn’t render anything Signals job completion immediately Available to the community … but doesn’t produce any display output

Mali GPU overview Application Memory Registers & Interrupts Driver CPU Job Descriptors Input Data Intermediate Data Output Data Shared Memory See: http://malideveloper.arm.com/downloads/ARM_Game_Developer_Days/PDFs/2-Mali-GPU-architecture-overview-and-tile-local-storage.pdf Shader Cores Job Manager Display Processor Midgard Hardware See AnandTech for a good architecture overview

Mali GPU overview: The Job Manager Application Abstracts the underlying µ-architecture Controls most aspects of the GPU: Job scheduling Interupts Address translation Caches … Job submission through a register interface Job parameters in main memory: Job Descriptor Interrupt on job completion Driver Job Descriptors Input Data See http://www.arm.com/files/event/8_steve_steele_arm_gpus_now_and_in_the_future.pdf Job Manager

NoMali overview Application Memory Registers & Interrupts Driver No output data CPU Job Descriptors Input Data Intermediate Data Output Data Shared Memory See: http://malideveloper.arm.com/downloads/ARM_Game_Developer_Days/PDFs/2-Mali-GPU-architecture-overview-and-tile-local-storage.pdf Blank screen Shader Cores NoMali Job Manager Job Manager Display Processor Midgard Hardware No rendering

Comparing simulation strategies Three experiments: Software rendering, NoMali, GPU reference Experimental setup: Android 4.4 (KitKat ) BBench Identical disk images in all experiments Software rendering results in useless CPU performance NoMali is within 2% for CPU performance … and 35% faster! 103% 73% 135% 54%

Model limitation: GPU bandwidth GPU memory system interactions not simulated Could potentially be faked using traffic generators Absolute difference for bbench ~75 MiB/s Not likely to be a problem for CPU-centric studies Graphics workloads would experience a larger impact from the GPU NoMali was never intended for that use case. 103% 73% 135% 54%

gem5 Issue: inefficient uncacheable memory Mali Midgard-series GPUs are IO coherent GPU snoops into CPU caches CPUs can’t snoop into the GPU’s caches The driver disables caching for many regions used by both the GPU and CPU Wasn’t handled efficiently by gem5 Uncacheable accesses were always strictly ordered Resulted in CPIs ~50 (should’ve been ~2) Fix committed in early May 2015 Workload Android GPU drivers CPU L1D L1I CPU L1D L2 L1I LPDDR3 Mali GPU GPU Memory System

gem5 integration plan NoMali Model available on GitHub https://github.com/ARM-software/nomali-model gem5 integration on Review Board [RB2867, RB2869] Requires drivers Will make use of drivers available from MaliDeveloper Requires a recent Android version (KitKat or LolliPop) Android KitKat build instructions will be on the Wiki shortly

Questions?