Scalable Multi-Cache Simulation Using GPUs
Michael Moeng, Sangyeun Cho, Rami Melhem – University of Pittsburgh


Background
Architects are simulating more cores
▫Increasing simulation times
Cannot keep doing single-threaded simulations if we want to see results in a reasonable time frame
[Diagram: the host machine simulates the target machine]

Parallel Simulation Overview
A number of people have begun researching multithreaded simulation
Multithreaded simulations have some key limitations
▫Many fewer host cores than cores in the target machine
▫Slow communication between threads
Graphics processors have high fine-grained parallelism
▫Many more cores
▫Cheap communication within 'blocks' (a software unit)
We propose using GPUs to accelerate timing simulation
The CPU acts as the functional feeder

Contributions
Introduce GPUs as a tool for architectural timing simulation
Implement a proof-of-concept multi-cache simulator
Study the strengths and weaknesses of GPU-based timing simulation

Outline
GPU programming with CUDA
Multi-cache simulation using CUDA
Performance results vs. CPU
▫Impact of thread interaction
▫Optimizations
Conclusions

CUDA Dataflow
[Diagram: the host (CPU) with its main memory is connected over the PCIe bus to the device (GPU) with its SIMD processors and graphics memory]
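To make the diagram concrete, below is a generic, self-contained CUDA example (an illustration, not taken from the presentation): the host stages an array in main memory, copies it across the PCIe bus into graphics memory, launches a kernel on the SIMD processors, and copies the result back.

```cuda
#include <cuda_runtime.h>

// Double each element; runs on the GPU's SIMD processors.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // main memory -> graphics memory (PCIe)
    scale<<<(n + 255) / 256, 256>>>(d, n);                        // kernel launch on the device
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // results back to the host
    cudaFree(d);
    delete[] h;
    return 0;
}
```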

Dataflow – Concurrent CPU/GPU
[Same diagram, highlighting concurrent execution on the CPU and the GPU]

Dataflow – Concurrent Kernels
[Same diagram, highlighting concurrent kernel execution on the GPU]

GPU-driven Cache Simulation
Trace-driven simulation
[Diagram: the host (CPU) sends trace data to the device (GPU); an L1 kernel passes L1 misses to an L2 kernel, and statistics are returned to the host]

GPU-driven Cache Simulation
Each cache way is simulated by a thread (see the sketch below)
▫Parallel address lookup
▫Communicate via fast shared memory
Ways from 4 caches form a block
▫With 16-way caches, 64 threads per block
Cache state (tags + metadata) is stored in global memory – rely on caching for fast access
Experimented with a tree-based reduction
▫No performance improvement (the tree is small)
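As an illustration of the way-per-thread mapping, here is a minimal CUDA sketch (not the authors' code; the trace layout, tag-array layout, and names such as l1_way_lookup are hypothetical): each of the 64 threads in a block owns one way of one of the block's 4 caches, compares its own tag in parallel, and signals a hit through shared memory.

```cuda
#define WAYS_PER_CACHE 16
#define CACHES_PER_BLOCK 4

// One thread per cache way; 4 caches x 16 ways = 64 threads per block.
__global__ void l1_way_lookup(const unsigned long long *trace_addrs, // [cache][access]
                              int accesses_per_cache,
                              unsigned long long *tags,              // [cache][set][way] in global memory
                              int num_sets,
                              int *hit_counts)                       // one counter per cache
{
    int local_cache = threadIdx.x / WAYS_PER_CACHE;   // 0..3 within this block
    int way         = threadIdx.x % WAYS_PER_CACHE;
    int cache       = blockIdx.x * CACHES_PER_BLOCK + local_cache;

    __shared__ int hit_flag[CACHES_PER_BLOCK];        // fast intra-block communication

    for (int i = 0; i < accesses_per_cache; ++i) {
        unsigned long long addr = trace_addrs[(size_t)cache * accesses_per_cache + i];
        int set                 = (int)(addr % num_sets);   // simplified indexing,
        unsigned long long tag  = addr / num_sets;           // block offset ignored

        if (way == 0) hit_flag[local_cache] = 0;
        __syncthreads();

        // Parallel tag lookup: every way compares its own tag in the set.
        size_t idx = ((size_t)cache * num_sets + set) * WAYS_PER_CACHE + way;
        if (tags[idx] == tag) hit_flag[local_cache] = 1;      // at most one way matches
        __syncthreads();

        if (way == 0 && hit_flag[local_cache]) hit_counts[cache]++;
        // Miss handling, replacement, and metadata updates omitted.
    }
}
```

Per-cache hit counters live in global memory; only the way-0 thread of each cache updates them, so no atomics are needed for the counts themselves.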

Block-to-Block Interactions
Within a block, a cheap barrier (__syncthreads) keeps threads consistent, so there is no inaccuracy
Shared L2 (see the sketch below)
▫Upon a miss, L1 threads determine the L2 home tile
▫Atomically add the miss to a global memory buffer for the L2 kernel to process
Write invalidations
▫Upon a write, the L1 thread checks global memory for the tag state of the other L1 caches
▫Atomically invalidate matching lines in global memory
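A sketch of the shared-L2 interaction under the stated assumptions (the buffer layout and names such as push_l2_miss are illustrative, not the authors'): the L1 thread computes an address-interleaved home tile and uses atomicAdd to reserve a slot in that tile's miss buffer in global memory, which the L2 kernel drains later.

```cuda
// Request record appended by L1 threads and consumed by the L2 kernel.
struct L2Request { unsigned long long addr; int src_cache; };

__device__ void push_l2_miss(L2Request *l2_buf,            // [tile][slot]
                             unsigned int *l2_buf_count,   // one counter per L2 tile
                             unsigned long long addr, int src_cache,
                             int num_l2_tiles, unsigned int buf_capacity)
{
    int home_tile = (int)(addr % num_l2_tiles);             // simple address-interleaved home
    // atomicAdd reserves a unique slot even though blocks never synchronize.
    unsigned int slot = atomicAdd(&l2_buf_count[home_tile], 1u);
    if (slot < buf_capacity) {                               // overflowing requests are dropped in this sketch
        l2_buf[(size_t)home_tile * buf_capacity + slot].addr = addr;
        l2_buf[(size_t)home_tile * buf_capacity + slot].src_cache = src_cache;
    }
}
```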

Evaluation
Feed traces of memory accesses to the simulated cache hierarchy
▫Mix of PARSEC benchmarks
L1/L2 cache hierarchy with private or shared L2
GeForce GTS 450 – 192 cores (a low- to mid-range part)
▫Fermi GPU – caching, simultaneous kernels
▫Newer NVIDIA GPUs offer considerably more cores

Private L2
[Results chart]
▫CPU simulation time scales linearly with the number of caches
▫The GPU sees a 13-60% slowdown going from 32 to 96 caches
▫A multithreaded configuration is shown for comparison

Shared L2
[Results chart]
▫Unbalanced traffic load concentrates on a few tiles
▫Execution is largely serialized
▫A multithreaded configuration is shown for comparison

Inaccuracy from Thread Interaction
CUDA currently has little support for synchronization between blocks
Without that support, communication between threads in different blocks is subject to error:
▫Shared L2 caches – miss rate
▫Write invalidations – invalidation count

Controlling Error
The only way to synchronize blocks is between kernel invocations (see the host-side sketch below)
▫The number of trace items processed per kernel invocation controls the error
▫Similar techniques are used in parallel CPU simulation
Varying the trace chunk size trades performance against error
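A host-side sketch of this chunking scheme (assumed structure; l1_kernel and l2_kernel stand in for the simulator's actual kernels): each trace chunk is one pair of kernel launches, and the launch boundary is the only point where all blocks are known to be in sync, so the chunk size directly trades accuracy against launch and transfer overhead.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the simulator's L1 and L2 passes.
__global__ void l1_kernel(const unsigned long long *trace, int n);
__global__ void l2_kernel(int n);

void simulate_trace(const unsigned long long *h_trace, size_t trace_len,
                    size_t chunk_size, int blocks, int threads_per_block)
{
    unsigned long long *d_chunk;
    cudaMalloc(&d_chunk, chunk_size * sizeof(unsigned long long));

    for (size_t off = 0; off < trace_len; off += chunk_size) {
        size_t n = std::min(chunk_size, trace_len - off);

        // Blocking copy; it also waits for the previous iteration's kernels.
        cudaMemcpy(d_chunk, h_trace + off, n * sizeof(unsigned long long),
                   cudaMemcpyHostToDevice);

        // One chunk = one pair of launches. Blocks only re-synchronize at this
        // boundary, so smaller chunks mean less drift between blocks (less
        // error) but more launch and transfer overhead.
        l1_kernel<<<blocks, threads_per_block>>>(d_chunk, (int)n);
        l2_kernel<<<blocks, threads_per_block>>>((int)n);
    }
    cudaDeviceSynchronize();
    cudaFree(d_chunk);
}
```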

Invalidation – Performance vs. Error
[Results chart]

Shared L2 – Miss Rate Error
[Results chart]
▫Largely serialized execution minimizes the error

Concurrent Execution
Transfer memory while executing kernels
Run the L1 kernel concurrently with the L2 kernel
[Pipeline diagram: repeating stages of trace I/O, L1 kernel, and L2 kernel overlapped across iterations]
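One way to realize this pipeline (a sketch under assumptions, not the authors' implementation) is double buffering with CUDA streams: the next chunk's trace I/O is issued asynchronously from pinned host memory while the L1 and L2 kernels, placed in separate streams so a Fermi-class GPU can run them concurrently, work on the current chunk.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Same hypothetical kernels as in the previous sketch.
__global__ void l1_kernel(const unsigned long long *trace, int n);
__global__ void l2_kernel(int n);

// h_trace must be pinned (cudaHostAlloc / cudaHostRegister) for the
// asynchronous copy to overlap with kernel execution.
void pipelined_simulation(const unsigned long long *h_trace, size_t trace_len,
                          size_t chunk, int blocks, int tpb)
{
    cudaStream_t copy_s, l1_s, l2_s;
    cudaStreamCreate(&copy_s);
    cudaStreamCreate(&l1_s);
    cudaStreamCreate(&l2_s);

    unsigned long long *d_buf[2];
    cudaMalloc(&d_buf[0], chunk * sizeof(unsigned long long));
    cudaMalloc(&d_buf[1], chunk * sizeof(unsigned long long));

    size_t num_chunks = (trace_len + chunk - 1) / chunk;

    // Prime the pipeline with chunk 0.
    cudaMemcpyAsync(d_buf[0], h_trace,
                    std::min(chunk, trace_len) * sizeof(unsigned long long),
                    cudaMemcpyHostToDevice, copy_s);

    for (size_t i = 0; i < num_chunks; ++i) {
        size_t off = i * chunk;
        size_t n = std::min(chunk, trace_len - off);

        cudaStreamSynchronize(copy_s);            // chunk i is now on the device

        if (i + 1 < num_chunks) {                 // stage chunk i+1 (trace I/O)
            size_t next_n = std::min(chunk, trace_len - (off + chunk));
            cudaMemcpyAsync(d_buf[(i + 1) % 2], h_trace + off + chunk,
                            next_n * sizeof(unsigned long long),
                            cudaMemcpyHostToDevice, copy_s);
        }

        // Separate streams: the copy above and the two kernels below can all
        // be in flight at the same time on a Fermi-class GPU.
        l1_kernel<<<blocks, tpb, 0, l1_s>>>(d_buf[i % 2], (int)n);
        l2_kernel<<<blocks, tpb, 0, l2_s>>>((int)n);

        cudaStreamSynchronize(l1_s);              // keep double-buffer reuse safe
        cudaStreamSynchronize(l2_s);
    }

    cudaStreamDestroy(copy_s);
    cudaStreamDestroy(l1_s);
    cudaStreamDestroy(l2_s);
    cudaFree(d_buf[0]);
    cudaFree(d_buf[1]);
}
```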

Concurrent Execution Speedup
[Results chart, per cache model]
▫Speedup comes from better utilization of the GPU and from overlapping memory transfer with computation
▫Greater benefit when more data is transferred
▫Benefits end when the GPU is fully utilized
▫Load imbalance among L2 slices

CUDA Block Mapping
For maximum throughput, CUDA requires a balance between the number of blocks and the threads per block
▫Each block can support many more threads than the number of ways in one cache
We map 4 caches to each block for maximum throughput (a launch-configuration sketch follows below)
We also study the tradeoff of mapping fewer caches per block
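A small sketch of the launch configuration implied here (the 16-way and 4-caches-per-block numbers come from the earlier slide; caches_per_block and launch_l1 are placeholders): more caches per block means fewer, larger blocks and higher throughput, while fewer caches per block means more blocks and lower per-block latency.

```cuda
#include <cuda_runtime.h>

#define WAYS_PER_CACHE 16

// Map `caches_per_block` simulated caches onto each CUDA block
// (4 in the highest-throughput configuration on the slides).
void launch_l1(int num_caches, int caches_per_block)
{
    dim3 block(caches_per_block * WAYS_PER_CACHE);                   // e.g. 4 * 16 = 64 threads
    dim3 grid((num_caches + caches_per_block - 1) / caches_per_block);
    // l1_way_lookup<<<grid, block>>>(...);  // kernel from the earlier sketch
}
```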

Block Mapping – Scaling
[Results chart]
▫One configuration saturates at 32 tiles, another at 64 tiles
▫More caches per block: higher throughput
▫Fewer caches per block: lower latency

Conclusions
Even with a low-end GPU, we can scale to many more caches with only a small slowdown
With a GPU co-processor we can leverage both CPU and GPU processing time
It is crucial to balance the load between blocks

Future Work
Better load balancing (adaptive mapping)
A more detailed timing model
Comparisons against multithreaded simulation
Studies with higher-capacity GPUs