Toward Cache-Friendly Hardware Accelerators

Toward Cache-Friendly Hardware Accelerators Yakun Sophia Shao, Sam Xi, Viji Srinivasan, Gu-Yeon Wei, David Brooks

Please do not distribute 4/17/2017 More accelerators: out-of-core accelerators now cover much of a modern mobile SoC die. [Die photo from Chipworks, www.anandtech.com/show/8562/chipworks-a8; accelerators annotated by Sophia Shao (Harvard); area estimates from Maltiel Consulting, http://www.maltiel-consulting.com/Next-Apple-iPhone-iPad-A-Processor.html]

Today’s SoC: the TI OMAP 4 integrates ARM cores, a GPU, and a DSP alongside fixed-function accelerators (SD, USB, audio, video, face detection, imaging) on a shared system bus, with DMA engines and scratchpad memories (SPMs) serving the accelerators.

Cache-Friendly Accelerator Interface. IBM’s Coherent Accelerator Processor Interface (CAPI) attaches accelerators to a POWER8 host over the PCIe bus and provides virtual addressing and data caching, enabling an easier, more natural programming model.
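To make the programming-model point concrete, here is a minimal sketch of what shared virtual addressing buys. `accel_scale` is a hypothetical accelerator kernel (an assumption for illustration, modeled as a plain C function, not a CAPI API): the host hands it an ordinary pointer, with no DMA copy-in/copy-out and no hand-managed physical addresses.

```c
/* Sketch of the programming model a coherent, virtually addressed
 * interface (like CAPI) enables. accel_scale() is a hypothetical
 * accelerator kernel, modeled here as a plain function: with a shared
 * virtual address space the host hands it an ordinary pointer. */
#include <stddef.h>

/* Hypothetical accelerator kernel: scales a vector in place.
 * With coherent caching, the result is visible to the CPU without
 * explicit flushes or copy-back. */
void accel_scale(float *data, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= k;
}
```

Without a shared address space, the same call would need an explicit DMA copy into the accelerator's scratchpad and a copy back, which is exactly the burden the slide's "easier, natural programming model" refers to.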

It’s the beginning, not the end.

Not one size fits all. Different applications have different memory requirements, so their memory designs need to be customized.

Infrastructure Building. The simulation infrastructure combines gem5’s CPU models for big and small cores, GPGPU-Sim for the GPU, accelerators, a shared memory interface, gem5’s cache model (with CACTI) for shared resources, and gem5’s DRAM model.

Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator. Aladdin takes unmodified C code plus accelerator design parameters (e.g., number of FUs, memory bandwidth) and produces power, area, and performance estimates for an accelerator-specific datapath with a private L1 or scratchpad, alongside shared memory/interconnect models. This supports designing accelerator-rich SoC fabrics and memory systems, and improves programmability. [ISCA’2014] http://vlsiarch.eecs.harvard.edu/accelerators
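For a flavor of Aladdin's input, here is the kind of unmodified C kernel such a pre-RTL flow works from. This vector add is an illustrative stand-in, not one of Aladdin's shipped benchmarks; the tool would explore how its loop-level parallelism maps onto a chosen FU count and memory bandwidth.

```c
/* Illustrative example of the kind of unmodified C kernel a pre-RTL
 * accelerator simulator like Aladdin takes as input (a stand-in, not
 * an actual Aladdin benchmark). */
#define N 8

void vadd(const int a[N], const int b[N], int out[N])
{
    /* Each iteration is independent; with k functional units, a
     * pre-RTL schedule would issue roughly k iterations per cycle,
     * subject to the memory bandwidth design parameter. */
    for (int i = 0; i < N; i++)
        out[i] = a[i] + b[i];
}
```

The key property is that the designer writes plain C; the design-space knobs (FUs, memory ports, bandwidth) live outside the source, which is what makes rapid pre-RTL exploration possible.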

Cache Customization: TLB Designs. TLBs can be expensive, both in performance (TLB misses) and in resources and power (the TLB hardware itself). But an accelerator’s TLB accesses are very likely to be regular.
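Why regularity helps: for a strided access pattern, the sequence of pages (and hence TLB entries) the accelerator will touch is computable before the loop runs, so the TLB can be sized, or translations prefetched, exactly. A small sketch of that idea (the 4 KB page size and the helper are assumptions for illustration, not a specific TLB design):

```c
/* For a regular strided pattern (base, base+stride, ..., n accesses),
 * count the distinct pages touched -- i.e., the translations needed.
 * Because the pattern is regular, this is known ahead of time, so a
 * small TLB plus simple translation prefetching can cover it. */
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size */

size_t pages_touched(size_t base, size_t stride, size_t n)
{
    size_t count = 0, last = (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t page = (base + i * stride) / PAGE_SIZE;
        if (page != last) {  /* pages are monotone for stride > 0 */
            count++;
            last = page;
        }
    }
    return count;
}
```

For example, a unit-stride scan of a 16 KB buffer needs only four translations, and each upcoming page is known one full page of accesses in advance, unlike a CPU's irregular pointer-chasing loads.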

Accelerator TLB Miss Behavior

Cache Customization: Cache Prefetcher Designs.

Inefficient Bulk Data Transfer. DMA is very efficient at moving bulk data, whereas a cache fetches data at cache-line granularity; customizing the cache prefetcher narrows the gap. Benchmark: kmp
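The granularity gap, in back-of-the-envelope numbers (a sketch; the 64 B line size and the prefetch-degree model are assumptions, not measurements from the deck): demand-fetching a buffer through a cache issues one miss per line, while DMA moves the whole buffer in a single bulk transfer. A prefetcher that runs d lines ahead divides the demand misses by roughly d, approaching DMA's behavior.

```c
/* Rough request-count model for the cache-vs-DMA granularity gap. */
#include <stddef.h>

#define LINE_BYTES 64u  /* assumed cache-line size */

/* Demand-fetching a buffer through a cache: one miss per line. */
size_t cache_line_requests(size_t bytes)
{
    return (bytes + LINE_BYTES - 1) / LINE_BYTES;
}

/* A prefetcher of degree d covers ~d lines per demand miss, so the
 * misses exposed to the accelerator shrink by roughly that factor;
 * DMA corresponds to the limit of one bulk request. */
size_t demand_misses_with_prefetch(size_t bytes, size_t degree)
{
    size_t lines = cache_line_requests(bytes);
    return (lines + degree - 1) / degree;
}
```

Under this model a 4 KB buffer costs 64 demand fetches through a plain cache but only 8 exposed misses with a degree-8 prefetcher, which is why an application-specific prefetcher can recover most of DMA's bulk-transfer efficiency for streaming workloads like kmp.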

Workloads have different memory behaviors. Benchmark: md-knn

Toward Cache-Friendly Hardware Accelerators. With more accelerators on SoCs, programming them will become challenging. A shared address space and caching make accelerators easier to program, and leveraging the application-specific nature of accelerators can reduce the overheads of caches.