Toward Cache-Friendly Hardware Accelerators

Toward Cache-Friendly Hardware Accelerators Yakun Sophia Shao, Sam Xi, Viji Srinivasan, Gu-Yeon Wei, David Brooks

Please do not distribute 4/17/2017 More accelerators: out-of-core accelerators now cover much of a modern mobile SoC die. [Die photo from Chipworks, www.anandtech.com/show/8562/chipworks-a8; accelerators annotated by Sophia Shao (Harvard); area estimates from Maltiel Consulting, http://www.maltiel-consulting.com/Next-Apple-iPhone-iPad-A-Processor.html]

Today’s SoC: the TI OMAP 4 integrates ARM cores, a GPU, and a DSP alongside fixed-function accelerators (SD, USB, audio, video, face detection, imaging) on a shared system bus, with DMA engines and scratchpad memories (SPMs) serving the accelerators.

Cache-Friendly Accelerator Interface. IBM’s Coherent Accelerator Processor Interface (CAPI) attaches accelerators to a POWER8 host over the PCIe bus and provides virtual addressing and data caching, enabling an easier, more natural programming model.
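To make the programming-model point concrete, here is a minimal sketch of what shared virtual addressing buys. `accel_scale` is a hypothetical accelerator kernel (an assumption for illustration, modeled as a plain C function, not a CAPI API): the host hands it an ordinary pointer, with no DMA copy-in/copy-out and no hand-managed physical addresses.

```c
/* Sketch of the programming model a coherent, virtually addressed
 * interface (like CAPI) enables. accel_scale() is a hypothetical
 * accelerator kernel, modeled here as a plain function: with a shared
 * virtual address space the host hands it an ordinary pointer. */
#include <stddef.h>

/* Hypothetical accelerator kernel: scales a vector in place.
 * With coherent caching, the result is visible to the CPU without
 * explicit flushes or copy-back. */
void accel_scale(float *data, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= k;
}
```

Without a shared address space, the same call would need an explicit DMA copy into the accelerator's scratchpad and a copy back, which is exactly the burden the slide's "easier, natural programming model" refers to.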

It’s the beginning, not the end.

Not one size fits all. Different applications have different memory requirements, so their memory designs need to be customized.

Infrastructure Building. The simulation infrastructure combines gem5’s CPU models for big and small cores, GPGPU-Sim for the GPU, accelerators, a shared memory interface, gem5’s cache model (with CACTI) for shared resources, and gem5’s DRAM model.

Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator. Aladdin takes unmodified C code plus accelerator design parameters (e.g., number of FUs, memory bandwidth) and produces power, area, and performance estimates for an accelerator-specific datapath with a private L1 or scratchpad, alongside shared memory/interconnect models. This supports designing accelerator-rich SoC fabrics and memory systems, and improves programmability. [ISCA’2014] http://vlsiarch.eecs.harvard.edu/accelerators
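For a flavor of Aladdin's input, here is the kind of unmodified C kernel such a pre-RTL flow works from. This vector add is an illustrative stand-in, not one of Aladdin's shipped benchmarks; the tool would explore how its loop-level parallelism maps onto a chosen FU count and memory bandwidth.

```c
/* Illustrative example of the kind of unmodified C kernel a pre-RTL
 * accelerator simulator like Aladdin takes as input (a stand-in, not
 * an actual Aladdin benchmark). */
#define N 8

void vadd(const int a[N], const int b[N], int out[N])
{
    /* Each iteration is independent; with k functional units, a
     * pre-RTL schedule would issue roughly k iterations per cycle,
     * subject to the memory bandwidth design parameter. */
    for (int i = 0; i < N; i++)
        out[i] = a[i] + b[i];
}
```

The key property is that the designer writes plain C; the design-space knobs (FUs, memory ports, bandwidth) live outside the source, which is what makes rapid pre-RTL exploration possible.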

Cache Customization: TLB Designs. TLBs can be expensive, both in performance (TLB misses) and in resources and power (the TLB hardware itself). But an accelerator’s TLB accesses are very likely to be regular.
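Why regularity helps: for a strided access pattern, the sequence of pages (and hence TLB entries) the accelerator will touch is computable before the loop runs, so the TLB can be sized, or translations prefetched, exactly. A small sketch of that idea (the 4 KB page size and the helper are assumptions for illustration, not a specific TLB design):

```c
/* For a regular strided pattern (base, base+stride, ..., n accesses),
 * count the distinct pages touched -- i.e., the translations needed.
 * Because the pattern is regular, this is known ahead of time, so a
 * small TLB plus simple translation prefetching can cover it. */
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size */

size_t pages_touched(size_t base, size_t stride, size_t n)
{
    size_t count = 0, last = (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t page = (base + i * stride) / PAGE_SIZE;
        if (page != last) {  /* pages are monotone for stride > 0 */
            count++;
            last = page;
        }
    }
    return count;
}
```

For example, a unit-stride scan of a 16 KB buffer needs only four translations, and each upcoming page is known one full page of accesses in advance, unlike a CPU's irregular pointer-chasing loads.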

Accelerator TLB Miss Behavior

Cache Customization: Cache Prefetcher Designs.

Inefficient Bulk Data Transfer. DMA is very efficient at moving bulk data, whereas a cache fetches data at cache-line granularity; customizing the cache prefetcher narrows the gap. Benchmark: kmp
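The granularity gap, in back-of-the-envelope numbers (a sketch; the 64 B line size and the prefetch-degree model are assumptions, not measurements from the deck): demand-fetching a buffer through a cache issues one miss per line, while DMA moves the whole buffer in a single bulk transfer. A prefetcher that runs d lines ahead divides the demand misses by roughly d, approaching DMA's behavior.

```c
/* Rough request-count model for the cache-vs-DMA granularity gap. */
#include <stddef.h>

#define LINE_BYTES 64u  /* assumed cache-line size */

/* Demand-fetching a buffer through a cache: one miss per line. */
size_t cache_line_requests(size_t bytes)
{
    return (bytes + LINE_BYTES - 1) / LINE_BYTES;
}

/* A prefetcher of degree d covers ~d lines per demand miss, so the
 * misses exposed to the accelerator shrink by roughly that factor;
 * DMA corresponds to the limit of one bulk request. */
size_t demand_misses_with_prefetch(size_t bytes, size_t degree)
{
    size_t lines = cache_line_requests(bytes);
    return (lines + degree - 1) / degree;
}
```

Under this model a 4 KB buffer costs 64 demand fetches through a plain cache but only 8 exposed misses with a degree-8 prefetcher, which is why an application-specific prefetcher can recover most of DMA's bulk-transfer efficiency for streaming workloads like kmp.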

Workloads have different memory behaviors. Benchmark: md-knn

Toward Cache-Friendly Hardware Accelerators. With more accelerators on SoCs, programming them will become challenging. A shared address space and caching make accelerators easier to program, and leveraging the application-specific nature of accelerators can reduce the overheads of caches.