MacSim Tutorial (In ISCA-39, 2012)



Front-end: thread fetch policies, branch predictor. Memory system: software and hardware prefetchers, cache studies (sharing, inclusion), DRAM scheduling, interconnection studies. Misc.: power model. 2/8 MacSim Tutorial (In ISCA-39, 2012)

Memory system: a trace generator (PIN, GPUOcelot) feeds the frontend and the hardware prefetcher in MacSim. Software prefetch instructions: PTX (prefetch, prefetchu), x86 (prefetcht0, prefetcht1, prefetchnta). Hardware prefetch requests: stream, stride, GHB, ... Related work: Many-Thread Aware Prefetching Mechanisms [Lee et al., MICRO-43, 2010]; When Prefetching Works, When It Doesn't, and Why [Lee et al., ACM TACO, 2012]. 3/8
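To make the hardware-prefetcher side concrete, here is a minimal per-PC stride prefetcher sketch. This is not MacSim's actual implementation; the class, table layout, and confidence thresholds are illustrative assumptions.

```python
class StridePrefetcher:
    """Minimal per-PC stride detector (illustrative, not MacSim's code)."""

    def __init__(self, degree=2):
        # pc -> (last address, last observed stride, confidence counter)
        self.table = {}
        self.degree = degree  # how many lines ahead to prefetch

    def access(self, pc, addr):
        """Train on a demand access; return addresses to prefetch, if any."""
        last, stride, conf = self.table.get(pc, (addr, 0, 0))
        new_stride = addr - last
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)  # same nonzero stride seen again
        else:
            conf = 0  # stride changed: retrain
        self.table[pc] = (addr, new_stride, conf)
        if conf >= 2:  # confident: issue prefetches along the stride
            return [addr + new_stride * i for i in range(1, self.degree + 1)]
        return []
```

After three accesses with the same stride, the prefetcher issues `degree` requests ahead of the current address; a GHB-based prefetcher would generalize this by recording the global access history instead of one stride per PC.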

Cache studies: sharing, inclusion property; on-chip interconnection studies. [Figure: private caches connected through the interconnection vs. a shared cache behind the interconnection.] TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]. 4/8

Heterogeneous link configuration: a ring network connecting CPU cores (C), GPU cores (G), L3 slices, and memory controllers (M), compared across different topologies. [Figure: alternative placements of C, G, M, and L3 nodes on the ring.] On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al., under review]. 5/8

Execution: the trace generator (GPUOcelot) feeds the frontend and the DRAM model. Fetch policies: RR, ICOUNT, FAIR, LRF, ... DRAM scheduling policies: FCFS, FRFCFS, FAIR, ... Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]. 6/8
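The two fetch policies named first can be sketched in a few lines. This is a simplified model with assumed inputs (a ready mask for RR, a per-thread in-flight instruction count for ICOUNT), not the simulator's interface.

```python
def select_thread_rr(ready, last):
    """Round-robin: pick the next ready thread after the last one fetched.

    ready: list of booleans, one per hardware thread; last: index fetched
    previously. Returns a thread index, or None if no thread is ready.
    """
    n = len(ready)
    for i in range(1, n + 1):
        t = (last + i) % n
        if ready[t]:
            return t
    return None


def select_thread_icount(in_flight):
    """ICOUNT: fetch from the thread with the fewest in-flight instructions.

    in_flight: dict mapping thread id -> number of instructions in the
    pipeline. Favoring the least-occupied thread keeps fast threads fed.
    """
    return min(in_flight, key=in_flight.get)
```

FAIR and LRF (least recently fetched) follow the same shape with a different ranking function over the candidate threads.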

DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al., IEEE CAL, 2011]. Each core has per-warp request queues (W0-W3) at the DRAM controller holding row hits (RH) and row misses (RM). The potential of the requests from Core-0 is |W0|^a + |W1|^a + |W2|^a + |W3|^a = 4^a + 3^a + 5^a, with a < 1. The reduction in potential if a row hit from a queue of length L is serviced next is L^a - (L - 1)^a; if a row miss from a queue of length L is serviced next, it is L^a - (L - 1/m)^a, where m = (cost of servicing a row miss) / (cost of servicing a row hit). Scheduling: since Tolerance(Core-0) < Tolerance(Core-1), select Core-0; servicing a row hit from W1 (of Core-0) results in the greatest reduction in potential, so row hits from W1 are serviced next. 7/8
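The potential-function arithmetic above can be checked with a short sketch. Function names and the choice a = 0.5 are illustrative assumptions; the formulas follow the slide.

```python
def potential(queue_lengths, a=0.5):
    """Potential of a core's requests: sum of |W_i|^a over its queues (a < 1)."""
    return sum(length ** a for length in queue_lengths)


def reduction_row_hit(L, a=0.5):
    """Drop in potential when a row hit is serviced from a queue of length L."""
    return L ** a - (L - 1) ** a


def reduction_row_miss(L, m, a=0.5):
    """Drop in potential when a row miss is serviced from a queue of length L.

    m is the cost ratio (row-miss service time / row-hit service time), so a
    row miss only retires 1/m of a request per row-hit-equivalent time slot.
    """
    return L ** a - (L - 1 / m) ** a
```

Because a < 1 makes the potential concave, removing a request from a shorter queue reduces potential more, and a row hit always beats a row miss from the same queue; the scheduler simply services whichever candidate maximizes the reduction.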

Power modeling: verifying the simulator against a GTX580, modeling x86 CPU power, and modeling GPU power. Still ongoing research. 8/8

2012 ~ 2013 roadmap: power/energy model, ARM architecture, mobile platform, OpenGL programs.