SYNCHRONIZATION USING REMOTE-SCOPE PROMOTION MARC S. ORR †§, SHUAI CHE §, AYSE YILMAZER §, BRADFORD M. BECKMANN §, MARK D. HILL †§, DAVID A. WOOD †§

Presentation transcript:

SYNCHRONIZATION USING REMOTE-SCOPE PROMOTION
MARC S. ORR †§, SHUAI CHE §, AYSE YILMAZER §, BRADFORD M. BECKMANN §, MARK D. HILL †§, DAVID A. WOOD †§
† UW-MADISON, § AMD RESEARCH
ASPLOS, MARCH 16, 2015

EXECUTIVE SUMMARY
Heterogeneous chips, like GPUs, have hierarchical memories
• All Global Synchronization
• Scoped Synchronization (7% Speedup)
• Work Stealing (18% Speedup)
• Best of Both? NEW: Remote-Scope Promotion (25% Speedup)

OUTLINE
• Background: Synchronization + Scopes
• Synchronization using Remote-Scope Promotion
• Results/Conclusion

BACKGROUND: SYNCHRONIZATION + SCOPES
• Parallel synchronization semantics
  ‒ acquire: pull latest data (to me)
  ‒ release: push latest data (to others)
• Scopes bound synchronization:
  ‒ Smaller scope → less synchronization overhead

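ACQUIRE/RELEASE ANIMATION

    void incX_workgroup() {
        while (!CAS_acq_wg(&L, 0, 1));   // spin until the lock L is taken (acquire, wg scope)
        X = X + 1;
        st_rel_wg(&L, 0);                // unlock (release, wg scope)
    }

    void incX_component() {
        while (!CAS_acq_cmp(&L, 0, 1));  // spin until the lock L is taken (acquire, component scope)
        X = X + 1;
        st_rel_cmp(&L, 0);               // unlock (release, component scope)
    }

[Animation: CU0 and CU1 each have an L1 cache (wg scope0 and wg scope1) above a shared L2 (component scope). The frames step through the values of X and L as the lock is taken (L = 1), X is incremented (X = 1 through X = 4), and the lock is released (L = 0).]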
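The effect of the two scopes in the animation can be sketched with a toy cache model. This is a simplified illustration, not the hardware or the authors' simulator: the class ToyGpuMemory, the function inc_x, and the rule that a component-scope acquire drops clean L1 copies while a component-scope release writes dirty data back to L2 are all assumptions made for the sketch, and the lock L itself is not modeled (calls run one at a time).

```python
class ToyGpuMemory:
    """Hypothetical model: one write-back L1 per compute unit (CU) over a shared L2."""

    def __init__(self, num_cus):
        self.l2 = {"X": 0}
        self.l1 = [dict() for _ in range(num_cus)]    # cached copies per CU
        self.dirty = [set() for _ in range(num_cus)]  # written in L1, not yet in L2

    def load(self, cu, addr):
        if addr not in self.l1[cu]:
            self.l1[cu][addr] = self.l2[addr]         # miss: fetch from L2
        return self.l1[cu][addr]

    def store(self, cu, addr, value):
        self.l1[cu][addr] = value
        self.dirty[cu].add(addr)

    def acquire(self, cu, scope):
        # wg scope: the data is already in this CU's L1, so nothing to do.
        # cmp scope: drop clean L1 copies so later loads refetch from L2.
        if scope == "cmp":
            self.l1[cu] = {a: v for a, v in self.l1[cu].items()
                           if a in self.dirty[cu]}

    def release(self, cu, scope):
        # wg scope: leave updates in L1. cmp scope: write dirty data back to L2.
        if scope == "cmp":
            for addr in self.dirty[cu]:
                self.l2[addr] = self.l1[cu][addr]
            self.dirty[cu].clear()


def inc_x(mem, cu, scope):
    mem.acquire(cu, scope)                      # stands in for CAS_acq_<scope>(&L, 0, 1)
    mem.store(cu, "X", mem.load(cu, "X") + 1)   # X = X + 1
    mem.release(cu, scope)                      # stands in for st_rel_<scope>(&L, 0)


# Component scope: increments from different CUs meet in L2.
shared = ToyGpuMemory(2)
inc_x(shared, 0, "cmp")
inc_x(shared, 1, "cmp")

# wg scope: cheap, but CU1 never sees CU0's updates.
local = ToyGpuMemory(2)
inc_x(local, 0, "wg")
inc_x(local, 0, "wg")
inc_x(local, 1, "wg")
```

In this model the wg-scope path never touches L2, which is where the lower overhead comes from, and also why it is only safe when all sharers sit behind the same L1.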

SCOPED SYNCHRONIZATION’S STRENGTHS
• Static local sharing
  [Figure: within component scope, wg_scope0 holds data 0 and wg_scope1 holds data 1.]
• Dynamic global sharing
  [Figure: wg scope0 and wg scope1 both access a global data store.]
On current hardware, wg scope can yield >20% speedup over cmp scope

SCOPED SYNCHRONIZATION’S LIMITATIONS
• Dynamic local sharing: some threads access shared data less frequently than others, in an ad-hoc manner
• Example: work stealing
[Figure: wg scope0 enqueues to and dequeues from queue 0, and wg scope1 has queue 1. The copy of queue 0 at component scope is stale when another work-group dequeues from it.]
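A minimal sketch of why work stealing breaks under wg-scope synchronization, assuming a write-back L1 whose contents reach L2 only on a component-scope release. The function run_steal and the task names are hypothetical and exist only for this illustration.

```python
def run_steal(owner_release_scope):
    """Owner (wg 0) dequeues from queue 0, then a thief (wg 1) steals from it."""
    l2_queue = ["t0", "t1", "t2"]      # queue 0 as last written back to L2
    owner_l1_queue = list(l2_queue)    # wg 0's cached copy of queue 0

    # Owner dequeues a task and releases at the given scope.
    taken_by_owner = owner_l1_queue.pop(0)
    if owner_release_scope == "cmp":
        l2_queue = list(owner_l1_queue)  # component-scope release updates L2
    # A wg-scope release leaves the update in the owner's L1 only.

    # Thief acquires at component scope and reads queue 0 from L2.
    taken_by_thief = l2_queue.pop(0)
    return taken_by_owner, taken_by_thief


stale = run_steal("wg")    # thief reads the stale queue and repeats the owner's task
fresh = run_steal("cmp")   # correct, but the owner pays for global sync every time
```

The owner is the frequent user of its queue, so forcing it to component scope on every operation gives up the benefit of scopes just to cover the rare steal.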

OUTLINE
• Background: Synchronization + Scopes
• Synchronization using Remote-Scope Promotion
• Results/Conclusion

REMOTE-SCOPE PROMOTION
• Insight: wg1 needs to trigger the promotion of scope 0
• Contribution: hardware support for scope promotion & ISA instructions that utilize it
[Figure: wg_scope 1 sends "promote" to wg_scope0, which flushes queue 0 to component scope, replacing the stale copy before the dequeue from queue 0.]

| SYNCHRONIZATION USING REMOTE-SCOPE PROMOTION | MARCH 16,  Prior memory models: HRF-direct, HRF-indirect ‒Invariant: acquire/release pair must occur at the same scope  Three new memory orders: st_rel_cmp(L, 0) PROMOTION SEMANTIC st(V,2) st_rel_wg(L, 0) cas_acq_wg(&L, 0, 1) ld(R1, V) work-item 0 (in wg 0)work-item 1 (in wg 1) OK cas_acq_cmp(&L, 0, 1) RACE! cas_rm_acq_cmp(&L, 0, 1) OK synchronizes-with relationship promotion remoteAcquirePromote the scope of last release to the scope of this acquire, then perform acquire remoteReleasePromote the scope of next acquire to the scope of this release, then perform release remoteAcquire+Releasecombine remote acquire & remote release st_rel_wg(L, 0)

IMPLEMENTATION
• remote_acq_cmp(L)
  1. Promote the scope of the last release on L
  2. Perform an acquire operation on L
• remote_rel_cmp(L)
  1. Perform a release operation on L
  2. Promote the scope of the next acquire on L
[Figure: CU0, CU1, and CU2, each with an L1 cache, above a shared L2 holding L and V. "promote" and FLUSH arrows show the newer value V = 3 replacing V = 2, with L changing between 0 and 1.]
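The two step lists above can be sketched on a toy cache model. Everything here is an assumption made for illustration: the class ToyCaches, the use of an L1 flush to stand for "promote the last release" and an L1 invalidate to stand for "promote the next acquire", and the explicit owner argument (the slide does not say how the hardware finds the L1 to act on).

```python
class ToyCaches:
    """Hypothetical model: per-CU write-back L1 caches over a shared L2."""

    def __init__(self, num_cus, l2):
        self.l2 = dict(l2)
        self.l1 = [dict() for _ in range(num_cus)]
        self.dirty = [set() for _ in range(num_cus)]

    def load(self, cu, addr):
        if addr not in self.l1[cu]:
            self.l1[cu][addr] = self.l2[addr]
        return self.l1[cu][addr]

    def store(self, cu, addr, value):
        self.l1[cu][addr] = value
        self.dirty[cu].add(addr)

    def flush(self, cu):
        # Write this CU's dirty lines back to L2.
        for addr in self.dirty[cu]:
            self.l2[addr] = self.l1[cu][addr]
        self.dirty[cu].clear()

    def invalidate(self, cu):
        # Drop clean copies so later loads refetch from L2.
        self.l1[cu] = {a: v for a, v in self.l1[cu].items() if a in self.dirty[cu]}

    def remote_acq_cmp(self, cu, owner):
        self.flush(owner)       # 1. promote the scope of the last release on L
        self.invalidate(cu)     # 2. perform an acquire operation on L

    def remote_rel_cmp(self, cu, owner):
        self.flush(cu)          # 1. perform a release operation on L
        self.invalidate(owner)  # 2. promote the scope of the next acquire on L


mem = ToyCaches(3, {"V": 2})
mem.store(0, "V", 3)              # CU0 updates V and releases at wg scope (stays in its L1)
mem.remote_acq_cmp(1, owner=0)    # CU1 steals: CU0's L1 is flushed first
seen_by_thief = mem.load(1, "V")
mem.store(1, "V", 4)
mem.remote_rel_cmp(1, owner=0)    # CU1 publishes and invalidates CU0's stale copy
seen_by_owner = mem.load(0, "V")  # CU0's next wg-scope access refetches from L2
```

The design point this illustrates is that the cost of crossing scopes moves to the remote operation, while the owner's own wg-scope operations stay unchanged.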

IMPLEMENTATION DETAILS
• Hardware Support
  ‒ Sending/receiving sub-operations between CUs
  ‒ Cache line locking to resolve races
• Guarantee “coherence order” for read-modify-writes
  ‒ Hardware support to stall new synchronization operations at target scope
• Paper formalizes scope promotion
  ‒ Shows that scope promotion is compatible with coherence order

OUTLINE
• Background: Synchronization + Scopes
• Synchronization using Remote-Scope Promotion
• Results/Conclusion

METHODOLOGY
• Prototyped remote scoped synchronization in gem5
  ‒ Extended with internal GPU model
• Refactored 3 Pannotia workloads to retrieve graph nodes from task queues
  ‒ SSSP, Color, PageRank (each run with 3-4 inputs)

RESULTS
scenario     Scope of sync.?   Work stealing?   Speedup
baseline     global            no               (reference)
scope-only   local             no               1.07x
steal-only   global            yes              1.18x
rem-sync     local             yes              1.25x

CONCLUSION
• All Global Synchronization
• Scoped Synchronization (7% Speedup)
• Work Stealing (18% Speedup)
• Best of Both! NEW: Remote-Scope Promotion (25% Speedup)

Questions?

Backup

µBENCHMARK RESULTS
• Scopes matter! Small tasks benefit from scopes

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.