A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines
Chao Mei, 05/02/2008, The 6th Charm++ Workshop


Motivation
Clusters are built from multicore chips:
- 4 cores/node on BG/P
- 8 cores/node on Abe (2 Intel quad-core chips)
- 16 cores/node on Ranger (4 AMD quad-core chips)
- ...
Charm++ has had an SMP build version for many years, but it has not been tuned. So, what are the issues in getting high performance?

Start with a kNeighbor benchmark
A synthetic kNeighbor benchmark (see the sketch below):
- Each element communicates with its neighbors at stride K (with wrap-around), and the neighbors send back an acknowledgement.
- An iteration: all elements finish the above communication.
Environment:
- An SMP node with 2 Xeon quad-cores, using only 7 of the 8 cores
- Ubuntu 7.04; gcc 4.2
- Charm: net-linux-amd64-smp vs. net-linux-amd64
- 1 element/core, K=3
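A minimal sketch of the communication pattern, in plain C++ rather than the actual Charm++ benchmark source. The indexing below is one plausible reading of "neighbors in K-stride with wrap-around"; all names are illustrative.

#include <vector>

// Hypothetical sketch of the kNeighbor pattern (not the benchmark source).
// Reading assumed here: element i sends a message to the K elements at
// strides 1..K ahead of it (wrapping around); each receiver replies with
// an acknowledgement, and an iteration ends once every element has
// received all of its acknowledgements.
std::vector<int> kNeighborTargets(int myIndex, int numElements, int K) {
    std::vector<int> targets;
    for (int s = 1; s <= K; ++s)
        targets.push_back((myIndex + s) % numElements);  // wrap-around
    return targets;
}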

Performance at first glance

Outline
- Examine the communication model in Charm++ for the non-SMP and SMP layers
- Describe the current optimizations for SMP step by step
- Discuss a different approach to utilizing multicore
- Conclude with future work

Communication model for the multicore

Possible overheads in the SMP version
Locks:
- Overuse of locks to ensure correctness
- Locks in message queues
- ...
False sharing:
- Some per-thread data structures are allocated together in array form: e.g., each element of "CmiState state[numThds]" belongs to a different thread (see the sketch below).
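A minimal sketch of the false-sharing pattern described above and one common fix. The CmiState fields, the thread count, and the 64-byte cache-line size are assumptions for illustration, not the actual Charm++ definitions.

// Per-thread states packed contiguously: two threads' states can sit on
// the same cache line, so a write by one thread invalidates the line in
// every other thread's cache.
struct CmiState {            // illustrative fields only
    int   rank;
    void *localQueue;
};
CmiState state[8];           // state[i] belongs to thread i -> false sharing

// One common fix: pad/align each per-thread state to a full cache line
// (assuming 64-byte lines; alignas is C++11, older compilers use
// attributes such as __attribute__((aligned(64)))).
struct alignas(64) PaddedCmiState {
    CmiState s;
};
PaddedCmiState paddedState[8];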

Reducing the use of locks
Examining the source code revealed overuse of locks:
- Narrow the sections enclosed by locks (see the sketch below).
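As an illustration of "narrower sections enclosed by locks", the sketch below moves work that touches only local data out of the critical section. The queue type and function names are illustrative, not the actual Charm++ source.

#include <deque>
#include <mutex>

std::deque<int> sharedQueue;   // stands in for a shared message queue
std::mutex      queueLock;

int prepareMessage(int payload) { return payload * 2; }  // thread-local work

// Before: the lock also covers the thread-local preparation.
void enqueueBroad(int payload) {
    std::lock_guard<std::mutex> guard(queueLock);
    int msg = prepareMessage(payload);
    sharedQueue.push_back(msg);
}

// After: only the shared-queue update is protected.
void enqueueNarrow(int payload) {
    int msg = prepareMessage(payload);
    std::lock_guard<std::mutex> guard(queueLock);
    sharedQueue.push_back(msg);
}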

Overhead in message queues
A micro-benchmark to show the overhead in message queues:
- N producers, 1 consumer
- Lock vs. memory fence + atomic operation (fetch-and-increment)
- 1 queue vs. N queues
1. Each producer produces 10K items per iteration.
2. One iteration: the consumer consumes all items.
A sketch of the lock-free variant follows below.
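A minimal sketch of the lock-free variant: producers claim a slot with an atomic fetch-and-increment and publish it with a release store (the "memory fence"); the single consumer drains the slots with acquire loads. The class, capacity handling, and names are assumptions for illustration; the real Charm++ queues differ in detail.

#include <atomic>
#include <cstddef>
#include <vector>

struct Slot {
    std::atomic<bool> ready{false};
    int               item{0};
};

class MpscArrayQueue {                       // N producers, 1 consumer
public:
    // Sized for one iteration (e.g. numProducers * 10000); a real queue
    // would recycle slots instead of using a fresh array per iteration.
    explicit MpscArrayQueue(std::size_t capacity) : slots_(capacity) {}

    void push(int item) {                    // callable from any producer
        std::size_t idx = tail_.fetch_add(1, std::memory_order_relaxed);
        slots_[idx].item = item;                                  // fill slot
        slots_[idx].ready.store(true, std::memory_order_release); // publish
    }

    std::size_t drain(std::vector<int> &out) {   // single consumer only
        std::size_t consumed = 0;
        while (head_ < slots_.size() &&
               slots_[head_].ready.load(std::memory_order_acquire)) {
            out.push_back(slots_[head_].item);
            ++head_;
            ++consumed;
        }
        return consumed;
    }

private:
    std::vector<Slot>        slots_;
    std::atomic<std::size_t> tail_{0};
    std::size_t              head_ = 0;      // consumer-private cursor
};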

Applying multiple queues + fence
Less than 2% improvement:
- Much less contention than in the micro-benchmark

Big overhead in message allocation
We noticed that we were using our own default memory module:
- Every memory allocation is protected by a lock (sketched below).
- It provides some useful functionality in the Charm++ system (a historical reason for not using other memory modules): memory footprint information, the memory debugger, and Isomalloc.
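A minimal sketch of the pattern described above: one global lock serializes every allocation so that the bookkeeping behind footprint tracking, the memory debugger, and Isomalloc stays consistent, which turns message allocation into a bottleneck once several worker threads allocate concurrently. Names are illustrative, not the actual Charm++ memory module.

#include <cstdlib>
#include <mutex>

static std::mutex memLock;                       // one lock for the whole allocator

void *cmiAllocSketch(std::size_t nBytes) {
    std::lock_guard<std::mutex> guard(memLock);  // every thread contends here
    return std::malloc(nBytes);                  // plus bookkeeping in the real module
}

void cmiFreeSketch(void *p) {
    std::lock_guard<std::mutex> guard(memLock);
    std::free(p);
}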

Switching to the OS memory module
Thanks to recent updates, we do not lose the aforementioned functionality.

Identifying false-sharing overhead
Another micro-benchmark:
- Each element repeatedly sends itself a message, reusing the message each time (i.e., not allocating a new message).
- Benchmark the timing of 1000 iterations.
Use the Intel VTune performance analysis tool:
- Focus on the cache misses caused by "Invalidate" in the MESI coherence protocol.
Declaring variables with the "__thread" specifier makes them thread-private (see the sketch below).
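A minimal sketch of the "__thread" fix (GCC/ICC thread-local storage); the counter structure is illustrative, not the actual Charm++ per-thread state.

// Before: per-thread counters packed into one shared array and indexed by
// the thread's rank, so neighbouring threads' counters share cache lines:
//     CounterState counters[NUM_THREADS];
// After: one statically allocated, private copy per thread.
struct CounterState {
    long messagesHandled;
    long bytesReceived;
};
static __thread CounterState myCounters;      // zero-initialized, thread-private

void onMessage(long bytes) {
    myCounters.messagesHandled += 1;          // no cross-thread invalidations
    myCounters.bytesReceived   += bytes;
}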

Performance of the micro-benchmark
Parameters: 1 element/core, 7 cores
Before: µs per iteration
After: µs per iteration

Adding the gains from removing false sharing
Around 1% improvement

Rethinking the communication model
POSIX shared-memory layer (sketched below):
- No threads; every core still runs a process.
- Inter-core message passing does not go through the NIC but through memory copies (inter-process communication).
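A minimal sketch of the mechanism behind such a layer: a named POSIX shared-memory segment mapped by every process on the node (shm_open + ftruncate + mmap). The segment name, size handling, and error handling are illustrative; the real Charm++ layer adds per-node naming, message queues, and locking on top.

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Link with -lrt on Linux for shm_open.
void *mapSharedRegion(const char *name, std::size_t size, bool creator) {
    int fd = shm_open(name, creator ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) return nullptr;
    if (creator && ftruncate(fd, size) != 0) { close(fd); return nullptr; }
    void *base = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);     // visible to all mapping processes
    close(fd);                                // the mapping stays valid
    return base == MAP_FAILED ? nullptr : base;
}

// Usage sketch: one process on the node creates, say, "/charm_node_shm";
// the others open the same name, and inter-core messages are copied
// into the mapped region instead of going through the NIC.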

Performance comparison

Future work
- Other platforms: BG/P
- Optimize the POSIX shared-memory version
- Effects on real applications: for NAMD, initial results show that SMP helps up to 24 nodes on Abe
- Other communication models: an adaptive one?