
Presented by David Cravey 10/15/2011

About Me – David Cravey
- Started programming in 4th grade: learned BASIC on a V-Tech "Precomputer 1000", then GW-BASIC, and eventually QuickBASIC
- Got bored with BASIC in 8th grade, so moved to C++
- Software Development Manager at Vivicom
- President of the Houston C++ User Group (meets at Microsoft's Houston office, 1st Thursday of each month at 7 PM)
- Microsoft Visual C++ MVP

Agenda
- Why C++?
- Concurrency Runtime (ConcRT)
- Tasks
- PPL
- Agents
- GPGPU / AMP
- Resources
- Summary

The language of power!

Why C++? C++ provides:
- Speed: down-to-the-metal performance!
- Access to the latest hardware and drivers (example: GPGPU)
- Multi-paradigm programming: procedural, object-oriented, generic
- High-level programming (i.e. strong abstractions): classes AND templates, but it still allows you to step down to low level as needed!
- Portable code?

Modern C++: Clean, Safe, Fast *Used with permission from Herb Sutter's "Writing modern C++ code: how C++ has evolved over the years"

Automatic Memory Management
Never type "delete" again!
- unique_ptr
- shared_ptr
- weak_ptr

What's Different: At a Glance

Then:
circle* p = new circle( 42 );
vector<shape*> vw = load_shapes();
for( vector<shape*>::iterator i = vw.begin(); i != vw.end(); ++i ) {
    if( *i && **i == *p )
        cout << **i << " is a match\n";
}
for( vector<shape*>::iterator i = vw.begin(); i != vw.end(); ++i ) {
    delete *i;
}
delete p;

Now:
auto p = make_shared<circle>( 42 );
vector<shared_ptr<shape>> vw = load_shapes();
for_each( begin(vw), end(vw), [&]( shared_ptr<shape>& s ) {
    if( s && *s == *p )
        cout << *s << " is a match\n";
} );

Key changes:
- T* becomes shared_ptr<T>; new becomes make_shared
- No need for "delete": automatic lifetime management, exception-safe (the "Then" version is not exception-safe without try/catch or __try/__finally)
- for/while/do loops become std:: algorithms
- [&] lambda functions; auto type deduction

*Used with permission from Herb Sutter's "Writing modern C++ code: how C++ has evolved over the years"

Because processors will keep getting more cores … but not very many more GHz!

Why Concurrency? You can deal with problems faster if you have more threads (or “light sabers”)!!! My HERO!

Why A Concurrency Runtime? According to MSDN: a runtime for concurrency provides uniformity and predictability to applications and application components that run simultaneously. (i.e. without a single concurrency runtime, various libraries and routines end up "competing" for processor resources instead of "cooperating".)

Without a Concurrency Runtime OUCH! Threads will compete for system resources and the program will run slower instead of faster!!!!

With a Concurrency Runtime Success! Threads will cooperate to make maximum use of system resources and the program will run faster!!!!

What does ConcRT Provide?
- Improved use of processing resources: cooperative task scheduling, cooperative blocking, work-stealing task queues
- Low-level building blocks: synchronization primitives, task schedulers, resource managers
- 2 high-level libraries: PPL (Parallel Patterns Library) and Agents (Asynchronous Agents Library)
- Concurrent container and message-passing libraries

ConcRT Architecture Diagram (diagram taken from MSDN)

ConcRT Tasks
- The basic building block for concurrency under ConcRT
- A Task is a unit of work that performs a specific job
- Tasks can be further broken down into more fine-grained tasks (fork and join on "child" tasks)
- Tasks are kind of like very lightweight threads: threads normally reserve 1 MB of memory for their stacks, and thread context switches eat processing time, reducing throughput

Work Stealing
When a running task creates additional tasks, it adds them to the bottom of the queue for the current processor. If another processor does not have any tasks in its queue, it will steal a task from the top of another processor's queue (the top of the queue is the least likely to still be in the other processor's cache).

Synchronization Data Structures
- Concurrency::critical_section: cooperative mutual exclusion object (yields to other tasks instead of preempting)
- Concurrency::reader_writer_lock: only allows a single writer; allows multiple readers if there are no writers
- Concurrency::scoped_lock and Concurrency::scoped_read_lock: RAII locking for critical_section and reader_writer_lock
- Concurrency::event: allows tasks to signal each other that an event has occurred

Potential Concurrency
Potential concurrency is the concurrency your application could have if the computer could utilize it. Tasks are lightweight, so they are "cheap" to create. This allows you to create many tasks to express the potential concurrency of your program. In other words, expressing the potential concurrency of your application future-proofs your application!

Parallel Patterns Library Overview
- Task parallelism: tasks and task groups (Concurrency::task_group, Concurrency::structured_task_group)
- Parallel algorithms: Concurrency::parallel_for, Concurrency::parallel_for_each, Concurrency::parallel_invoke
- Parallel containers and objects: Concurrency::concurrent_vector, Concurrency::concurrent_queue, Concurrency::combinable

PPL Task Groups
Tasks are grouped by the task group they are created within, and tasks are cancelled as a group. This is useful for operations such as search, where once the item searched for is found, all tasks that are still searching should be cancelled. Note that if a task group is cancelled while waiting on another task group to complete, the waiting task group will also be cancelled.

PPL Algorithms Today
- Concurrency::parallel_for: performs parallel tasks using iteration values (much like a normal for loop)
- Concurrency::parallel_for_each: performs parallel tasks for each item in an iterator range (much like std::for_each)
- Concurrency::parallel_invoke: executes a set of tasks in parallel
PPL algorithms do not return until all the tasks within them complete or are cancelled.

ConcRT Extras and Sample Pack
Microsoft has released the ConcRT Extras and Sample Pack to give early access to new enhancements to ConcRT before the next version of Visual C++. The ConcRT Extras and Sample Pack can be downloaded at:
These are template libraries, so you only need to include the header files. Microsoft has stated that it encourages users not only to use the libraries, but to modify them to learn more.

Upcoming PPL Algorithms
Currently available as part of the ConcRT Sample Pack:
- Concurrency::parallel_transform
- Concurrency::parallel_reduce
- Concurrency::parallel_sort
- Concurrency::parallel_buffered_sort
- Concurrency::parallel_radixsort
- Parallel partitioners
These have been announced to be part of vNext: cing-the-ppl-agents-and-concrt-efforts-for-v-next.aspx

PPL Containers and Objects
- Concurrency::concurrent_vector: provides concurrency-safe random access, element access, iterator access/traversal, and append; does not provide deletion of elements
- Concurrency::concurrent_queue: provides concurrency-safe enqueue and dequeue operations
- Concurrency::combinable: reusable thread-local storage; allows associative operations to be combined at the end of a parallel_for, parallel_for_each, etc.

Upcoming PPL Containers
Currently available as part of the ConcRT Sample Pack:
- concurrent_unordered_map
- concurrent_unordered_multimap
- concurrent_unordered_set
- concurrent_unordered_multiset
Like the new algorithms, these containers have been announced to be part of vNext: cing-the-ppl-agents-and-concrt-efforts-for-v-next.aspx

When To Use PPL
Use PPL when you have reasonably large tasks that can be processed in parallel. This often requires that you change your algorithm to be parallelizable (for example, by using combinable). It is easy to change your existing code to use PPL to accomplish: parallel sorts; parallel sums/counts/averages (use combinable); parallel map/reduce.

PPL Best Practices (from MSDN)
- Do not parallelize small loop bodies
- Express parallelism at the highest possible level
- Use parallel_invoke to solve divide-and-conquer problems
- Use cancellation or exception handling to break from a parallel loop
- Understand how cancellation and exception handling affect object destruction
- Do not block repeatedly in a parallel loop
- Do not perform blocking operations when you cancel parallel work
- Do not write to shared data in a parallel loop
- When possible, avoid false sharing
- Make sure that variables are valid throughout the lifetime of a task

Using the PPL to parallelize loops

Asynchronous Agents Overview
According to MSDN: an asynchronous agent (or just agent) is an application component that works asynchronously with other agents to solve larger computing tasks.
Example pipeline: Read File From Disk → Decrypt Input Data → Decompress Input Data → Process File Data → Compress Output Data → Encrypt Output Data → Transmit Output Data

Agent Message Passing Programming Model
- Message-passing based "life cycle" pattern
- Asynchronous message blocks: Concurrency::unbounded_buffer, Concurrency::overwrite_buffer, Concurrency::single_assignment
- Message-passing functions: Concurrency::send, Concurrency::asend, Concurrency::receive, Concurrency::try_receive

Agent Message Passing Diagram (diagram taken from MSDN)

When to use Asynchronous Agents
Use agents when you have multiple processing steps that can work in parallel to process data as a pipeline (i.e. when you can arrange your code to work as an assembly line so that you can achieve parallelism). Examples: image processing; large calculations that build upon previous calculations.

Programming the GPU using AMP

The Power of Heterogeneous Computing
- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ionic placement for molecular dynamics simulation on GPU
- 19X: Transcoding HD video stream to H.264
- 17X: Simulation in Matlab using .mex file CUDA function
- 100X: Astrophysics N-body simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: An M-script API for linear algebra operations on GPU
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences
*Used with permission from Daniel Moth's "Taming GPU compute with C++ Accelerated Massive Parallelism"

CPUs vs GPUs Today
CPU: low memory bandwidth; higher power consumption; medium level of parallelism; deep execution pipelines; random accesses; supports general code; mainstream programming
GPU: high memory bandwidth; lower power consumption; high level of parallelism; shallow execution pipelines; sequential accesses; supports data-parallel code; niche programming
(images source: AMD) *Used with permission from Daniel Moth's "Taming GPU compute with C++ Accelerated Massive Parallelism"

C++ AMP: Accelerated Massive Parallelism
- Best for data parallelism; brings GPGPU to the masses
- Write C++ code that runs on the GPU
- Available as part of the Visual Studio 11 Developer Preview (when running VS11 on Windows 8 there is even GPGPU debugging!)
- Microsoft is submitting it as an open specification, and several other compiler vendors have committed to implementing AMP

Hello World: Array Addition

C++ AMP version:
#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC) {
    array_view<int, 1> a(n, pA);
    array_view<int, 1> b(n, pB);
    array_view<int, 1> sum(n, pC);
    parallel_for_each( sum.grid, [=](index<1> i) restrict(direct3d) {
        sum[i] = a[i] + b[i];
    } );
}

Serial version:
void AddArrays(int n, int * pA, int * pB, int * pC) {
    for (int i = 0; i < n; i++) {
        pC[i] = pA[i] + pB[i];
    }
}

*Used with permission from Daniel Moth's "Taming GPU compute with C++ Accelerated Massive Parallelism"

For your reference

General C++ Links
- Microsoft's MSDN C++ Developer Center
- CPlusPlus.com (great site for quick reference to C++ and the STL)
- Visual Studio Team Blog
- Herb Sutter's Blog (ISO C++ chairman and Microsoft software architect)

Parallel Programming in Native Code Blog
The best way to stay up to date: the Parallel Programming in Native Code blog, with great tutorials and more.
- How to pick your parallel sort? parallel-sort.aspx
- concurrent_vector and concurrent_queue explained and-concurrent-queue-explained.aspx
- Synchronization with the Concurrency Runtime (2 parts) with-the-concurrency-runtime.aspx
- Resource Management in the Concurrency Runtime (3 parts) management-in-the-concurrency-runtime-part-1.aspx

ConcRT Written Resources
- MSDN: Concurrency Runtime
- ConcRT Extras
- Parallel Programming with Microsoft Visual C++ (free book online; print and e-book editions not free)
- Introducing the Visual C++ Concurrency Runtime (59-page hands-on lab)
- Parallel Programming in Native Code Blog

ConcRT Video Resources
- Don McCrady: Parallelism in C++ Using the Concurrency Runtime Concurrency-Runtime
- The Concurrency Runtime: Fine-Grained Parallelism for C++ Grained-Parallelism-for-C
- Parallel Programming for C++ Developers: Tasks and Continuations (2 parts) Native-Code-Tasks-and-Continuations-Part-1-of-2
- Native Parallelism with the Parallel Patterns Library parallel-patterns-library

AMP Resources
- Herb Sutter: Heterogeneous Computing and C++ AMP (learn about the future of computing) Heterogeneous-Computing-and-C-AMP
- Taming GPU compute with C++ AMP
- Walkthrough: Debugging an AMP Application
- Daniel Moth's Blog (AMP project manager)

Conclusions
- C++ is a modern language
- C++ is the language of choice to: maximize speed, minimize power consumption, target the latest hardware, and have full control of your application
- Native concurrency using C++: PPL, Agents, and AMP provide a powerful set of tools to enable you to unlock your potential concurrency!!!
- C++ is AMPed!!!

Please fill out an evaluation form before you leave! If you would like a copy of this slide deck, please email me. If you would like more information, please contact me, or better yet, come to one of the local C++ user groups: the Houston C++ User Group (1st Thursday of each month) or the University of Houston C++ User Group (Wednesday before the 1st Thursday of each month).