DEV490 Easy Multi-threading for Native .NET Apps with OpenMP™ and Intel® Threading Toolkit. Software Application Engineer, Intel.


DEV490 Easy Multi-threading for Native .NET Apps with OpenMP™ and Intel® Threading Toolkit. Software Application Engineer, Intel EMEA

Agenda: Hyper-Threading Technology and what to do about it (the answer is: multithreading!); introduction to OpenMP™; maximizing performance with OpenMP and Intel® Threading Tools.

How HT Technology Works [diagram: one physical processor appears to the OS as two logical processors; processor resources are allocated to two threads over time, with and without Hyper-Threading] Higher resource utilization and higher output with two simultaneous threads.

HT Technology, Not Magic [diagram: a multiprocessor duplicates everything (architectural state, data caches, CPU front end, out-of-order execution engine), while Hyper-Threading duplicates only the architectural state on a single core] HT Technology increases processor performance by improving resource utilization.

HT Technology: What's Inside [diagram: duplicated per-thread resources include the Instruction TLB, Next Instruction Pointer, Instruction Streaming Buffers, Trace Cache Fill Buffers, Register Alias Tables, Trace Cache Next IP, and Return Stack Predictor] HT Technology increases processor performance at extremely low cost.

Taking Advantage of HT. HT is transparent to the OS and applications: software simply "sees" multiple CPUs. Software usage scenarios that benefit from HT: multithreading (inside one application) and multitasking (among several applications). OS support enhances HT benefits: smart scheduling that accounts for logical CPUs and halting logical CPUs in the idle loop, implemented in Windows* XP and Linux* 2.4.x. To take advantage of HT Technology, multithread your application!

Multithreading Your Application. What? Multithreading strategy. How? Multithreading implementation.

Multithreading Strategy Exploit Task Parallelism Exploit Data Parallelism

Exploiting Task Parallelism. Partition work into disjoint tasks and execute the tasks concurrently, on separate threads (e.g., a Compress thread and an Encrypt thread working on the same data).

Exploiting Data Parallelism. Partition data into disjoint sets and assign the sets to concurrent threads (Thread A … Thread N).

Multithreading Implementation. API / library: Win32* threading API, P-threads, MPI. Programming language mechanisms: Java*, C#. Programming language extension: OpenMP™.

What is OpenMP™? A compiler extension for easy multithreading, ideally WITHOUT CHANGING C/C++ CODE. Three components: #pragmas (compiler directives, the most important part), an API and runtime library, and environment variables. Benefits: an easy way to exploit Hyper-Threading, portable and standardized, allows incremental parallelization, dedicated profiling tools.

Programming Model. Fork-join parallelism: the main thread creates a team of additional threads, and the team threads join at the end of the parallel region. All memory (variables) is shared by default. Parallelism can be added incrementally: a sequential program evolves into a parallel one. [diagram: main thread forking a team of threads at each parallel region]
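A minimal sketch of the fork-join model (the function name hello and the printf output are illustrative, not from the deck; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP runtime calls):

#include <stdio.h>
#include <omp.h>

void hello()
{
    // fork: the main thread creates a team of threads here
    #pragma omp parallel
    {
        // every thread in the team executes this block
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    // join: the team threads finish; only the main thread continues
}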

Our First OpenMP™ Program. All OpenMP™ pragmas start with omp. Pragma parallel starts a team of n threads. Everything inside is executed n times! All memory is shared!

Serial version:
void func() {
    int a, b, c, d;
    a = a + b;
    c = c + d;
}

With #pragma omp parallel:
void func() {
    int a, b, c, d;
    #pragma omp parallel
    {
        a = a + b;
        c = c + d;
    }
}

Task Assignment to Threads. Work-sharing pragmas assign tasks to threads. Pragma sections assigns non-iterative tasks.

void func() {
    int a, b, c, d;
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            a = a + b; // one thread
            #pragma omp section
            c = c + d; // another thread
        }
    }
}

Work-sharing Pragmas. Work-sharing pragmas can be merged with pragma parallel.

void func() {
    int a, b, c, d;
    #pragma omp parallel sections
    {
        #pragma omp section
        a = a + b; // one thread
        #pragma omp section
        c = c + d; // another thread
    }
}

#pragma omp parallel sections
{
    #pragma omp section
    do_compress();
    #pragma omp section
    do_encrypt();
}

Two functions execute in parallel: #pragma omp parallel sections implements Task-Level Parallelism!

Work-sharing for Loops. Pragma for assigns loop iterations to threads. There are many ways to do this… By the way, what does the example do??

void func() {
    int a[N], sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }
}

#pragma omp parallel for implements Data-Level Parallelism!

Variable Scope Rules. Implicit Rule 1: all variables defined outside omp parallel are shared by all threads. Implicit Rule 2: all variables defined inside omp parallel are local to every thread. Implicit Exception: in omp for, the loop counter is always local to every thread. Explicit Rule 1: variables listed in the shared() clause are shared by all threads. Explicit Rule 2: variables listed in the private() clause are local to every thread.

What's private, what's shared?

void func() {
    int a, i;
    #pragma omp parallel for \
        shared(c) private(d, e)
    for (i = 0; i < N; i++) {
        int b, c, d, e;
        a = a + b;
        c = c + d * e;
    }
}

Synchronization Problem. The C statement sum++ compiles to three CPU instructions:

mov eax, [sum]   ; load sum into a register
add eax, 1       ; increment the register
mov [sum], eax   ; store the result back

[diagram: two threads interleave these instructions clock by clock; both load the same value s, both compute s+1 in their own eax, and both store s+1, so one increment is lost] Not what we expected!

Synchronization Pragmas. #pragma omp single: execute the next operator by one (arbitrary) thread only. #pragma omp barrier: hold threads here until all arrive at this point. #pragma omp atomic: execute the next memory-access operation atomically (i.e. without interruption from other threads). #pragma omp critical [name]: let only one thread at a time execute the next operation.

int a[N], sum = 0;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    #pragma omp critical
    sum += a[i]; // one thread at a time
}
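For a single memory update such as this one, the critical section above could also be written with omp atomic, which lets the compiler use a cheaper hardware-level atomic update; a minimal sketch mirroring the slide's sum loop:

int a[N], sum = 0;
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    #pragma omp atomic
    sum += a[i]; // atomic read-modify-write, no named lock needed
}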

omp for schedule Schemes. The schedule clause defines how loop iterations are assigned to threads. It is a compromise between two opposing goals: best thread load balancing with minimal controlling overhead. [diagram: distribution of N loop iterations under a single thread, schedule(static), schedule(dynamic, c), and schedule(guided, f)]
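A sketch of the three schedule clauses from the diagram, assuming a loop whose iterations vary in cost; work() and the chunk size 4 are illustrative, not from the deck:

#pragma omp parallel for schedule(static)       // contiguous, equal-size blocks; lowest overhead
for (int i = 0; i < N; i++) work(i);

#pragma omp parallel for schedule(dynamic, 4)   // threads grab chunks of 4 iterations as they finish
for (int i = 0; i < N; i++) work(i);

#pragma omp parallel for schedule(guided, 4)    // chunk size shrinks toward 4; balances load with less overhead
for (int i = 0; i < N; i++) work(i);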

Reduction. Can we do better than synchronizing all accesses to one common accumulator? Alternative: a private accumulator in every thread; to produce the final value, all private accumulators are summed up (reduced) at the end. Reduction ops: +, *, -, &, ^, |, &&, ||.

int a[N], sum = 0;
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < N; i++) {
    sum += a[i]; // no synchronization needed
}
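What reduction(+: sum) does can be sketched by hand with a private accumulator per thread, combined once at the end; a rough equivalent for illustration, not the deck's code:

int a[N], sum = 0;
#pragma omp parallel
{
    int local = 0;                  // private accumulator in every thread
    #pragma omp for
    for (int i = 0; i < N; i++)
        local += a[i];              // no synchronization in the hot loop
    #pragma omp critical
    sum += local;                   // combine once per thread at the end
}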

OpenMP™ Quiz. What do these things do? omp parallel, omp sections, omp for, omp parallel for, private, shared, reduction, schedule, omp critical, omp barrier.

OpenMP™ Pragma Cheatsheet. Fundamental pragma, starts a team of threads: omp parallel. Work-sharing pragmas (can merge with parallel): omp sections, omp for. Clauses for work-sharing pragmas: private, shared, reduction (variable scope); schedule (scheduling control). Synchronization pragmas: omp critical [name], omp barrier.

What Else Is in OpenMP™? Advanced variable scope and initialization. Advanced synchronization pragmas. OpenMP™ API and environment variables: control the number of threads, manual low-level synchronization… Task queue model of work-sharing: an Intel-specific extension until standardized. Compatibility with Win32* and P-threads.
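A hedged sketch of controlling the team size with the runtime API (omp_set_num_threads and omp_get_max_threads are standard OpenMP calls; the OMP_NUM_THREADS environment variable achieves the same without recompiling; the function name is illustrative):

#include <omp.h>

void configure_threads()
{
    omp_set_num_threads(4);          // request a team of 4 for subsequent parallel regions
    int max = omp_get_max_threads(); // upper bound the runtime will use
    #pragma omp parallel
    {
        // at most 'max' threads execute here
    }
}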

Task Queue Work-sharing. For more complex iteration models beyond counted for-loops.

Serial traversal:
struct Node { int data; Node* next; };
Node* head;
for (Node* p = head; p != NULL; p = p->next) {
    process(p->data);
}

With the Intel task queue extension:
Node* p;
#pragma intel omp taskq shared(p)
for (p = head; p != NULL; p = p->next) // only one thread runs the loop
{
    #pragma intel omp task
    process(p->data); // queued task, executed on any available thread
}

Related TechEd Sessions. DEV490: Easy Multi-threading for Native Microsoft® .NET Apps with OpenMP™ and Intel® Threading Toolkit. HOLINT02: Multithreading for Microsoft® .NET Made Easy with OpenMP™ and Intel® Threading Toolkit. HOLINT01: Using Hyper-Threading Technology to Enhance Your Native and Managed Microsoft® .NET Applications.

Summary & Call to Action. Hyper-Threading Technology increases processor performance at very low cost. HT Technology is available in mass-market consumer desktops today. To take advantage of HT Technology, multithread your application! Use OpenMP™ to multithread your application incrementally and easily! Use Intel Threading Tools to achieve maximum performance.

Resources: developer.intel.com/software/products/compilers/ (Intel® Compiler User Guide)

THANK YOU! developer.intel.com Please remember to complete your online Evaluation Form!!

Reference

Quest to Exploit Parallelism

Level of Parallelism | Exploited by | Technology | Problem
Instruction | Multiple, pipelined, out-of-order execution units; branch prediction | NetBurst™ microarchitecture (pipelined, superscalar, out-of-order, speculative) | Utilization
Data | SIMD | MMX, SSE, SSE2 | Scope
Task, Thread | Multiple processors | SMP | Price

Watch HT in Action! [animation: In-Order Pipeline, Out-of-Order Pipeline, In-Order retirement]

Intel® Compiler Switches for Multithreading & OpenMP™. Automatic parallelization: /Qparallel. OpenMP™ support: /Qopenmp, /Qopenmp_report{0|1|2}.
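When a file is built without /Qopenmp, the omp pragmas are simply ignored and the code stays serial; the standard _OPENMP macro lets a program check which way it was compiled (a minimal sketch, not from the deck):

#include <stdio.h>

int main()
{
#ifdef _OPENMP
    printf("built with OpenMP support (e.g. /Qopenmp), _OPENMP = %d\n", _OPENMP);
#else
    printf("built without OpenMP: omp pragmas are ignored, code runs serially\n");
#endif
    return 0;
}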

Key Terms. Multiprocessing (MP): hardware technology to increase processor performance by increasing the number of CPUs. Hyper-Threading Technology (HT): hardware technology to increase processor performance by improving CPU utilization. Multithreading (MT): software technology to improve software functionality and increase software performance by utilizing multiple (logical) CPUs.

Community Resources. Most Valuable Professional (MVP). Newsgroups: converse online with Microsoft newsgroups, including worldwide. User Groups: meet and learn with your peers.


© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.