1 Multithreaded Programming Concepts 2010. 3. 12 Myongji University Sugwon Hong

2 Why Multi-Core? Until recently, increasing the clock frequency was the holy grail for processor designers seeking to boost performance. But raising clock speed has hit a dead end because of power consumption and overheating. Designers realized that it is much more efficient to run several cores at a lower frequency than a single core at a much higher frequency.

3 Power and Frequency (source : Intel Academy program)

4 A little bit of history In the past, performance scaling in single-core processors was achieved by increasing the clock frequency. But as processors shrank and clock frequencies rose, two problems emerged: excess power consumption and overheating, and memory access times that failed to keep pace with the increasing clock frequencies.

5 Instruction/data-level parallelism Since 1993, processor designers have supported parallel execution at the instruction and data levels. Instruction-level parallelism: out-of-order execution pipelines and multiple functional units to execute instructions in parallel. Data-level parallelism: Multimedia Extensions (MMX) in 1997, Streaming SIMD Extensions (SSE).

6 Hyper-Threading In 2002, Intel used additional copies of some processor resources to execute two separate threads simultaneously on the same processor core. This multithreading idea eventually led to the introduction of the dual-core processor in 2005.

7 Evolution of Multi-Core Technology (source : Intel Academy program)

8 Multi-processors Architecture Shared memory multiprocessor (SMP); non-shared memory architectures: Massively Parallel Processor (MPP), cluster. [diagram: CPUs attached to one shared memory (SMP) vs. CPUs with local memories joined by an interconnect (MPP)]

9 Multi-processors vs. Multi-cores Shared memory multi-processors (SMP); multiple threads on a single core (SMT); multiple threads on multiple cores (CMT). Tricky acronyms: CMP (Chip Multi-Processor), SMT (Simultaneous MultiThreading), CMT (Chip-level MultiThreading).

10 CMT processor products 1st generation: Sun Microsystems (late 2005); Intel Dual-Core Xeon (2005); Intel Quad-Core Xeon (late 2006); AMD Quad-Core Opteron (2007); 8-Core (??)

11 Thread A thread is a sequential flow of instructions executed within a program. Thread vs. Process: a process always has one main thread, which initializes the process and begins executing instructions. Any thread can create other threads within the process; the threads share the code and data segments, but each thread has its own stack.
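A minimal sketch of these ideas in C, assuming POSIX threads (the slides do not name a threading API): the process's main thread creates two workers; all threads see the same global variable, but each worker's local variable lives on its own stack. Compile with gcc -pthread.

    #include <pthread.h>
    #include <stdio.h>

    int shared = 0;                    /* data segment: visible to every thread */

    void *worker(void *arg) {
        int id = *(int *)arg;          /* 'id' lives on this thread's own stack */
        shared += id;                  /* unsynchronized here; see slide 15 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;
        pthread_create(&t1, NULL, worker, &id1);  /* main thread spawns workers */
        pthread_create(&t2, NULL, worker, &id2);
        pthread_join(t1, NULL);        /* wait for both workers to finish */
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);
        return 0;
    }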

12 Thread in a Process [diagram: threads inside a process sharing its code and data segments, each with its own stack]

13 Why use threads? Threads are intended to improve the performance and responsiveness of a program. Quick turnaround time: completing a single job in the smallest amount of time possible. High throughput: finishing the most tasks in a fixed amount of time.

14 Risks of using Threads But if threads are not used properly, they can degrade performance and sometimes cause unpredictable behavior and error conditions: data races (race conditions), deadlock. They also impose extra burdens: code complexity, portability issues, testing and debugging difficulty.

15 Race condition It happens when two or more threads access a shared variable without synchronization. "It is nondeterministic!" For example, when Thread A and Thread B both execute the statement area = area / (1.0 + x*x)
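A sketch of the lost-update problem in C (POSIX threads assumed; the loop count is arbitrary and only there to make the race observable): both threads repeatedly perform the read-divide-write sequence on the shared variable, so interleavings can lose updates and the final value can change from run to run.

    #include <pthread.h>
    #include <stdio.h>

    double area = 100.0;                 /* shared variable */

    void *update(void *unused) {
        for (int i = 0; i < 1000000; ++i)
            area = area / (1.0 + 1e-7);  /* read, divide, write back: the steps
                                            of two threads can interleave */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, update, NULL);
        pthread_create(&b, NULL, update, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("area = %.10f\n", area);  /* may differ between runs */
        return 0;
    }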

16 (source : Intel Academy program)

17 How to deal with race conditions Synchronization: enclose accesses to shared data in a critical region and enforce mutual exclusion.
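A sketch of the fix for the race above, again assuming POSIX threads: a mutex turns the read-modify-write into a critical region that only one thread at a time can execute (mutual exclusion).

    #include <pthread.h>

    double area = 100.0;
    pthread_mutex_t area_lock = PTHREAD_MUTEX_INITIALIZER;

    void *update(void *unused) {
        for (int i = 0; i < 1000000; ++i) {
            pthread_mutex_lock(&area_lock);    /* enter the critical region */
            area = area / (1.0 + 1e-7);        /* now safe from interleaving */
            pthread_mutex_unlock(&area_lock);  /* leave the critical region */
        }
        return NULL;
    }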

18 Concurrency vs. Parallelism The two terms are often used interchangeably, but conventional wisdom draws the following distinction. Concurrency: two or more threads are in progress simultaneously, normally interleaved on a single processor. Parallelism: two or more threads are executed simultaneously on multiple cores.

19 Performance criteria Speedup Efficiency Granularity Load balance

20 Speedup The most straightforward quantitative measure compares the execution time of the best serial algorithm with that of the parallel algorithm. Speedup = Ts/Tp, where Ts = serial time and Tp = parallel time. Amdahl's Law: Speedup = 1/[S + (1-S)/n + H(n)], where S is the fraction of time spent executing the serial portion, H(n) is the parallel overhead, and n is the number of cores.
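A short sketch of the formula in C (the constant overhead passed for H is an illustrative simplification; in general H(n) depends on n):

    #include <stdio.h>

    /* Speedup = 1 / (S + (1-S)/n + H(n)), per the slide's Amdahl's Law form */
    double amdahl_speedup(double S, int n, double H) {
        return 1.0 / (S + (1.0 - S) / n + H);
    }

    int main(void) {
        /* e.g. 10% serial work, zero overhead: speedup saturates near 1/S = 10 */
        for (int n = 1; n <= 64; n *= 2)
            printf("n = %2d  speedup = %.2f\n", n, amdahl_speedup(0.10, n, 0.0));
        return 0;
    }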

21 Example Consider painting a fence. Suppose it takes 30 min to get ready to paint and 30 min to clean up afterwards. Assume it takes 1 min to paint a single picket and there are 300 pickets. What are the speedups when 1, 2, 10, and 100 painters do this job? What is the maximum speedup? What if you use a spray gun to paint the fence? What happens if the fence owner uses a spray gun to paint the 300 pickets in 1 hour?

22 Parallel Efficiency A measure of how efficiently core resources are used during parallel computation. In the previous example, suppose you learned that the painters were busy for an average of less than 6% of the total job time but were still paid for the whole time. Do you think you got your money's worth from the 100 painters? Efficiency = (Speedup / Number of Threads) * 100%
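One way to work the fence numbers from slide 21 with this formula (a sketch; it assumes the 60 minutes of setup and cleanup stay serial and the 300 picket-minutes divide evenly among painters): Ts = 360 min and Tp(n) = 60 + 300/n, so the speedup tops out at 360/60 = 6, and 100 painters are only about 5.7% efficient, matching the "less than 6%" figure above.

    #include <stdio.h>

    int main(void) {
        int painters[] = {1, 2, 10, 100};
        double Ts = 30 + 300 + 30;                  /* serial time in minutes */
        for (int i = 0; i < 4; ++i) {
            int n = painters[i];
            double Tp = 60 + 300.0 / n;             /* serial part + shared pickets */
            double speedup = Ts / Tp;
            double efficiency = speedup / n * 100;  /* this slide's formula */
            printf("n = %3d  speedup = %.2f  efficiency = %.1f%%\n",
                   n, speedup, efficiency);
        }
        return 0;
    }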

23 Granularity The ratio of computation to synchronization. Coarse-grained: concurrent threads do a large amount of computation between synchronization events. Fine-grained: concurrent threads do very little computation between synchronization events.
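A sketch of the two extremes in C (POSIX threads assumed; N and the data array passed in are hypothetical): the fine-grained version synchronizes on every iteration, while the coarse-grained version accumulates privately and synchronizes once.

    #include <pthread.h>

    #define N 1000000
    double total = 0.0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *fine_grained(void *arg) {     /* lock per element: mostly sync, little work */
        double *data = arg;
        for (int i = 0; i < N; ++i) {
            pthread_mutex_lock(&lock);
            total += data[i];
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    void *coarse_grained(void *arg) {   /* big block of work, one sync at the end */
        double *data = arg, local = 0.0;
        for (int i = 0; i < N; ++i)
            local += data[i];
        pthread_mutex_lock(&lock);
        total += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }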

24 Load Balance Balancing the workload among multiple threads. If more work is assigned to some threads, the others will sit idle until the threads with more work finish. All the cores must be kept busy to get maximum performance. For load balancing, which task size is better: large or small?

25 Flash Demo

26 Computer Memory Hierarchy CPU; L1 cache: ~1 cycle; L2 cache: ~1-10 cycles; main memory: ~100s of cycles; disk: ~1000s of cycles.

27 Architecture consideration (1) To obtain better performance, we need to understand how the work is done inside the machine. Cache: data moves between memory and caches in cache-line units (a cache line, or cache block, is e.g. 64 bytes). Caches may be shared or separate between cores; a cache miss is very costly; cache coherency must be maintained when caches are separate; replacement policies such as LRU.

28 Architecture consideration (2) Memory management: paging, translation lookaside buffer (TLB). Inside the CPU: registers.

29 False sharing Assume the cache line is 64 bytes. What happens if the two threads below try to execute at the same time?
Thread 1:
    int a[1000]; int b[1000];        /* arrays shared by both threads */
    while (...) a[998] = i * 1000;   /* writes near the end of a */
Thread 2:
    while (...) b[0] = i;            /* writes the start of b, which can sit
                                        on the same cache line as a[998] */
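A runnable sketch of the effect (POSIX threads assumed; whether two variables actually share a line depends on layout and alignment, so the sizes here are illustrative): each thread writes only its own counter, but if both counters sit on one 64-byte line the writes keep invalidating each other's cached copy; padding each counter out to a full line removes the false sharing.

    #include <pthread.h>
    #include <stdio.h>

    struct padded {
        long count;
        char pad[64 - sizeof(long)];  /* fill out the rest of the cache line */
    };

    struct padded counters[2];        /* with the padding, counters[0] and
                                         counters[1] land on different lines;
                                         drop 'pad' to see false sharing */

    void *worker(void *arg) {
        struct padded *c = arg;
        for (long i = 0; i < 100000000L; ++i)
            c->count++;               /* each thread touches only its own field */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, &counters[0]);
        pthread_create(&t2, NULL, worker, &counters[1]);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", counters[0].count, counters[1].count);
        return 0;
    }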

30 Poor cache utilization What is the difference between the following two codes?
    int a[1000][1000];
    for (i = 0; i < 100; ++i)
        for (j = 0; j < 1000; ++j)
            a[i][j] = i * j;   /* row-major order: walks memory contiguously */

    int b[1000][1000];
    for (i = 0; i < 100; ++i)
        for (j = 0; j < 1000; ++j)
            b[j][i] = i * j;   /* column order: strides 1000 ints per access,
                                  touching a new cache line almost every time */

31 Poor Cache Utilization - with eggs (source : Intel Academy program)

32 Good Cache Utilization – with eggs (source : Intel Academy program)