CS 284a Lecture, Tuesday, 4 November 1997. Copyright (c) 1997-98, John Thornley.


Slide 1: CS 284a Lecture, Tuesday, 4 November 1997

Slide 2: PSRS Summary
Step 1: O(k) - Divide data into segments.
Step 2: O(n log2(n/k)) - Sort data segments.
Step 3: O(2k^2) - Sample sorted data segments.
Step 4: O(2k^2 log2(2k^2)) - Sort data sample.
Step 5: O(k) - Choose pivots from sorted data sample.
Step 6: O(k^2 log2(n/k)) - Partition sorted data segments.
Step 7: O(k^2) - Compute result partition sizes.
Step 8: O(n log2(k)) - Merge data into result partitions.
Notes:
– Almost all the time is spent in steps 2 and 8.
– The sampling steps increase the likelihood of a good pivot choice.
– An O(nk) version of step 8 would be simpler.
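The sampling and pivot-choice steps can be sketched in C. This is an illustrative reconstruction, not the course's code: `choose_pivots` and its layout assumptions (k sorted, equal-length segments concatenated in one array) are ours, and it samples k elements per segment as in the classic PSRS formulation, whereas the slide's operation counts suggest a 2k-per-segment variant.

```c
#include <stdlib.h>

/* Comparison function for qsort on ints. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Steps 3-5, sketched: take a regular sample of k elements from each of
 * the k sorted segments, sort the k*k-element sample, and choose k-1
 * pivots at regular intervals from it. */
void choose_pivots(const int *data, int n, int k, int *pivots /* k-1 out */) {
    int seg = n / k;                      /* segment length (assume k | n) */
    int *sample = malloc((size_t)k * k * sizeof(int));
    for (int i = 0; i < k; i++)           /* regular sample of each segment */
        for (int j = 0; j < k; j++)
            sample[i * k + j] = data[i * seg + j * seg / k];
    qsort(sample, (size_t)k * k, sizeof(int), cmp_int);  /* sort the sample */
    for (int i = 1; i < k; i++)           /* pivots at regular intervals */
        pivots[i - 1] = sample[i * k];
    free(sample);
}
```

Because each segment is sampled at regular positions, the sorted sample approximates the overall distribution, which is what makes the chosen pivots likely to split the data evenly.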

Slide 3: The Key Algorithm: Sequential K-Way Merge (Step 8)
[Diagram: the sorted data segments are merged into each result partition by a sequential k-way merge.]
Sequential complexity: O(n log2(k)) or O(nk).
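The merge itself can be sketched as the simpler O(nk) linear-scan variant the notes mention: each output element is found by scanning the current head of every segment. The function name and the start/end index arrays are illustrative; replacing the scan with a min-heap over the k heads gives the O(n log2(k)) version.

```c
#include <stdlib.h>

/* O(nk) sequential k-way merge: segment i occupies data[start[i]..end[i]-1]
 * and is already sorted.  Repeatedly scan the k segment heads for the
 * minimum and append it to result. */
void kway_merge(const int *data, const int *start, const int *end,
                int k, int *result) {
    int *pos = malloc((size_t)k * sizeof(int));
    for (int i = 0; i < k; i++) pos[i] = start[i];
    for (int out = 0; ; out++) {
        int min_seg = -1;
        for (int i = 0; i < k; i++)       /* linear scan of the k heads */
            if (pos[i] < end[i] &&
                (min_seg < 0 || data[pos[i]] < data[pos[min_seg]]))
                min_seg = i;
        if (min_seg < 0) break;           /* all segments exhausted */
        result[out] = data[pos[min_seg]++];
    }
    free(pos);
}
```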

Slide 4: PSRS: Multithreaded Performance Issues
Load balance:
– How evenly sized will the partitions be?
– What if the data is not uniformly distributed?
– What if there are many duplicates in the data?
– Can we solve load balancing by having k > t?
Algorithm overhead:
– How does sequential performance compare with quicksort?
– How does sequential performance depend on k?
Multithreading:
– What is the cost of thread creation? Should we use barriers?
– What are the cache/memory access issues?

Slide 5: Multithreading Performance Issues
– Thread maintenance overheads.
– Load balancing.
– Granularity.
– Memory contention.
– Underlying algorithm.

Slide 6: Multithreading Performance Issues: Thread Maintenance Overheads
– Thread creation and termination costs.
– Thread scheduling costs.
– Thread synchronization costs.

Slide 7: Thread Creation and Termination Costs

#pragma multithreadable mapping(blocked(4))
for (i = 0; i < N; i++)
    f(i);

[Diagram: ideal vs. real execution - the real version adds sequential startup/shutdown overhead and per-thread creation/termination overhead.]
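The blocked(4) mapping uses the course's pragma notation; a rough POSIX-threads equivalent might look like the sketch below. The names (`block_worker`, `run_blocked`) and the squaring body of `f` are ours. The `pthread_create` and `pthread_join` loops are exactly where the creation and termination overheads on this slide appear.

```c
#include <pthread.h>

#define N 1000
#define NTHREADS 4

static int results[N];
static void f(int i) { results[i] = i * i; }   /* stand-in loop body */

/* Each thread runs one contiguous block of iterations - the explicit
 * version of the blocked mapping. */
static void *block_worker(void *arg) {
    int t = (int)(long)arg;
    int lo = t * N / NTHREADS, hi = (t + 1) * N / NTHREADS;
    for (int i = lo; i < hi; i++)
        f(i);
    return NULL;
}

void run_blocked(void) {
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)          /* creation overhead */
        pthread_create(&tid[t], NULL, block_worker, (void *)(long)t);
    for (int t = 0; t < NTHREADS; t++)          /* termination overhead */
        pthread_join(tid[t], NULL);
}
```

For a loop body as cheap as this one, the create/join costs can easily exceed the useful work, which is the point of the diagram above.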

Slide 8: Thread Scheduling Model
[Diagram: a pool of runnable threads and a pool of suspended threads are scheduled onto a pool of processors.]

Slide 9: Thread Scheduling Costs
Threads can be either runnable or suspended.
Scheduling policy:
– Runnable threads replace suspended threads.
– High-priority runnable threads replace low-priority runnable threads.
– (Preemption) Idle runnable threads replace running runnable threads.
A thread switch takes time.
Cache reloading takes time.

Slide 10: Thread Synchronization Costs
[Diagram: ideal barrier vs. real barrier - the real barrier adds arrival overhead and departure overhead.]

Slide 11: Multithreading Performance Issues: Load Balancing
Load balancing = keeping all processors busy.
A large number of small, equal-sized threads gives better load balance.
[Diagram: processor-time charts comparing 9 threads with 18 threads.]
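One common way to get the small, equal-sized pieces this slide recommends is an even blocked partition; a minimal sketch (the helper name is ours):

```c
/* Size of block i when n items are split into k blocks as evenly as
 * possible: every block gets n/k items, and the first n%k blocks take
 * one extra, so block sizes differ by at most one item. */
int block_size(int n, int k, int i) {
    return n / k + (i < n % k ? 1 : 0);
}
```

With k chosen larger than the processor count (as in the 18-thread chart), the scheduler can hand out the extra blocks to whichever processors finish first.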

Slide 12: Multithreading Performance Issues: Granularity
Granularity = a measure of the amount of computation between threading and synchronization operations.
Fine-grained = little computation between operations.
Coarse-grained = lots of computation between operations.
Balance required:
– Too fine-grained = too much threading overhead.
– Too coarse-grained = poor load balancing.

Slide 13: Multithreading Performance Issues: Memory Contention
Cache misses are very expensive and can cause memory contention.
Rewrite the program to increase memory access locality.
Optimizing sequential cache behavior will minimize multithreaded memory contention. This is a big advantage of our "multithreaded like sequential" development methodology.
Trap to watch out for: "false sharing".
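False sharing occurs when threads write to distinct variables that happen to share a cache line, so each write invalidates the other threads' copies. A standard fix is to pad per-thread data to the cache-line size; a sketch assuming 64-byte lines (the struct and worker names are ours):

```c
#include <pthread.h>

#define NT 4
#define CACHE_LINE 64

/* Per-thread counters padded to an assumed 64-byte cache line, so each
 * thread's counter lives on its own line and updates do not invalidate
 * the other threads' cached lines. */
struct padded_counter {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
};
static struct padded_counter counters[NT];

static void *count_worker(void *arg) {
    int t = (int)(long)arg;
    for (int i = 0; i < 100000; i++)
        counters[t].count++;          /* touches only this thread's line */
    return NULL;
}

void run_counters(void) {
    pthread_t tid[NT];
    for (int t = 0; t < NT; t++)
        pthread_create(&tid[t], NULL, count_worker, (void *)(long)t);
    for (int t = 0; t < NT; t++)
        pthread_join(tid[t], NULL);
}
```

Without the `pad` member, the four `long` counters would share one or two cache lines and the per-increment invalidation traffic would serialize much of the work.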

Slide 14: Multithreading Performance Issues: Underlying Algorithm
The sequential component limits speedup (Amdahl's law).
A multithreaded algorithm may be less efficient than the best sequential algorithm.
Partitioning the problem may increase total workload (e.g., the PSRS algorithm).
Partitioning the problem may decrease total workload (e.g., route optimization).
Partitioning the problem will change cache behavior, often for the better.
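The Amdahl's law bound mentioned above can be written as a one-liner: with sequential fraction s of the work and p processors, speedup is at most 1 / (s + (1 - s)/p). The function name is ours.

```c
/* Upper bound on speedup from Amdahl's law: s is the fraction of the
 * work that must run sequentially, p the number of processors. */
double amdahl_speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}
```

Even a modest sequential component is punishing: with s = 0.25, no number of processors can push speedup above 4.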

Slide 15: Multithreading Performance Issues: Summary
No magic answer to obtaining performance.
Granularity is very important:
– Choose enough threads to provide good load balance.
– But make sure threads are not too fine-grained.
Memory access patterns are very important:
– Optimize cache behavior in the sequential interpretation.
– Watch out for false sharing.
The underlying algorithm is very important:
– There is no point speeding up a slow algorithm.
– Consider the effect of multithreading on total workload.
Fortunately, Windows NT is very flexible.