1 Programming Multi-Core Processors based Embedded Systems A Hands-On Experience on Cavium Octeon based Platforms Lecture 2 (Mapping Applications to Multi-core Arch)

2 KICS, UET Cavium Univ Program © Course Outline Introduction Multi-threading on multi-core processors Developing parallel applications Introduction to POSIX based multi-threading Multi-threaded application examples Applications for multi-core processors Application layer computing on multi-core Performance measurement and tuning

3 KICS, UET Cavium Univ Program © Agenda for Today Mapping applications to multi-core architectures Parallel programming using threads POSIX multi-threading Using multi-threading for parallel programming

4 Mapping Applications to Multi-Core Architectures Chapter 2 David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

5 KICS, UET Cavium Univ Program © Parallelization Assumption: Sequential algorithm is given Sometimes need very different algorithm, but beyond scope Pieces of the job: Identify work that can be done in parallel Partition work and perhaps data among processes Manage data access, communication and synchronization Note: work includes computation, data access and I/O Main goal: Speedup (plus low prog. effort and resource needs) Speedup (p) = Performance(p) / Performance(1) For a fixed problem: Speedup (p) = Time(1) / Time(p)

6 KICS, UET Cavium Univ Program © Steps in Creating a Parallel Program 4 steps: Decomposition, Assignment, Orchestration, Mapping. Done by programmer or system software (compiler, runtime, ...). Issues are the same, so assume programmer does it all explicitly.

7 KICS, UET Cavium Univ Program © Some Important Concepts Task: Arbitrary piece of undecomposed work in parallel computation Executed sequentially; concurrency is only across tasks E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace Fine-grained versus coarse-grained tasks Process (thread): Abstract entity that performs the tasks assigned to processes Processes communicate and synchronize to perform their tasks Processor: Physical engine on which process executes Processes virtualize machine to programmer first write program in terms of processes, then map to processors

8 KICS, UET Cavium Univ Program © Decomposition Break up computation into tasks to be divided among processes Tasks may become available dynamically No. of available tasks may vary with time i.e., identify concurrency and decide level at which to exploit it Goal: Enough tasks to keep processes busy, but not too many No. of tasks available at a time is upper bound on achievable speedup

9 KICS, UET Cavium Univ Program © Limited Concurrency: Amdahl’s Law
Most fundamental limitation on parallel speedup.
If fraction $s$ of sequential execution is inherently serial, speedup $\le 1/s$.
Example: 2-phase calculation
  sweep over n-by-n grid and do some independent computation
  sweep again and add each value to global sum
Time for first phase = $n^2/p$.
Second phase serialized at global variable, so time = $n^2$.
Speedup $\le \dfrac{2n^2}{\frac{n^2}{p} + n^2}$, or at most 2.
Trick: divide second phase into two
  accumulate into private sum during sweep
  add per-process private sum into global sum
Parallel time is $n^2/p + n^2/p + p$, and speedup at best $\dfrac{2n^2 p}{2n^2 + p^2}$.
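
The two-phase example can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical and time is counted in units of one grid-point operation.

```c
#include <assert.h>

/* Time unit: one grid-point operation. All names are illustrative. */

/* Both phases run sequentially: n^2 + n^2. */
double time_sequential(double n) { return 2.0 * n * n; }

/* Phase 1 parallel, phase 2 serialized on the global sum: n^2/p + n^2. */
double time_serialized_sum(double n, double p) {
    assert(p >= 1.0);
    return n * n / p + n * n;
}

/* Private partial sums, then p serialized additions: n^2/p + n^2/p + p. */
double time_private_sums(double n, double p) {
    assert(p >= 1.0);
    return 2.0 * n * n / p + p;
}

/* Fixed-problem speedup: Time(1) / Time(p). */
double speedup(double t1, double tp) { return t1 / tp; }
```

For n = 1024 and p = 16, the serialized version stays below 2 (32/17, about 1.88), while the private-sum version comes close to 16.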

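10 KICS, UET Cavium Univ Program © Pictorial Depiction [Figure: three concurrency profiles, panels (a), (b) and (c); vertical axis: work done concurrently (from 1 to p); horizontal axis: Time; phase durations labeled $n^2$ and $n^2/p$.]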

11 KICS, UET Cavium Univ Program © Concurrency Profiles Cannot usually divide into serial and parallel part. Area under curve is total work done, or time with 1 processor. Horizontal extent is lower bound on time (infinite processors). Speedup is the ratio $\dfrac{\sum_{k=1}^{\infty} f_k \, k}{\sum_{k=1}^{\infty} f_k \left\lceil k/p \right\rceil}$; base case: $\dfrac{1}{s + \frac{1-s}{p}}$. Amdahl’s law applies to any overhead, not just limited concurrency.
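
A small sketch of the concurrency-profile ratio follows. It assumes that f[k] is the length of time during which exactly k units of work are available concurrently; that reading, and the function name, are assumptions of this sketch and are not spelled out on the slide.

```c
#include <assert.h>
#include <stddef.h>

/*
 * f[k] = time (with unlimited processors) spent at concurrency k,
 * for k = 1..kmax; f[0] is unused.
 * Speedup(p) = sum f[k]*k / sum f[k]*ceil(k/p).
 */
double profile_speedup(const double *f, size_t kmax, size_t p) {
    assert(p > 0);
    double work = 0.0;    /* area under the profile */
    double time_p = 0.0;  /* time on p processors */
    for (size_t k = 1; k <= kmax; k++) {
        work += f[k] * (double)k;
        time_p += f[k] * (double)((k + p - 1) / p);  /* integer ceil(k/p) */
    }
    assert(time_p > 0.0);
    return work / time_p;
}
```

A profile with f[1] = s and f[p] = (1-s)/p reproduces the base case 1/(s + (1-s)/p); for s = 0.25 and p = 4 the result is 1/0.4375, roughly 2.29.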

12 KICS, UET Cavium Univ Program © Assignment Specifying mechanism to divide work up among processes E.g. which process computes forces on which stars, or which rays Together with decomposition, also called partitioning Balance workload, reduce communication and management cost Structured approaches usually work well Code inspection (parallel loops) or understanding of application Well-known heuristics Static versus dynamic assignment As programmers, we worry about partitioning first Usually independent of architecture or prog model But cost and complexity of using primitives may affect decisions As architects, we assume program does reasonable job of it

13 KICS, UET Cavium Univ Program © Orchestration Includes: Naming data Structuring communication Synchronization Organizing data structures and scheduling tasks temporally Goals Reduce cost of communication and synch. as seen by processors Preserve locality of data reference (incl. data structure organization) Schedule tasks to satisfy dependences early Reduce overhead of parallelism management Closest to architecture (and programming model & language) Choices depend a lot on comm. abstraction, efficiency of primitives Architects should provide appropriate primitives efficiently

14 KICS, UET Cavium Univ Program © Mapping After orchestration, already have parallel program Two aspects of mapping: Which processes will run on same processor, if necessary Which process runs on which particular processor (mapping to a network topology) One extreme: space-sharing Machine divided into subsets, only one app at a time in a subset Processes can be pinned to processors, or left to OS Another extreme: complete resource management control to OS OS uses the performance techniques we will discuss later Real world is between the two User specifies desires in some aspects, system may ignore Usually adopt the view: process <-> processor

15 KICS, UET Cavium Univ Program © Parallelizing Computation vs. Data Above view is centered around computation Computation is decomposed and assigned (partitioned) Partitioning Data is often a natural view too Computation follows data: owner computes Grid example; data mining; High Performance Fortran (HPF) But not general enough Distinction between comp. and data stronger in many applications Barnes-Hut, Raytrace (later) Retain computation-centric view Data access and communication is part of orchestration

16 KICS, UET Cavium Univ Program © High-level Goals High performance (speedup over sequential program) But low resource usage and development effort Implications for algorithm designers and architects Algorithm designers: high-perf., low resource needs Architects: high-perf., low cost, reduced programming effort e.g. gradually improving perf. with programming effort may be preferable to sudden threshold after large programming effort

17 KICS, UET Cavium Univ Program © Parallelization of An Example Program Motivating problems all lead to large, complex programs Examine a simplified version of a piece of Ocean simulation Iterative equation solver Illustrate parallel program in low-level parallel language C-like pseudocode with simple extensions for parallelism Expose basic comm. and synch. primitives that must be supported State of most real parallel programming today

18 KICS, UET Cavium Univ Program © Grid Solver Example Simplified version of solver in Ocean simulation Gauss-Seidel (near-neighbor) sweeps to convergence interior n-by-n points of (n+2)-by-(n+2) updated in each sweep updates done in-place in grid, and diff. from prev. value computed accumulate partial diffs into global diff at end of every sweep check if error has converged (to within a tolerance parameter) if so, exit solver; if not, do another sweep
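
A sequential version of the solver described above might look like the sketch below. The slide only says "near-neighbor"; the five-point average with weight 0.2, the flat row-major storage, and the names `AT` and `solve` are assumptions made for illustration.

```c
#include <assert.h>

/* Index into a flat, row-major (n+2)-by-(n+2) grid. */
#define AT(A, n, i, j) ((A)[(i) * ((n) + 2) + (j)])

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/*
 * Sweeps the n-by-n interior until the mean per-point change in a sweep
 * drops below tol. Returns the number of sweeps performed.
 */
int solve(double *A, int n, double tol) {
    assert(n > 0 && tol > 0.0);
    int sweeps = 0;
    for (;;) {
        double diff = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                double old = AT(A, n, i, j);
                /* In-place update: the upper and left neighbors already
                 * hold this sweep's values (Gauss-Seidel). */
                double updated = 0.2 * (old + AT(A, n, i, j - 1)
                                            + AT(A, n, i - 1, j)
                                            + AT(A, n, i, j + 1)
                                            + AT(A, n, i + 1, j));
                AT(A, n, i, j) = updated;
                diff += abs_d(updated - old);
            }
        }
        sweeps++;
        if (diff / ((double)n * n) < tol) return sweeps;  /* converged */
    }
}
```

With a boundary fixed at 1.0 and a zero interior, repeated sweeps pull every interior point toward 1.0.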

19 KICS, UET Cavium Univ Program ©

20 KICS, UET Cavium Univ Program © Decomposition Simple way to identify concurrency is to look at loop iterations dependence analysis; if not enough concurrency, then look further Not much concurrency here at this level (all loops sequential) Examine fundamental dependences, ignoring loop structure Concurrency O(n) along anti-diagonals, serialization O(n) along diag. Retain loop structure, use pt-to-pt synch; Problem: too many synch ops. Restructure loops, use global synch; imbalance and too much synch

21 KICS, UET Cavium Univ Program © Exploit Application Knowledge Reorder grid traversal: red-black ordering Different ordering of updates: may converge quicker or slower Red sweep and black sweep are each fully parallel: Global synch between them (conservative but convenient) Ocean uses red-black; we use simpler, asynchronous one to illustrate no red-black, simply ignore dependences within sweep sequential order same as original, parallel program nondeterministic

22 KICS, UET Cavium Univ Program © Decomposition Only Decomposition into elements: degree of concurrency $n^2$. To decompose into rows, make line 18 loop sequential; degree $n$. for_all leaves assignment to system, but implicit global synch. at end of for_all loop.

23 KICS, UET Cavium Univ Program © Assignment Static assignments (given decomposition into rows): block assignment of rows: row $i$ is assigned to process $\lfloor i/p \rfloor$; cyclic assignment of rows: process $i$ is assigned rows $i$, $i+p$, and so on. Dynamic assignment: get a row index, work on the row, get a new row, and so on. Static assignment into rows reduces concurrency (from $n$ to $p$); block assign. reduces communication by keeping adjacent rows together. Let’s dig into orchestration under three programming models.

24 KICS, UET Cavium Univ Program © Data Parallel Solver

25 KICS, UET Cavium Univ Program © Shared Address Space Solver Assignment controlled by values of variables used as loop bounds Single Program Multiple Data (SPMD)
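
The idea that assignment is "controlled by values of variables used as loop bounds" can be sketched as follows. The names `my_rows`, `mymin` and `mymax` are hypothetical, and the sketch assumes that the number of processes divides the number of rows evenly.

```c
#include <assert.h>

/*
 * Rows 1..n are split into nprocs contiguous blocks; pid runs 0..nprocs-1.
 * Every process executes the same code (SPMD); only pid differs, and
 * that alone decides which rows the process sweeps.
 * Assumption of this sketch: n is a multiple of nprocs.
 */
void my_rows(int pid, int nprocs, int n, int *mymin, int *mymax) {
    assert(nprocs > 0 && pid >= 0 && pid < nprocs && n % nprocs == 0);
    int rows_per_proc = n / nprocs;
    *mymin = 1 + pid * rows_per_proc;
    *mymax = *mymin + rows_per_proc - 1;
}
```

Each process would then run its sweep as `for (i = mymin; i <= mymax; i++)` over its own rows. With n = 8 and 4 processes, process 0 gets rows 1 to 2 and process 3 gets rows 7 to 8.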

26 KICS, UET Cavium Univ Program ©

27 KICS, UET Cavium Univ Program © Notes on SAS Program SPMD: not lockstep or even necessarily same instructions. Assignment controlled by values of variables used as loop bounds; unique pid per process, used to control assignment. Done condition evaluated redundantly by all. Code that does the update identical to sequential program; each process has private mydiff variable. Most interesting special operations are for synchronization: accumulations into shared diff have to be mutually exclusive; why the need for all the barriers?

28 KICS, UET Cavium Univ Program © Need for Mutual Exclusion
Code each process executes:
  load the value of diff into register r1
  add the register r2 to register r1
  store the value of register r1 into diff
A possible interleaving:
  P1                                      P2
  r1 <- diff    {P1 gets 0 in its r1}
                                          r1 <- diff    {P2 also gets 0}
  r1 <- r1+r2   {P1 sets its r1 to 1}
                                          r1 <- r1+r2   {P2 sets its r1 to 1}
  diff <- r1    {P1 sets diff to 1}
                                          diff <- r1    {P2 also sets diff to 1}
Need the sets of operations to be atomic (mutually exclusive)
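
One way to make the load/add/store sequence atomic with POSIX threads is a mutex around the update. This is a minimal sketch with hypothetical names (`accumulate`, `worker`, `diff_lock`); each thread simply adds 1.0 repeatedly so that a lost update would show up in the final value.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

static double diff;  /* shared accumulator */
static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int updates = *(int *)arg;
    for (int k = 0; k < updates; k++) {
        pthread_mutex_lock(&diff_lock);
        diff += 1.0;  /* load, add, store happen as one unit */
        pthread_mutex_unlock(&diff_lock);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final value of diff. */
double accumulate(int nthreads, int updates) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    diff = 0.0;
    for (int t = 0; t < nthreads; t++) {
        int rc = pthread_create(&tid[t], NULL, worker, &updates);
        assert(rc == 0);
    }
    for (int t = 0; t < nthreads; t++) pthread_join(tid[t], NULL);
    return diff;
}
```

In the solver itself each process would add its private mydiff to diff once per sweep, which keeps the critical section short and infrequent.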

29 KICS, UET Cavium Univ Program © Global Event Synchronization
BARRIER(nprocs): wait here till nprocs processes get here
  Built using lower level primitives
  Global sum example: wait for all to accumulate before using sum
  Often used to separate phases of computation
  Process P_1              Process P_2              Process P_nprocs
  set up eqn system        set up eqn system        set up eqn system
  Barrier (name, nprocs)   Barrier (name, nprocs)   Barrier (name, nprocs)
  solve eqn system         solve eqn system         solve eqn system
  Barrier (name, nprocs)   Barrier (name, nprocs)   Barrier (name, nprocs)
  apply results            apply results            apply results
  Barrier (name, nprocs)   Barrier (name, nprocs)   Barrier (name, nprocs)
  Conservative form of preserving dependences, but easy to use
WAIT_FOR_END (nprocs-1)
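
The phase separation in the table can be sketched with `pthread_barrier_wait`. POSIX barriers are an optional feature and are missing on some platforms, so this is an illustration under that assumption; the function and variable names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NPROCS 4

static pthread_barrier_t phase_barrier;
static int partial[NPROCS];     /* phase 1 output, one slot per thread */
static int total_seen[NPROCS];  /* phase 2 output */

static void *phases(void *arg) {
    int pid = *(int *)arg;
    partial[pid] = pid + 1;                /* phase 1: produce */
    pthread_barrier_wait(&phase_barrier);  /* wait until all NPROCS arrive */
    int sum = 0;                           /* phase 2: read everyone's output */
    for (int q = 0; q < NPROCS; q++) sum += partial[q];
    total_seen[pid] = sum;
    return NULL;
}

/* Returns 1 if every thread saw the complete sum in phase 2. */
int run_two_phases(void) {
    pthread_t tid[NPROCS];
    int ids[NPROCS];
    int rc = pthread_barrier_init(&phase_barrier, NULL, NPROCS);
    assert(rc == 0);
    for (int p = 0; p < NPROCS; p++) {
        ids[p] = p;
        rc = pthread_create(&tid[p], NULL, phases, &ids[p]);
        assert(rc == 0);
    }
    for (int p = 0; p < NPROCS; p++) pthread_join(tid[p], NULL);
    pthread_barrier_destroy(&phase_barrier);
    int expected = NPROCS * (NPROCS + 1) / 2;
    for (int p = 0; p < NPROCS; p++)
        if (total_seen[p] != expected) return 0;
    return 1;
}
```

Without the barrier, a fast thread could start summing before a slower thread has written its slot.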

30 KICS, UET Cavium Univ Program © Pt-to-pt Event Synch (Not Used Here) One process notifies another of an event so it can proceed Common example: producer-consumer (bounded buffer) Concurrent programming on uniprocessor: semaphores Shared address space parallel programs: semaphores, or use ordinary variables as flags Busy-waiting or spinning

31 KICS, UET Cavium Univ Program © Group Event Synchronization Subset of processes involved Can use flags or barriers (involving only the subset) Concept of producers and consumers Major types: Single-producer, multiple-consumer Multiple-producer, single-consumer

32 KICS, UET Cavium Univ Program © Message Passing Grid Solver Cannot declare A to be shared array any more Need to compose it logically from per-process private arrays usually allocated in accordance with the assignment of work process assigned a set of rows allocates them locally Transfers of entire rows between traversals Structurally similar to SAS (e.g. SPMD), but orchestration different data structures and data access/naming communication synchronization

33 KICS, UET Cavium Univ Program ©

34 KICS, UET Cavium Univ Program © Notes on Message Passing Program Use of ghost rows Receive does not transfer data, send does unlike SAS which is usually receiver-initiated (load fetches data) Communication done at beginning of iteration, so no asynchrony Communication in whole rows, not element at a time Core similar, but indices/bounds in local rather than global space Synchronization through sends and receives Update of global diff and event synch for done condition Could implement locks and barriers with messages Can use REDUCE and BROADCAST library calls to simplify code

35 KICS, UET Cavium Univ Program © Send and Receive Alternatives Can extend functionality: stride, scatter-gather, groups. Semantic flavors: based on when control is returned. Affect when data structures or buffers can be reused at either end. Affect event synch (mutual excl. by fiat: only one process touches data). Affect ease of programming and performance. Synchronous messages provide built-in synch. through match. Separate event synchronization needed with asynch. messages. With synch. messages, our code is deadlocked. Fix? Taxonomy: Send/Receive is either Synchronous or Asynchronous; Asynchronous is either Blocking asynch. or Nonblocking asynch.

36 KICS, UET Cavium Univ Program © Orchestration: Summary Shared address space Shared and private data explicitly separate Communication implicit in access patterns No correctness need for data distribution Synchronization via atomic operations on shared data Synchronization explicit and distinct from data communication Message passing Data distribution among local address spaces needed No explicit shared structures (implicit in comm. patterns) Communication is explicit Synchronization implicit in communication (at least in synch. case) mutual exclusion by fiat

37 KICS, UET Cavium Univ Program © Correctness in Grid Solver Program Decomposition and Assignment similar in SAS and message-passing Orchestration is different Data structures, data access/naming, communication, synchronization

38 Programming for Performance Chapter 3 David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

39 KICS, UET Cavium Univ Program © Outline Programming techniques for performance Partitioning for performance Relationship of communication, data locality and architecture Programming for performance For each issue: Techniques to address it, and tradeoffs with previous issues Application to grid solver Some architectural implications Components of execution time as seen by processor What workload looks like to architecture, and relate to software issues Implications for programming models

40 KICS, UET Cavium Univ Program © Partitioning for Performance Balancing the workload and reducing wait time at synch points Reducing inherent communication Reducing extra work Even these algorithmic issues trade off: Minimize comm. => run on 1 processor => extreme load imbalance Maximize load balance => random assignment of tiny tasks => no control over communication Good partition may imply extra work to compute or manage it Goal is to compromise Fortunately, often not difficult in practice

41 KICS, UET Cavium Univ Program © Load Balance and Synch Wait Time Limit on speedup: $\text{Speedup}_{problem}(p) < \dfrac{\text{Sequential Work}}{\text{Max Work on any Processor}}$. Work includes data access and other costs. Not just equal work, but must be busy at same time. Four parts to load balance and reducing synch wait time: 1. Identify enough concurrency 2. Decide how to manage it 3. Determine the granularity at which to exploit it 4. Reduce serialization and cost of synchronization

42 KICS, UET Cavium Univ Program © Identifying Concurrency Techniques seen for equation solver: Loop structure, fundamental dependences, new algorithms Data Parallelism versus Function Parallelism Often see orthogonal levels of parallelism; e.g. VLSI routing

43 KICS, UET Cavium Univ Program © Identifying Concurrency (Cont’d) Function parallelism: entire large tasks (procedures) that can be done in parallel on same or different data e.g. different independent grid computations in Ocean pipelining, as in video encoding/decoding, or polygon rendering degree usually modest and does not grow with input size difficult to load balance often used to reduce synch between data parallel phases Most scalable programs data parallel (per this loose definition) function parallelism reduces synch between data parallel phases

44 KICS, UET Cavium Univ Program © Deciding How to Manage Concurrency Static versus Dynamic techniques Static: Algorithmic assignment based on input; won’t change Low runtime overhead Computation must be predictable Preferable when applicable (except in multiprogrammed/heterogeneous environment) Dynamic: Adapt at runtime to balance load Can increase communication and reduce locality Can increase task management overheads

45 KICS, UET Cavium Univ Program © Dynamic Assignment Profile-based (semi-static): Profile work distribution at runtime, and repartition dynamically Applicable in many computations, e.g. Barnes-Hut, some graphics Dynamic Tasking: Deal with unpredictability in program or environment (e.g. Raytrace) computation, communication, and memory system interactions multiprogramming and heterogeneity used by runtime systems and OS too Pool of tasks; take and add tasks until done E.g. “self-scheduling” of loop iterations (shared loop counter)
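
Self-scheduling through a shared loop counter can be sketched with a mutex-protected counter. The names, the fixed iteration count, and the stand-in loop body are assumptions for illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NITER 1000
#define MAX_THREADS 16

static int next_iter;  /* shared loop counter */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int times_done[NITER];

static void *self_schedule(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&counter_lock);
        int i = next_iter++;  /* take the next unclaimed iteration */
        pthread_mutex_unlock(&counter_lock);
        if (i >= NITER) break;  /* pool is empty */
        times_done[i]++;        /* stand-in for the real loop body */
    }
    return NULL;
}

/* Returns 1 if every iteration ran exactly once. */
int run_self_scheduled(int nthreads) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    next_iter = 0;
    for (int i = 0; i < NITER; i++) times_done[i] = 0;
    for (int t = 0; t < nthreads; t++) {
        int rc = pthread_create(&tid[t], NULL, self_schedule, NULL);
        assert(rc == 0);
    }
    for (int t = 0; t < nthreads; t++) pthread_join(tid[t], NULL);
    for (int i = 0; i < NITER; i++)
        if (times_done[i] != 1) return 0;
    return 1;
}
```

Threads that finish early simply come back for more work, which balances load at the price of one lock acquisition per iteration. Handing out chunks of iterations instead of single ones would reduce that overhead.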

46 KICS, UET Cavium Univ Program © Dynamic Tasking with Task Queues Centralized versus distributed queues Task stealing with distributed queues Can compromise comm and locality, and increase synchronization Whom to steal from, how many tasks to steal,... Termination detection Maximum imbalance related to size of task

47 KICS, UET Cavium Univ Program © Determining Task Granularity Task granularity: amount of work associated with a task General rule: Coarse-grained => often less load balance Fine-grained => more overhead; often more comm., contention Comm., contention actually affected by assignment, not size Overhead by size itself too, particularly with task queues

48 KICS, UET Cavium Univ Program © Reducing Serialization Careful about assignment and orchestration (including scheduling) Event synchronization Reduce use of conservative synchronization e.g. point-to-point instead of barriers, or granularity of pt-to-pt But fine-grained synch more difficult to program, more synch ops. Mutual exclusion Separate locks for separate data e.g. locking records in a database: lock per process, record, or field lock per task in task queue, not per queue finer grain => less contention/serialization, more space, less reuse Smaller, less frequent critical sections don’t do reading/testing in critical section, only modification e.g. searching for task to dequeue in task queue, building tree Stagger critical sections in time

49 KICS, UET Cavium Univ Program © Reducing Inherent Communication Communication is expensive! Measure: communication to computation ratio Focus here on inherent communication Determined by assignment of tasks to processes Later see that actual communication can be greater Assign tasks that access same data to same process Solving communication and load balance NP-hard in general case But simple heuristic solutions work well in practice Applications have structure!

50 KICS, UET Cavium Univ Program © Domain Decomposition Works well for scientific, engineering, graphics,... applications Exploits local-biased nature of physical problems Information requirements often short-range Or long-range but fall off with distance Simple example: nearest-neighbor grid computation Perimeter to Area comm-to-comp ratio (area to volume in 3-d) Depends on n,p: decreases with n, increases with p

51 KICS, UET Cavium Univ Program © Domain Decomposition (Cont’d) Best domain decomposition depends on information requirements. Nearest neighbor example: block versus strip decomposition. Comm to comp: $\dfrac{4\sqrt{p}}{n}$ for block, $\dfrac{2p}{n}$ for strip. Retain block from here on. Application dependent: strip may be better in other cases, e.g. particle flow in tunnel.
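
The two ratios follow from perimeter over area, as this small sketch shows. The function names are hypothetical, and the block version takes the square root of p as its argument so that no math library is needed (p is assumed to be a perfect square).

```c
#include <assert.h>

/*
 * Block decomposition: each process owns a square block of side n/sqrt_p.
 * Communication ~ perimeter (4 * side), computation ~ area (side^2),
 * which simplifies to 4*sqrt(p)/n.
 */
double block_ratio(double n, double sqrt_p) {
    assert(n > 0.0 && sqrt_p > 0.0);
    double side = n / sqrt_p;
    return 4.0 * side / (side * side);
}

/*
 * Strip decomposition: each process owns n/p full rows.
 * Communication ~ two boundary rows (2 * n), computation ~ n * n / p,
 * which simplifies to 2*p/n.
 */
double strip_ratio(double n, double p) {
    assert(n > 0.0 && p > 0.0);
    return 2.0 * n / (n * n / p);
}
```

For n = 1024 and p = 16 the block ratio is 0.015625 and the strip ratio is 0.03125, so blocks communicate half as much per unit of computation in this case.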

52 KICS, UET Cavium Univ Program © Finding a Domain Decomposition Static, by inspection Must be predictable: grid example Static, but not by inspection Input-dependent, require analyzing input structure E.g sparse matrix computations, data mining (assigning itemsets) Semi-static (periodic repartitioning) Characteristics change but slowly; e.g. Barnes-Hut Static or semi-static, with dynamic task stealing Initial decomposition, but highly unpredictable; e.g ray tracing

53 KICS, UET Cavium Univ Program © Other Techniques Preserve locality in task stealing Steal large tasks for locality, steal from same queues,... Scatter Decomposition, e.g. initial partition in Raytrace

54 KICS, UET Cavium Univ Program © Implications of Comm-to-Comp Ratio Architects examine application needs to see where to spend money. If denominator is execution time, ratio gives average BW needs. If operation count, gives extremes in impact of latency and bandwidth. Latency: assume no latency hiding. Bandwidth: assume all latency hidden. Reality is somewhere in between. Actual impact of comm. depends on structure and cost as well. Need to keep communication balanced across processors as well. $\text{Speedup} < \dfrac{\text{Sequential Work}}{\text{Max}\,(\text{Work} + \text{Synch Wait Time} + \text{Comm Cost})}$

55 KICS, UET Cavium Univ Program © Reducing Extra Work Common sources of extra work: Computing a good partition, e.g. partitioning in Barnes-Hut or sparse matrix. Using redundant computation to avoid communication. Task, data and process management overhead: applications, languages, runtime systems, OS. Imposing structure on communication: coalescing messages, allowing effective naming. Architectural Implications: Reduce need by making communication and orchestration efficient. $\text{Speedup} < \dfrac{\text{Sequential Work}}{\text{Max}\,(\text{Work} + \text{Synch Wait Time} + \text{Comm Cost} + \text{Extra Work})}$
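
The speedup bound can be evaluated from per-processor time components, as in the sketch below. The struct layout and names are hypothetical, and all four components are assumed to be measured in the same time unit.

```c
#include <assert.h>
#include <stddef.h>

/* Time components for one processor (same unit for all fields). */
typedef struct {
    double work;
    double synch_wait;
    double comm_cost;
    double extra_work;
} proc_time;

/* Sequential work divided by the largest per-processor total. */
double speedup_bound(double sequential_work, const proc_time *t, size_t p) {
    assert(p > 0);
    double worst = 0.0;
    for (size_t i = 0; i < p; i++) {
        double total = t[i].work + t[i].synch_wait
                     + t[i].comm_cost + t[i].extra_work;
        if (total > worst) worst = total;  /* slowest processor sets the pace */
    }
    assert(worst > 0.0);
    return sequential_work / worst;
}
```

With sequential work 128 and two processors whose totals are 50 and 64, the bound is 128/64 = 2: the slower processor alone determines it.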

56 KICS, UET Cavium Univ Program © Memory-oriented View of Performance Multiprocessor as Extended Memory Hierarchy as seen by a given processor Levels in extended hierarchy: Registers, caches, local memory, remote memory (topology) Glued together by communication architecture Levels communicate at a certain granularity of data transfer Need to exploit spatial and temporal locality in hierarchy Otherwise extra communication may also be caused Especially important since communication is expensive

57 KICS, UET Cavium Univ Program © Uniprocessor Optimization Performance depends heavily on memory hierarchy Time spent by a program Time prog (1) = Busy(1) + Data Access(1) Divide by cycles to get CPI equation Data access time can be reduced by: Optimizing machine: bigger caches, lower latency... Optimizing program: temporal and spatial locality

58 KICS, UET Cavium Univ Program © Extended Hierarchy Idealized view: local cache hierarchy + single main memory But reality is more complex Centralized Memory: caches of other processors Distributed Memory: some local, some remote; + network topology Management of levels caches managed by hardware main memory depends on programming model SAS: data movement between local and remote transparent message passing: explicit Levels closer to processor are lower latency and higher bandwidth Improve performance through architecture or program locality Tradeoff with parallelism; need good node performance and parallelism

59 KICS, UET Cavium Univ Program © Artifactual Comm. in Extended Hierarchy Accesses not satisfied in local portion cause communication Inherent communication, implicit or explicit, causes transfers determined by program Artifactual communication determined by program implementation and arch. interactions poor allocation of data across distributed memories unnecessary data in a transfer unnecessary transfers due to system granularities redundant communication of data finite replication capacity (in cache or main memory) Inherent communication assumes unlimited capacity, small transfers, perfect knowledge of what is needed. More on artifactual later; first consider replication-induced further

60 KICS, UET Cavium Univ Program © Communication and Replication Comm induced by finite capacity is most fundamental artifact Like cache size and miss rate or memory traffic in uniprocessors Extended memory hierarchy view useful for this relationship View as three level hierarchy for simplicity Local cache, local memory, remote memory (ignore network topology) Classify “misses” in “cache” at any level as for uniprocessors compulsory or cold misses (no size effect) capacity misses (yes) conflict or collision misses (yes) communication or coherence misses (no) Each may be helped/hurt by large transfer granularity (spatial locality)

61 KICS, UET Cavium Univ Program © Orchestration for Performance Reducing amount of communication: Inherent: change logical data sharing patterns in algorithm Artifactual: exploit spatial, temporal locality in extended hierarchy Techniques often similar to those on uniprocessors Structuring communication to reduce cost Let’s examine techniques for both...

62 KICS, UET Cavium Univ Program © Reducing Artifactual Communication Message passing model Communication and replication are both explicit Even artifactual communication is in explicit messages Shared address space model More interesting from an architectural perspective Occurs transparently due to interactions of program and system sizes and granularities in extended memory hierarchy Use shared address space to illustrate issues

63 KICS, UET Cavium Univ Program © Exploiting Temporal Locality Structure algorithm so working sets map well to hierarchy: often techniques to reduce inherent communication do well here; schedule tasks for data reuse once assigned. Multiple data structures in same phase, e.g. database records: local versus remote. Solver example: blocking. More useful when $O(n^{k+1})$ computation on $O(n^k)$ data, as in many linear algebra computations (factorization, matrix multiply).

64 KICS, UET Cavium Univ Program © Exploiting Spatial Locality Besides capacity, granularities are important: Granularity of allocation Granularity of communication or data transfer Granularity of coherence Major spatial-related causes of artifactual communication: Conflict misses Data distribution/layout (allocation granularity) Fragmentation (communication granularity) False sharing of data (coherence granularity) All depend on how spatial access patterns interact with data structures Fix problems by modifying data structures, or layout/alignment Examine later in context of architectures one simple example here: data distribution in SAS solver

65 KICS, UET Cavium Univ Program © Spatial Locality Example Repeated sweeps over 2-d grid, each time adding 1 to elements Natural 2-d versus higher-dimensional array representation
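
The difference between the two array representations comes down to how a grid coordinate maps to a memory offset. The sketch below is an illustration with hypothetical function names; it assumes square blocks of side b and a grid size n that is a multiple of b.

```c
#include <assert.h>
#include <stddef.h>

/* Natural 2-d layout: row-major over the whole n-by-n grid. */
size_t offset_2d(size_t i, size_t j, size_t n) { return i * n + j; }

/*
 * 4-d layout: the grid is cut into b-by-b blocks and each block is stored
 * contiguously, as in A[block_row][block_col][i_in_block][j_in_block].
 */
size_t offset_4d(size_t i, size_t j, size_t n, size_t b) {
    assert(b > 0 && n % b == 0 && i < n && j < n);
    size_t blocks_per_row = n / b;
    size_t block = (i / b) * blocks_per_row + (j / b);
    return (block * b + i % b) * b + j % b;
}
```

For n = 8 and b = 4, the block covering rows 4 to 7 and columns 0 to 3 occupies offsets 32 to 47 in the 4-d layout, a single contiguous run of 16 elements. In the 2-d layout the same block is spread over offsets 32 to 59, interleaved with a neighbor's data, which is what makes page-granularity allocation and cache-line sharing awkward.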

66 KICS, UET Cavium Univ Program © Tradeoffs with Inherent Communication Partitioning grid solver: blocks versus rows Blocks still have a spatial locality problem on remote data Rowwise can perform better despite worse inherent c-to-c ratio

67 KICS, UET Cavium Univ Program © Example Performance Impact Equation solver on SGI Origin2000

68 KICS, UET Cavium Univ Program © Structuring Communication
Given amount of comm (inherent or artifactual), goal is to reduce cost.
Cost of communication as seen by process: $C = f \times \left( o + l + \dfrac{n_c/m}{B} + t_c - \text{overlap} \right)$
  $f$ = frequency of messages
  $o$ = overhead per message (at both ends)
  $l$ = network delay per message
  $n_c$ = total data sent
  $m$ = number of messages
  $B$ = bandwidth along path (determined by network, NI, assist)
  $t_c$ = cost induced by contention per message
  overlap = amount of latency hidden by overlap with comp. or comm.
Portion in parentheses is cost of a message (as seen by processor).
That portion, ignoring overlap, is latency of a message.
Goal: reduce terms in latency and increase overlap.
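
The cost expression can be evaluated directly, which makes the effect of coalescing messages easy to see. The struct and function names below are hypothetical, and the numbers in the note are made-up units chosen only to keep the arithmetic simple.

```c
#include <assert.h>

typedef struct {
    double f;        /* frequency of messages */
    double o;        /* overhead per message (both ends) */
    double l;        /* network delay per message */
    double n_c;      /* total data sent */
    double m;        /* number of messages */
    double B;        /* bandwidth along path */
    double t_c;      /* contention cost per message */
    double overlap;  /* latency hidden by overlap */
} comm_params;

/* C = f * (o + l + (n_c / m) / B + t_c - overlap) */
double comm_cost(const comm_params *c) {
    assert(c->m > 0.0 && c->B > 0.0);
    double per_message = c->o + c->l + (c->n_c / c->m) / c->B
                       + c->t_c - c->overlap;
    return c->f * per_message;
}
```

Sending 1024 units of data as 8 messages (o = 2, l = 4, B = 64, t_c = 1, overlap = 3) costs 8 × 6 = 48. Coalescing the same data into 2 messages costs 2 × 12 = 24, because the fixed per-message terms are paid fewer times.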

69 KICS, UET Cavium Univ Program © Reducing Overhead Can reduce no. of messages m or overhead per message o o is usually determined by hardware or system software Program should try to reduce m by coalescing messages More control when communication is explicit Coalescing data into larger messages: Easy for regular, coarse-grained communication Can be difficult for irregular, naturally fine-grained communication may require changes to algorithm and extra work coalescing data and determining what and to whom to send

70 KICS, UET Cavium Univ Program © Reducing Contention All resources have nonzero occupancy Memory, communication controller, network link, etc. Can only handle so many transactions per unit time Effects of contention: Increased end-to-end cost for messages Reduced available bandwidth for individual messages Causes imbalances across processors Particularly insidious performance problem Easy to ignore when programming Slow down messages that don’t even need that resource by causing other dependent resources to also congest Effect can be devastating: Don’t flood a resource!

71 KICS, UET Cavium Univ Program © Overlapping Communication Cannot afford to stall for high latencies even on uniprocessors! Overlap with computation or communication to hide latency Requires extra concurrency (slackness), higher bandwidth Techniques: Prefetching Block data transfer Proceeding past communication Multithreading

72 KICS, UET Cavium Univ Program © Summary of Tradeoffs Different goals often have conflicting demands Load Balance fine-grain tasks random or dynamic assignment Communication usually coarse grain tasks decompose to obtain locality: not random/dynamic Extra Work coarse grain tasks simple assignment Communication Cost: big transfers: amortize overhead and latency small transfers: reduce contention

73 KICS, UET Cavium Univ Program © Relationship between Perspectives

74 KICS, UET Cavium Univ Program © Summary Speedup ratio: $\dfrac{\text{Busy}(1) + \text{Data}(1)}{\text{Busy}_{useful}(p) + \text{Data}_{local}(p) + \text{Synch}(p) + \text{Data}_{remote}(p) + \text{Busy}_{overhead}(p)}$. Goal is to reduce denominator components. Both programmer and system have role to play. Architecture cannot do much about load imbalance or too much communication. But it can: reduce incentive for creating ill-behaved programs (efficient naming, communication and synchronization); reduce artifactual communication; provide efficient naming for flexible assignment; allow effective overlapping of communication.

75 Multi-Threading Parallel Programming on Shared Memory Multiprocessors Using PThread Chapter 2 Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006

76 KICS, UET Cavium Univ Program © Outline of Multi-Threading Topics Threads Terminology OS level view Hardware level threads Threading as a parallel programming model Types of thread level parallel programs Implementation issues

77 KICS, UET Cavium Univ Program © Threads Definition A discrete sequence of related instructions Executed independently of other such sequences Every program has at least one thread Initializes Executes instructions May create other threads Each thread maintains its current state OS maps a thread to hardware resources

78 KICS, UET Cavium Univ Program © System View of Threads Thread computational model layers: User level threads Kernel level threads Hardware threads

79 KICS, UET Cavium Univ Program © Flow of Threads in an Execution Environment Defining and preparing stage Operating stage Created and managed by the OS Execution stage

80 KICS, UET Cavium Univ Program © Threads Inside the OS

81 KICS, UET Cavium Univ Program © Processors, Processes, and Threads A processor runs threads from one or more processes, each of which contains one or more threads

82 KICS, UET Cavium Univ Program © Mapping Models of Threads to Processors: 1:1 Mapping

83 KICS, UET Cavium Univ Program © Mapping Models of Threads to Processors: M:1 Mapping

84 KICS, UET Cavium Univ Program © Mapping Models of Threads to Processors: M:N Mapping

85 KICS, UET Cavium Univ Program © Threads Inside the Hardware

86 KICS, UET Cavium Univ Program © Thread Creation Multiple threads inside a process Share same address space, FDs, etc. Operate independently Need their own stack space Who handles thread creation details Not the programmer Typically handled at system level OS support for threads Threading libraries Same is true for thread management

87 KICS, UET Cavium Univ Program © Stack Layout for a Multi-Threaded Process

88 KICS, UET Cavium Univ Program © Thread State Diagram

89 KICS, UET Cavium Univ Program © Thread Implementation Often implemented as a thread package Operations to create and destroy threads Synchronization mechanisms Approaches to implement a thread package: Implement as a thread library to execute entirely in user mode Have the kernel be aware of threads and schedule them

90 KICS, UET Cavium Univ Program © Thread Implementation (2) Characteristics of a user level thread library Cheap to create and destroy threads Switching thread context can be done in just a few instructions Need to save and restore CPU registers only No need to change memory maps, flush TLB, CPU accounting, etc. Drawback: a blocking system call will block all threads in a process Solution to blocking: implement thread in OS kernel

91 KICS, UET Cavium Univ Program © Kernel Implementations of Threads High price to solve blocking problem Every thread operation will require a system call Thread creation Thread deletion Thread synchronization Thread switching will now become as expensive as process context switching

92 KICS, UET Cavium Univ Program © Kernel Implementations of Threads (2) Lightweight processes (LWP) A hybrid form of user and kernel level threads An LWP runs in the context of a (heavy-weight) process There can be several LWPs, each with its own scheduler and stack System also offers a user level thread package for usual operations (creation, deletion, and synchronization) Assignment of a user level thread to LWP is hidden from programmer LWP handles the scheduling for multiple threads

93 KICS, UET Cavium Univ Program © LWP Implementation Thread table is shared among LWPs Protected through mutexes, so no kernel intervention for LWP synch. When an LWP finds a runnable thread, it switches context to that thread; this is done entirely in user space When a thread makes a blocking system call: OS might block one LWP May switch to another LWP, which will allow other threads to continue

94 Parallel Programming with Threads Overview of POSIX threads, data races and types of synchronization

95 KICS, UET Cavium Univ Program © Shared Memory Programming Several Thread Libraries PTHREADS is the POSIX Standard Solaris threads are very similar Relatively low level Portable but possibly slow OpenMP is newer standard Support for scientific programming on shared memory Multiple other efforts by specific vendors

96 KICS, UET Cavium Univ Program © Overview of POSIX Threads POSIX: Portable Operating System Interface for UNIX Interface to Operating System utilities PThreads: The POSIX threading interface System calls to create and synchronize threads Should be relatively uniform across UNIX-like OS platforms PThreads contain support for Creating parallelism Synchronizing No explicit support for communication, because shared memory is implicit; a pointer to shared data is passed to a thread

97 KICS, UET Cavium Univ Program © POSIX Thread Creation Signature: int pthread_create(pthread_t *, const pthread_attr_t *, void * (*)(void *), void *); Example call: errorcode = pthread_create(&thread_id, &thread_attribute, &thread_fun, &fun_arg);

98 KICS, UET Cavium Univ Program © POSIX Thread Creation (2) thread_id is the thread id or handle (used to halt, etc.) thread_attribute various attributes standard default values obtained by passing a NULL pointer thread_fun the function to be run (takes and returns void*) fun_arg an argument can be passed to thread_fun when it starts errorcode will be set nonzero if the create operation fails

99 KICS, UET Cavium Univ Program © Simple Threading Example void* SayHello(void *foo) { printf( "Hello, world!\n" ); return NULL; } int main() { pthread_t threads[16]; int tn; for(tn=0; tn<16; tn++) { pthread_create(&threads[tn], NULL, SayHello, NULL); } for(tn=0; tn<16 ; tn++) { pthread_join(threads[tn], NULL); } return 0; } Compile using gcc -lpthread

100 KICS, UET Cavium Univ Program © Loop Level Parallelism Many scientific applications have parallelism in loops With threads: … my_stuff [n][n]; for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) … pthread_create (update_cell, …, my_stuff[i][j]); But the overhead of thread creation is nontrivial, so update_cell should have a significant amount of work: 1/p-th of the total if possible Each thread also needs its i and j

101 KICS, UET Cavium Univ Program © Shared Data and Threads Variables declared outside of main are shared Object allocated on the heap may be shared (if pointer is passed) Variables on the stack are private: passing pointer to these around to other threads can cause problems

102 KICS, UET Cavium Univ Program © Shared Data and Threads (2) Often done by creating a large “thread data” struct Passed into all threads as argument Simple example (print_fun must have the signature void *print_fun(void *)): char *message = "Hello World!\n"; pthread_create(&thread1, NULL, print_fun, (void *)message);

103 KICS, UET Cavium Univ Program © Setting Attribute Values Once an initialized attribute object exists, changes can be made. For example: To change the stack size for a thread to 8192 (before calling pthread_create), do this: pthread_attr_setstacksize(&my_attributes, (size_t)8192); To get the stack size, do this: size_t my_stack_size; pthread_attr_getstacksize(&my_attributes, &my_stack_size); Slide Source: Theewara Vorakosit

104 KICS, UET Cavium Univ Program © Other Attributes Other attributes: Detached state – set if no other thread will use pthread_join to wait for this thread (improves efficiency) Scheduling parameter(s) – in particular, thread priority Scheduling policy – FIFO or Round Robin Contention scope – with what threads does this thread compete for a CPU Stack address – explicitly dictate where the stack is located Lazy stack allocation – allocate on demand (lazy) or all at once, “up front”

105 KICS, UET Cavium Univ Program © Data Race Example static int s = 0; Thread 1: for i = 0, n/2-1: s = s + f(A[i]) Thread 2: for i = n/2, n-1: s = s + f(A[i]) Problem is a race condition on variable s in the program A race condition or data race occurs when: two processors (or two threads) access the same variable, and at least one does a write. The accesses are concurrent (not synchronized) so they could happen simultaneously

106 KICS, UET Cavium Univ Program © Basic Types of Synchronization: Barrier Barrier—global synchronization Especially common when running multiple copies of the same function in parallel SPMD “Single Program Multiple Data” simple use of barriers -- all threads hit the same one work_on_my_subgrid(); barrier; read_neighboring_values(); barrier;

107 KICS, UET Cavium Univ Program © Barrier (2) More complicated—barriers on branches (or loops) if (tid % 2 == 0) { work1(); barrier } else { barrier } Barriers are not provided in all thread libraries

108 KICS, UET Cavium Univ Program © Creating and Initializing a Barrier To (dynamically) initialize a barrier, use code similar to this (which sets the number of threads to 3): pthread_barrier_t b; pthread_barrier_init(&b,NULL,3); The second argument specifies an object attribute; using NULL yields the default attributes.

109 KICS, UET Cavium Univ Program © Creating and Initializing a Barrier (2) To wait at a barrier, a thread executes: pthread_barrier_wait(&b); POSIX defines no static initializer macro for barriers (unlike PTHREAD_MUTEX_INITIALIZER for mutexes), so a barrier must be initialized with pthread_barrier_init

110 KICS, UET Cavium Univ Program © Basic Types of Synchronization: Mutexes Mutexes—mutual exclusion aka locks threads are working mostly independently need to access common data structure lock *l = alloc_and_init(); /* shared */ acquire(l); access data release(l);

111 KICS, UET Cavium Univ Program © Mutexes (2) Java and other languages have lexically scoped synchronization similar to cobegin/coend vs. fork and join tradeoff Semaphores give guarantees on “fairness” in getting the lock, but the same idea of mutual exclusion Locks only affect processors using them: pair-wise synchronization

112 KICS, UET Cavium Univ Program © Mutexes in POSIX Threads To create a mutex: #include <pthread.h> pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER; or, to initialize it at run time: pthread_mutex_init(&amutex, NULL); To use it: pthread_mutex_lock(&amutex); pthread_mutex_unlock(&amutex);

113 KICS, UET Cavium Univ Program © Mutexes in POSIX Threads (2) To deallocate a mutex int pthread_mutex_destroy(pthread_mutex_t *mutex); Multiple mutexes may be held, but can lead to deadlock: thread1 thread2 lock(a) lock(b) lock(b) lock(a)

114 KICS, UET Cavium Univ Program © Summary of Programming with Threads POSIX Threads are based on OS features Can be used from multiple languages Familiar language for most of program Ability to share data is convenient Pitfalls Intermittent data race bugs are very nasty to find Deadlocks are usually easier, but can also be intermittent OpenMP is commonly used today as an alternative

115 Multi-Threaded Distributed Application Examples Distributed Operating Systems By Andrew S. Tanenbaum

116 KICS, UET Cavium Univ Program © Multithreaded Clients Distribution transparency Needed when a DS operates in a wide-area network environment Need some mechanism to hide communication latency Multithreading on client side is useful One connection per thread If one thread is blocked, others can do useful work More responsive to the user Example: a web browser One thread connected to a server can bring an HTML document Another thread connected to the same server can bring images while the first displays the text, scroll bars, etc.

117 KICS, UET Cavium Univ Program © Multithreaded Servers (1) A multithreaded server organized in a dispatcher/worker model.

118 KICS, UET Cavium Univ Program © Multithreaded Servers (2) Three ways to construct a server.
Model: Characteristics
Threads: parallelism, blocking system calls
Single-threaded process: no parallelism, blocking system calls
Finite-state machine: parallelism, nonblocking system calls

119 KICS, UET Cavium Univ Program © Clients Anatomy of a client process: User interface A major task for most clients is to interact with human users Provide a means to interact with a remote server An important class: Graphical User Interfaces (GUIs) Client-side software for distribution transparency Example: X Window System Used to control bit-mapped devices Monitor, keyboard, and a pointing device X kernel (X server) contains hardware-specific details such as device drivers X uses an event-driven approach Captures events from devices Provides an interface in the form of Xlib for GUI/graphics applications Two types of applications: normal and window manager

120 KICS, UET Cavium Univ Program © The X-Window System The basic organization of the X Window System

121 KICS, UET Cavium Univ Program © User Interface: Compound Documents Function of a user interface is more than interacting with users! May allow multiple applications to share a single graphical window Use that window to exchange data through user actions Typical examples: Drag and drop Drag an icon representing a file on trash can icon Application associated with trash can will be activated to delete file In-place editing Image within a text document in a word processor Pointing on the image can activate a drawing tool Compound documents notion of user interface A collection of different documents (text, images, spreadsheets) Seamlessly integrated through user interface Different applications operate on different parts of the document

122 KICS, UET Cavium Univ Program © Client-Side Software for Distribution Transparency A possible approach to transparent replication of a remote object using a client-side solution Proxy replicates requests to all replicated servers Forms a single response for the client application, giving replication transparency Failure transparency is also possible through client middleware

123 KICS, UET Cavium Univ Program © Servers Organization of a server process: Design issues of a server Object servers Alternatives for invoking objects Object adapter General design of a server: Iterative server Handles all requests itself If necessary, returns a response to the requesting user Concurrent server Does not handle request itself Passes it to a separate thread or process and waits for the next request

124 KICS, UET Cavium Univ Program © Servers: General Design Issues Client-to-server binding using a daemon as in DCE Client-to-server binding using a superserver as in UNIX Other distinctions: Stateless server Stateful server

125 KICS, UET Cavium Univ Program © Key Takeaways of this Session A wealth of knowledge exists about developing parallel applications On legacy parallel architectures For high performance computing (HPC) applications Techniques are applicable to multi-core Similar decomposition, assignment, orchestration, and mapping Shared address space programming Wider range of applications: topic for next session
