
1 Design and Implementation of the CCC Parallel Programming Language
Nai-Wei Lin
Department of Computer Science and Information Engineering
National Chung Cheng University

2 Outline (ICS2004)
- Introduction
- The CCC programming language
- The CCC compiler
- Performance evaluation
- Conclusions

3 Motivations
- Parallelism is the future trend
- Programming in parallel is much more difficult than programming in serial
- Parallel architectures are very diverse
- Parallel programming models are very diverse

4 Motivations
- Design a parallel programming language that uniformly integrates various parallel programming models
- Implement a retargetable compiler for this parallel programming language on various parallel architectures

5 Approaches to Parallelism
- Library approach
  - MPI (Message Passing Interface), pthread
- Compiler approach
  - HPF (High Performance Fortran), HPC++
- Language approach
  - Occam, Linda, CCC (Chung Cheng C)

6 Models of Parallel Architectures
- Control model
  - SIMD: Single Instruction Multiple Data
  - MIMD: Multiple Instruction Multiple Data
- Data model
  - Shared memory
  - Distributed memory

7 Models of Parallel Programming
- Concurrency
  - Control parallelism: simultaneously execute multiple threads of control
  - Data parallelism: simultaneously execute the same operations on multiple data
- Synchronization and communication
  - Shared variables
  - Message passing

8 Granularity of Parallelism
- Procedure-level parallelism
  - Concurrent execution of procedures on multiple processors
- Loop-level parallelism
  - Concurrent execution of iterations of loops on multiple processors
- Instruction-level parallelism
  - Concurrent execution of instructions on a single processor with multiple functional units

9 The CCC Programming Language
- CCC is a simple extension of C that supports both control and data parallelism
- A CCC program consists of a set of concurrent and cooperative tasks
- Control-parallel tasks run in MIMD mode and communicate via shared variables and/or message passing
- Data-parallel tasks run in SIMD mode and communicate via shared variables

10 Tasks in CCC Programs
[Figure: the tasks of a CCC program, grouped into control-parallel and data-parallel tasks]

11 Control Parallelism
- Concurrency
  - task
  - par and parfor
- Synchronization and communication
  - Shared variables: monitors
  - Message passing: channels
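To illustrate what par and parfor express (this is not CCC syntax), hypothetical Python helpers `par` and `parfor` can model their semantics with threads:

```python
import threading

def par(*tasks):
    # Run every task concurrently and wait for all of them,
    # roughly like CCC's par block.
    threads = [threading.Thread(target=t) for t in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def parfor(n, body):
    # Run body(0), ..., body(n-1) concurrently, roughly like CCC's parfor.
    par(*[lambda i=i: body(i) for i in range(n)])

squares = [0] * 4

def set_square(i):
    squares[i] = i * i

parfor(4, set_square)
```

Like the CCC constructs, both helpers block until every spawned activity has finished.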

12 Monitors
- The monitor construct is a modular and efficient construct for synchronizing shared variables among concurrent tasks
- It provides data abstraction, mutual exclusion, and conditional synchronization

13 An Example - Barber Shop
[Figure: a barber, a barber chair, and waiting customers]

14 An Example - Barber Shop

task::main( )
{
    monitor Barber_Shop bs;
    int i;

    par {
        barber( bs );
        parfor (i = 0; i < 10; i++)
            customer( bs );
    }
}

15 An Example - Barber Shop

task::barber(monitor Barber_Shop in bs)
{
    while ( 1 ) {
        bs.get_next_customer( );
        bs.finished_cut( );
    }
}

task::customer(monitor Barber_Shop in bs)
{
    bs.get_haircut( );
}

16 An Example - Barber Shop

monitor Barber_Shop {
    int barber, chair, open;
    cond barber_available, chair_occupied;
    cond door_open, customer_left;

    Barber_Shop( );
    void get_haircut( );
    void get_next_customer( );
    void finished_cut( );
};

17 An Example - Barber Shop

Barber_Shop( )
{
    barber = 0; chair = 0; open = 0;
}

void get_haircut( )
{
    while (barber == 0)
        wait(barber_available);
    barber -= 1;
    chair += 1;
    signal(chair_occupied);
    while (open == 0)
        wait(door_open);
    open -= 1;
    signal(customer_left);
}

18 An Example - Barber Shop

void get_next_customer( )
{
    barber += 1;
    signal(barber_available);
    while (chair == 0)
        wait(chair_occupied);
    chair -= 1;
}

void finished_cut( )
{
    open += 1;
    signal(door_open);
    while (open > 0)
        wait(customer_left);
}
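A monitor maps naturally onto one mutex plus condition variables. As a rough sketch (not CCC output), the barber-shop monitor above can be transcribed into Python with threading.Condition; the class name BarberShop, the cuts counter, and the run() driver are illustrative additions:

```python
import threading

class BarberShop:
    # Python analogue of the CCC Barber_Shop monitor: one lock provides
    # mutual exclusion; Conditions play the role of CCC's cond variables.
    def __init__(self):
        self.lock = threading.Lock()
        self.barber = self.chair = self.open = 0
        self.barber_available = threading.Condition(self.lock)
        self.chair_occupied = threading.Condition(self.lock)
        self.door_open = threading.Condition(self.lock)
        self.customer_left = threading.Condition(self.lock)
        self.cuts = 0  # extra counter, only for checking the sketch

    def get_haircut(self):
        with self.lock:
            while self.barber == 0:
                self.barber_available.wait()
            self.barber -= 1
            self.chair += 1
            self.chair_occupied.notify()
            while self.open == 0:
                self.door_open.wait()
            self.open -= 1
            self.customer_left.notify()

    def get_next_customer(self):
        with self.lock:
            self.barber += 1
            self.barber_available.notify()
            while self.chair == 0:
                self.chair_occupied.wait()
            self.chair -= 1

    def finished_cut(self):
        with self.lock:
            self.open += 1
            self.door_open.notify()
            while self.open > 0:
                self.customer_left.wait()
            self.cuts += 1

def run(n_customers=5):
    # Serve a fixed number of customers instead of CCC's while(1) loop.
    shop = BarberShop()
    def barber():
        for _ in range(n_customers):
            shop.get_next_customer()
            shop.finished_cut()
    threads = [threading.Thread(target=barber)] + \
              [threading.Thread(target=shop.get_haircut)
               for _ in range(n_customers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shop.cuts
```

The wait loops are kept as `while` conditions, matching the monitor's guarded waits, so spurious wakeups are harmless.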

19 Channels
- The channel construct is a modular and efficient construct for message passing among concurrent tasks
- Pipe: one to one
- Merger: many to one
- Spliter: one to many
- Multiplexer: many to many
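A spliter (one producer, many consumers) can be approximated with a thread-safe queue. This Python sketch uses hypothetical names (spliter_demo, END) and queue.Queue in place of CCC's channel construct:

```python
import queue
import threading

END = object()  # sentinel value, like the END message in the slides

def spliter_demo(n_items=100, n_consumers=10):
    # One-to-many channel: a single producer feeds many consumers.
    chan = queue.Queue()
    consumed = []
    consumed_lock = threading.Lock()

    def producer():
        for i in range(n_items):
            chan.put(i)
        for _ in range(n_consumers):  # one END message per consumer
            chan.put(END)

    def consumer():
        while True:
            data = chan.get()
            if data is END:
                break
            with consumed_lock:
                consumed.append(data)

    threads = [threading.Thread(target=producer)] + \
              [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(consumed)
```

Each item is taken by exactly one consumer, which is the defining property of a one-to-many spliter as opposed to a broadcast.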

20 Channels
- Communication structures among parallel tasks are more comprehensible
- The specification of communication structures is easier
- The implementation of communication structures is more efficient
- The static analysis of communication structures is more effective

21 An Example - Consumer-Producer
[Figure: a producer task connected to multiple consumer tasks through a spliter channel]

22 An Example - Consumer-Producer

task::main( )
{
    spliter int chan;
    int i;

    par {
        producer( chan );
        parfor (i = 0; i < 10; i++)
            consumer( chan );
    }
}

23 An Example - Consumer-Producer

task::producer(spliter in int chan)
{
    int i;

    for (i = 0; i < 100; i++)
        put(chan, i);
    for (i = 0; i < 10; i++)   /* one END message per consumer */
        put(chan, END);
}

24 An Example - Consumer-Producer

task::consumer(spliter in int chan)
{
    int data;

    while ((data = get(chan)) != END)
        process(data);
}

25 Data Parallelism
- Concurrency
  - domain: an aggregate of synchronous tasks
- Synchronization and communication
  - domain: variables in a global name space

26 An Example - Matrix Multiplication
[Figure: the matrix product C = A x B]

27 An Example - Matrix Multiplication

domain matrix_op[16] {
    int a[16], b[16], c[16];
    multiply(distribute in int [16:block][16],
             distribute in int [16][16:block],
             distribute out int [16:block][16]);
};

28 An Example - Matrix Multiplication

task::main( )
{
    int A[16][16], B[16][16], C[16][16];
    domain matrix_op m;

    read_array(A);
    read_array(B);
    m.multiply(A, B, C);
    print_array(C);
}

29 An Example - Matrix Multiplication

matrix_op::multiply(A, B, C)
    distribute in int [16:block][16] A;
    distribute in int [16][16:block] B;
    distribute out int [16:block][16] C;
{
    int i, j;

    a := A;
    b := B;
    for (i = 0; i < 16; i++)
        for (c[i] = 0, j = 0; j < 16; j++)
            c[i] += a[j] * matrix_op[i].b[j];
    C := c;
}
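The [16:block] distribution above gives each domain member a contiguous block of the matrix. The same idea can be mimicked by giving each worker thread a block of rows; par_matmul below is an illustrative Python analogue, not the CCC compiler's actual output:

```python
import threading

def par_matmul(A, B, n_tasks=4):
    # Each "domain member" owns a contiguous block of rows of A and
    # computes the corresponding rows of C, mirroring a block distribution.
    n = len(A)
    C = [[0] * n for _ in range(n)]

    def member(lo, hi):
        for i in range(lo, hi):
            for j in range(n):
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))

    step = (n + n_tasks - 1) // n_tasks  # rows per member, rounded up
    threads = [threading.Thread(target=member, args=(lo, min(lo + step, n)))
               for lo in range(0, n, step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C
```

Because the row blocks are disjoint, the members write to C without any synchronization beyond the final join.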

30 Platforms for the CCC Compiler
- PCs and SMPs
  - Pthread: shared memory + dynamic thread creation
- PC clusters and SMP clusters
  - Millipede: distributed shared memory + dynamic remote thread creation
- The similarities between these two classes of machines enable a retargetable compiler implementation for CCC

31 Organization of the CCC Programming System
[Figure: layered architecture, from top to bottom]
- CCC applications
- CCC compiler and CCC runtime library
- Virtual shared memory machine interface
- Pthread / Millipede
- SMP / SMP cluster

32 The CCC Compiler
- Tasks → threads
- Monitors → mutex locks, read-write locks, and condition variables
- Channels → mutex locks and condition variables
- Domains → sets of synchronous threads
- Synchronous execution → barriers
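The last mapping (synchronous execution → barriers) can be sketched with threading.Barrier: no domain member starts phase p+1 until every member has finished phase p. The helper synchronous_phases is hypothetical:

```python
import threading

def synchronous_phases(n_threads=4, n_phases=3):
    # A barrier ends each phase, keeping all members in lockstep.
    barrier = threading.Barrier(n_threads)
    log = []
    lock = threading.Lock()

    def member():
        for phase in range(n_phases):
            with lock:
                log.append(phase)   # "work" of this phase
            barrier.wait()          # wait until every member finishes it

    threads = [threading.Thread(target=member) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

The returned log is nondecreasing: all entries for phase p precede any entry for phase p+1, which is exactly the lockstep property the barrier enforces.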

33 Virtual Shared Memory Machine Interface
- Processor management
- Thread management
- Shared memory allocation
- Mutex locks
- Read-write locks
- Condition variables
- Barriers

34 The CCC Runtime Library
- The CCC runtime library contains a collection of functions that implement the salient abstractions of CCC on top of the virtual shared memory machine interface

35 Performance Evaluation
- SMPs
  - Hardware: an SMP machine with four CPUs; each CPU is an Intel Pentium II Xeon 450 MHz with 512 KB cache
  - Software: the OS is Solaris 5.7 and the thread library is pthread 1.26
- SMP clusters
  - Hardware: four SMP machines, each with two CPUs; each CPU is an Intel Pentium III 500 MHz with 512 KB cache
  - Software: the OS is Windows 2000 and the library is Millipede 4.0
  - Network: 100 Mbps Fast Ethernet

36 Benchmarks
- Matrix multiplication (1024 x 1024)
- Warshall's transitive closure (1024 x 1024)
- Airshed simulation (5)

37 Matrix Multiplication (SMPs)
Sequential time: 287.5 sec. Each cell shows time in seconds (speedup, efficiency).

                | 1 thread/cpu        | 2 threads/cpu       | 4 threads/cpu       | 8 threads/cpu
CCC (1 cpu)     | 295.05 (0.97, 0.97) | 264.24 (1.08, 1.08) | 250.45 (1.14, 1.14) | 275.32 (1.04, 1.04)
Pthread (1 cpu) | 292.42 (0.98, 0.98) | 257.45 (1.12, 1.12) | 244.24 (1.17, 1.17) | 266.20 (1.08, 1.08)
CCC (2 cpu)     | 152.29 (1.89, 0.94) | 110.54 (2.60, 1.30) | 98.32 (2.93, 1.46)  | 124.44 (2.31, 1.16)
Pthread (2 cpu) | 149.88 (1.91, 0.96) | 105.45 (2.72, 1.36) | 93.56 (3.07, 1.53)  | 119.42 (2.41, 1.20)
CCC (4 cpu)     | 76.39 (3.76, 0.94)  | 69.44 (4.14, 1.03)  | 64.44 (4.46, 1.11)  | 73.54 (3.90, 0.98)
Pthread (4 cpu) | 74.72 (3.85, 0.96)  | 65.42 (4.39, 1.09)  | 59.44 (4.83, 1.20)  | 69.88 (4.11, 1.02)

38 Matrix Multiplication (SMP clusters)
Sequential time: 470.44 sec. Each cell shows time in seconds (speedup, efficiency).

                           | 1 thread/cpu        | 2 threads/cpu       | 4 threads/cpu       | 8 threads/cpu
CCC (1 mach x 2 cpu)       | 253.12 (1.85, 0.93) | 201.23 (2.33, 1.16) | 158.31 (2.97, 1.48) | 234.46 (2.00, 1.00)
Millipede (1 mach x 2 cpu) | 248.11 (1.89, 0.95) | 196.33 (2.39, 1.19) | 154.22 (3.05, 1.53) | 224.95 (2.09, 1.05)
CCC (2 mach x 2 cpu)       | 136.34 (3.45, 0.86) | 102.25 (4.60, 1.15) | 96.25 (4.89, 1.22)  | 148.25 (3.17, 0.79)
Millipede (2 mach x 2 cpu) | 129.33 (3.63, 0.91) | 96.52 (4.87, 1.22)  | 91.45 (5.14, 1.27)  | 142.45 (3.31, 0.82)
CCC (4 mach x 2 cpu)       | 87.25 (5.39, 0.67)  | 62.33 (7.54, 0.94)  | 80.25 (5.45, 0.73)  | 102.45 (4.67, 0.58)
Millipede (4 mach x 2 cpu) | 78.37 (6.00, 0.75)  | 54.92 (8.56, 1.07)  | 75.98 (5.57, 0.75)  | 95.44 (4.87, 0.61)

39 Warshall's Transitive Closure (SMPs)
Sequential time: 150.32 sec. Each cell shows time in seconds (speedup, efficiency).

                | 1 thread/cpu        | 2 threads/cpu       | 4 threads/cpu       | 8 threads/cpu
CCC (1 cpu)     | 152.88 (0.98, 0.98) | 138.44 (1.08, 1.08) | 143.54 (1.05, 1.05) | 154.33 (0.97, 0.97)
Pthread (1 cpu) | 151.25 (0.99, 0.99) | 135.45 (1.11, 1.11) | 139.21 (1.07, 1.07) | 152.44 (0.99, 0.99)
CCC (2 cpu)     | 83.36 (1.80, 0.90)  | 69.45 (2.16, 1.08)  | 78.54 (1.91, 0.96)  | 98.24 (1.53, 0.77)
Pthread (2 cpu) | 79.32 (1.90, 0.95)  | 66.85 (2.25, 1.12)  | 74.24 (2.02, 1.01)  | 93.44 (1.60, 0.80)
CCC (4 cpu)     | 49.43 (3.04, 0.76)  | 43.19 (3.48, 0.87)  | 58.44 (2.57, 0.64)  | 77.42 (1.94, 0.49)
Pthread (4 cpu) | 44.14 (3.40, 0.85)  | 40.89 (3.68, 0.91)  | 55.23 (2.72, 0.68)  | 74.21 (2.02, 0.51)

40 Warshall's Transitive Closure (SMP clusters)
Sequential time: 305.35 sec. Each cell shows time in seconds (speedup, efficiency).

                           | 1 thread/cpu        | 2 threads/cpu       | 4 threads/cpu       | 8 threads/cpu
CCC (1 mach x 2 cpu)       | 159.24 (1.91, 0.96) | 132.81 (2.29, 1.14) | 102.19 (2.98, 1.49) | 153.90 (1.98, 0.99)
Millipede (1 mach x 2 cpu) | 155.34 (1.96, 0.98) | 125.91 (2.42, 1.21) | 95.29 (3.20, 1.59)  | 144.53 (2.11, 1.06)
CCC (2 mach x 2 cpu)       | 100.03 (3.05, 0.76) | 82.40 (3.70, 0.92)  | 148.97 (2.04, 0.52) | 202.78 (1.50, 0.38)
Millipede (2 mach x 2 cpu) | 88.45 (3.45, 0.86)  | 75.91 (4.02, 1.00)  | 140.28 (2.17, 0.54) | 189.38 (1.61, 0.41)
CCC (4 mach x 2 cpu)       | 60.06 (5.08, 0.64)  | 54.56 (5.59, 0.70)  | 89.68 (3.40, 0.43)  | 138.76 (2.20, 0.27)
Millipede (4 mach x 2 cpu) | 54.05 (5.65, 0.71)  | 47.53 (6.42, 0.80)  | 81.28 (3.75, 0.46)  | 129.96 (2.36, 0.30)

41 Airshed Simulation (SMPs)
Sequential time: 14.2 sec. Column headings give thread configurations; each cell shows time in seconds (speedup, efficiency).

                | 5\5\5           | 1\5\5           | 5\1\5            | 5\5\1            | 1\1\5            | 1\5\1            | 5\1\1
CCC (2 cpu)     | 8.68 (1.6, 0.8) | 8.84 (1.6, 0.8) | 10.52 (1.3, 0.6) | 12.87 (1.1, 0.5) | 10.75 (1.3, 0.6) | 13.20 (1.1, 0.5) | 14.85 (0.9, 0.4)
Pthread (2 cpu) | 8.63 (1.6, 0.8) | 8.82 (1.6, 0.8) | 10.42 (1.3, 0.6) | 12.84 (1.1, 0.5) | 10.72 (1.3, 0.6) | 13.19 (1.1, 0.5) | 14.82 (0.9, 0.4)
CCC (4 cpu)     | 6.49 (2.1, 0.5) | 6.84 (2.1, 0.5) | 9.03 (1.5, 0.3)  | 12.08 (1.1, 0.2) | 9.41 (1.5, 0.3)  | 12.46 (1.1, 0.2) | 14.66 (0.9, 0.2)
Pthread (4 cpu) | 6.37 (2.2, 0.5) | 6.81 (2.1, 0.5) | 9.02 (1.5, 0.3)  | 12.07 (1.1, 0.2) | 9.38 (1.5, 0.3)  | 12.44 (1.1, 0.2) | 14.62 (0.9, 0.2)

42 Airshed Simulation (SMP clusters)
Column headings give thread configurations; each cell shows time in seconds (speedup, efficiency).

                           | Seq  | 5\5\5            | 1\5\5            | 5\1\5            | 5\5\1            | 1\1\5            | 1\5\1            | 5\1\1
CCC (1 mach x 2 cpu)       | 49.7 | 26.13 (1.9, 0.9) | 26.75 (1.8, 0.9) | 30.37 (1.6, 0.8) | 44.25 (1.1, 0.5) | 31.97 (1.5, 0.7) | 45.25 (1.1, 0.5) | 48.51 (1.1, 0.5)
Millipede (1 mach x 2 cpu) | 49.9 | 20.02 (2.4, 1.2) | 20.87 (2.3, 1.1) | 26.05 (1.9, 0.9) | 30.41 (1.6, 0.8) | 26.42 (1.8, 0.9) | 31.13 (1.5, 0.7) | 35.89 (1.3, 0.6)
CCC (2 mach x 2 cpu)       | 49.9 | 26.41 (1.8, 0.4) | 27.51 (1.8, 0.4) | 50.42 (0.9, 0.2) | 56.68 (0.8, 0.2) | 54.76 (0.9, 0.2) | 58.25 (0.8, 0.2) | 91.17 (0.5, 0.1)
Millipede (2 mach x 2 cpu) | 49.9 | 19.98 (2.4, 0.6) | 21.84 (2.2, 0.5) | 31.33 (1.5, 0.4) | 39.31 (1.2, 0.3) | 30.85 (1.6, 0.4) | 42.13 (1.1, 0.2) | 36.38 (1.3, 0.3)
CCC (4 mach x 2 cpu)       | 49.9 | 23.09 (2.1, 0.2) | 25.59 (1.9, 0.2) | 48.97 (1.0, 0.1) | 58.31 (0.8, 0.1) | 53.33 (0.9, 0.1) | 61.96 (0.8, 0.1) | 89.61 (0.5, 0.1)
Millipede (4 mach x 2 cpu) | 49.9 | 16.72 (2.9, 0.3) | 17.61 (2.8, 0.3) | 35.11 (1.4, 0.2) | 41.03 (1.2, 0.1) | 33.95 (1.4, 0.2) | 40.88 (1.2, 0.1) | 36.07 (1.3, 0.1)

43 Conclusions
- A high-level parallel programming language that uniformly integrates
  - both control and data parallelism
  - both shared variables and message passing
- A modular parallel programming language
- A retargetable compiler

