
2. Multiprocessors
   © 2008 Wayne Wolf, Overheads for Computers as Components, 2nd ed.
   - Why multiprocessors?
   - CPUs and accelerators.
   - Multiprocessor performance analysis.

3. Why multiprocessors?
   - Better cost/performance.
     - Match each CPU to its tasks, or use custom logic (smaller, cheaper).
     - CPU cost is a non-linear function of performance.
   - [Figure: cost vs. performance curve; cost grows non-linearly with performance.]

4. Why multiprocessors? (cont'd)
   - Better real-time performance.
     - Put time-critical functions on less-loaded processing elements.
     - Remember RMS utilization: extra CPU cycles must be reserved to meet deadlines.
   - [Figure: cost vs. performance curve marking the performance needed to meet the deadline and the higher performance needed once RMS overhead is included.]

5. Why multiprocessors? (cont'd)
   - Using specialized processors or custom logic saves power.
   - Desktop uniprocessors are not power-efficient enough for battery-powered applications [Aus04, © 2004 IEEE Computer Society].

6. Why multiprocessors? (cont'd)
   - Good for processing I/O in real time.
   - May consume less energy.
   - May be better at streaming data.
   - Even the largest single CPU may not be able to do all the work.

7. Accelerated systems
   - Use an additional computational unit dedicated to some functions?
     - Hardwired logic.
     - Extra CPU.
   - Hardware/software co-design: joint design of hardware and software architectures.

8. Accelerated system architecture
   - [Figure: CPU, accelerator, memory, and I/O share the bus; the CPU sends request data to the accelerator, which returns result data.]

9. Accelerator vs. co-processor
   - A co-processor executes instructions.
     - Instructions are dispatched by the CPU.
   - An accelerator appears as a device on the bus.
     - The accelerator is controlled by registers.

10. Accelerator implementations
   - Application-specific integrated circuit (ASIC).
   - Field-programmable gate array (FPGA).
   - Standard component.
     - Example: graphics processor.

11. System design tasks
   - Design a heterogeneous multiprocessor architecture.
     - Processing element (PE): CPU, accelerator, etc.
   - Program the system.

12. Accelerated system design
   - First, determine that the system really needs to be accelerated.
     - How much faster is the accelerator on the core function?
     - How much data transfer overhead?
   - Design the accelerator itself.
   - Design the CPU interface to the accelerator.

13. Accelerated system platforms
   - Several off-the-shelf boards are available for acceleration in PCs:
     - FPGA-based core;
     - PC bus interface.

14. Accelerator/CPU interface
   - Accelerator registers provide control registers for the CPU (a register-access sketch follows below).
   - Data registers can be used for small data objects.
   - The accelerator may include special-purpose read/write logic.
     - Especially valuable for large data transfers.
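As a concrete picture of this register interface, here is a minimal sketch that treats the accelerator's control, status, and data registers as memory-mapped locations. The base address, register layout, and bit definitions are invented for illustration; a real device's data sheet defines the actual map.

/* Driving a (hypothetical) accelerator through memory-mapped registers. */
#include <stdint.h>

#define ACCEL_BASE   0x40000000u               /* invented bus address */
#define ACCEL_CTRL   (*(volatile uint32_t *)(ACCEL_BASE + 0x0))
#define ACCEL_STATUS (*(volatile uint32_t *)(ACCEL_BASE + 0x4))
#define ACCEL_DATA   (*(volatile uint32_t *)(ACCEL_BASE + 0x8))

#define CTRL_START   0x1u
#define STATUS_DONE  0x1u

uint32_t accel_compute(uint32_t operand)
{
    ACCEL_DATA = operand;                      /* small data object via a data register */
    ACCEL_CTRL = CTRL_START;                   /* start the accelerator                 */
    while ((ACCEL_STATUS & STATUS_DONE) == 0)  /* poll the status register              */
        ;                                      /* busy-wait until the result is ready   */
    return ACCEL_DATA;                         /* read back the result                  */
}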

15. System integration and debugging
   - Try to debug the CPU/accelerator interface separately from the accelerator core.
   - Build scaffolding to test the accelerator.
   - Hardware/software co-simulation can be useful.

16. Caching problems
   - Main memory provides the primary data transfer mechanism to the accelerator.
   - Programs must ensure that caching does not leave main memory and the CPU's cache inconsistent (see the sketch after this list). A bad sequence:
     - CPU reads location S (S is now in the cache).
     - Accelerator writes location S (main memory changes; the cached copy does not).
     - CPU writes location S. BAD: the CPU is working from a stale cached copy, and its write can destroy the accelerator's result.
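A minimal sketch of software cache management around an accelerator run, assuming hypothetical, platform-specific cache primitives (cache_flush_range() and cache_invalidate_range() are invented names): flush the CPU's dirty lines before the accelerator reads main memory, and invalidate the cached copies before the CPU reads the accelerator's results.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform primitives: write dirty cache lines back to main
   memory, and discard cached copies so the next read refetches from memory. */
extern void cache_flush_range(void *addr, size_t len);
extern void cache_invalidate_range(void *addr, size_t len);
extern void accelerator_run(uint32_t *shared_buf, size_t len); /* starts HW and waits */

void run_accelerated(uint32_t *shared_buf, size_t len)
{
    /* Make sure the accelerator sees the CPU's latest values in main memory. */
    cache_flush_range(shared_buf, len * sizeof *shared_buf);

    accelerator_run(shared_buf, len);   /* accelerator reads/writes main memory */

    /* Discard stale cached copies so the CPU reads the accelerator's results. */
    cache_invalidate_range(shared_buf, len * sizeof *shared_buf);
}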

17. Synchronization
   - As with the cache, unsynchronized reads and writes to shared memory can leave the two sides seeing inconsistent values:
     - CPU reads S.
     - Accelerator writes S.
     - CPU reads S (and may get a different value than its first read, depending on timing).

18. Multiprocessor performance analysis
   - Effects of parallelism (and lack of it):
     - Processes.
     - CPU and bus.
     - Multiple processors.

19. Accelerator speedup
   - The critical parameter is speedup: how much faster is the system with the accelerator?
   - Must take into account:
     - Accelerator execution time.
     - Data transfer time.
     - Synchronization with the master CPU.

20. Accelerator execution time
   - Total accelerator execution time: t_accel = t_in + t_x + t_out, where t_in is the data input time, t_x the accelerated computation time, and t_out the data output time.

21. Accelerator speedup
   - Assume the loop is executed n times.
   - Compare the accelerated system to the non-accelerated system (worked example below):
     S = n(t_CPU - t_accel) = n[t_CPU - (t_in + t_x + t_out)],
     where t_CPU is the execution time of one iteration on the CPU alone.
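To make the speedup formula concrete, here is a worked example with made-up numbers (not from the book): suppose one iteration takes t_CPU = 100 µs in software, while the accelerator needs t_in = 10 µs, t_x = 20 µs, and t_out = 10 µs, so t_accel = 40 µs. For n = 1000 iterations, the time saved is S = 1000 × (100 µs - 40 µs) = 60 ms. If the transfer overhead grew to t_in + t_out = 80 µs, t_accel would equal t_CPU, S would drop to zero, and the accelerator would no longer pay off.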

22. Single- vs. multi-threaded
   - One critical factor is the available parallelism:
     - single-threaded/blocking: the CPU waits for the accelerator;
     - multithreaded/non-blocking: the CPU continues to execute along with the accelerator.
   - To multithread, the CPU must have useful work to do.
     - But the software must also support multithreading.

23. Total execution time
   - [Figure: two schedules for processes P1-P4 and accelerator task A1. Single-threaded: the CPU blocks while A1 runs, so the component times simply add. Multi-threaded: the CPU keeps running other processes while A1 executes, so the work overlaps.]

24. Execution time analysis
   - Single-threaded: count the execution time of all component processes.
   - Multi-threaded: find the longest path through the execution (see the sketch below).
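The "longest path" rule can be made concrete with a small sketch. The task graph, execution times, and dependency table below are invented for illustration; the point is only that each node starts when its latest predecessor finishes, the multi-threaded completion time is the largest finish time, and the single-threaded estimate is just the sum of all component times.

#include <stdio.h>

#define N 5

int main(void)
{
    /* exec[i] = execution time of node i (made-up numbers) */
    double exec[N] = { 2.0, 3.0, 4.0, 1.5, 2.5 };

    /* dep[i][j] = 1 if node j must finish before node i starts.
       Nodes are already listed in topological order. */
    int dep[N][N] = {
        {0,0,0,0,0},   /* node 0: no predecessors                        */
        {1,0,0,0,0},   /* node 1 depends on 0                            */
        {1,0,0,0,0},   /* node 2 depends on 0 (runs in parallel with 1)  */
        {0,1,1,0,0},   /* node 3 depends on 1 and 2                      */
        {0,0,0,1,0},   /* node 4 depends on 3                            */
    };

    double finish[N];
    double total = 0.0;
    for (int i = 0; i < N; i++) {
        double start = 0.0;
        for (int j = 0; j < i; j++)
            if (dep[i][j] && finish[j] > start)
                start = finish[j];      /* wait for the latest predecessor */
        finish[i] = start + exec[i];
        if (finish[i] > total)
            total = finish[i];
    }
    printf("longest-path (multi-threaded) time: %.1f\n", total);

    /* Single-threaded estimate: just add up all component times. */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += exec[i];
    printf("single-threaded time: %.1f\n", sum);
    return 0;
}

With these made-up numbers the multi-threaded (longest-path) time is 10.0, while the single-threaded sum is 13.0.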

25. Sources of parallelism
   - Overlap I/O and accelerator computation.
     - Perform operations in batches; read in the second batch of data while computing on the first batch (double-buffering sketch below).
   - Find other work to do on the CPU.
     - May reschedule operations to move work after accelerator initiation.
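A minimal double-buffering sketch of "read the next batch while the accelerator works on the current one". The functions read_batch(), start_accel(), and wait_accel() are hypothetical stand-ins for whatever I/O and accelerator-control calls the real platform provides.

#include <stddef.h>

#define BATCH 1024

extern size_t read_batch(int buf[BATCH]);             /* returns items read, 0 at end */
extern void   start_accel(int buf[BATCH], size_t n);  /* non-blocking start           */
extern void   wait_accel(void);                       /* block until accelerator done */

void process_stream(void)
{
    static int buf[2][BATCH];
    int cur = 0;

    size_t n = read_batch(buf[cur]);              /* fill the first buffer            */
    while (n > 0) {
        start_accel(buf[cur], n);                 /* accelerator works on batch i     */
        size_t next_n = read_batch(buf[1 - cur]); /* CPU reads batch i+1 meanwhile    */
        wait_accel();                             /* sync before reusing the buffer   */
        cur = 1 - cur;
        n = next_n;
    }
}

The wait_accel() call before swapping buffers is what keeps the CPU from handing the accelerator a buffer it is still reading from the next batch into.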

26. Data input/output times
   - Bus transactions include:
     - flushing register/cache values to main memory;
     - time required for the CPU to set up the transaction;
     - overhead of data transfers by bus packets, handshaking, etc.
   - (A simple additive cost model is sketched below.)
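One way to reason about these components is a simple additive cost model: a fixed flush-plus-setup cost followed by a per-packet cost. The function and all constants below are assumptions for illustration, not parameters of any real bus.

#include <stdio.h>

/* Toy model of one transfer: fixed flush + setup cost, plus a per-packet
   cost covering payload transfer and handshaking overhead. */
double transfer_time(double t_flush, double t_setup,
                     size_t bytes, size_t packet_bytes,
                     double t_per_packet)
{
    size_t packets = (bytes + packet_bytes - 1) / packet_bytes;  /* round up */
    return t_flush + t_setup + (double)packets * t_per_packet;
}

int main(void)
{
    /* e.g. 4 KB transferred in 64-byte packets (made-up timing values) */
    double t = transfer_time(2e-6, 1e-6, 4096, 64, 0.5e-6);
    printf("estimated transfer time: %.1f us\n", t * 1e6);  /* prints 35.0 us */
    return 0;
}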

27. Scheduling and allocation
   - Must:
     - schedule operations in time;
     - allocate computations to processing elements.
   - Scheduling and allocation interact, but separating them helps.
     - One alternative: allocate first, then schedule.

28. Example: scheduling and allocation
   - [Figure: a task graph in which P1 and P2 each send data (d1 and d2) to P3, and a hardware platform with two processing elements, M1 and M2, connected by a bus.]

29. First design
   - Allocate P1, P2 → M1; P3 → M2.
   - [Schedule: M1 runs P1 then P2; their output communications P1C and P2C go over the bus; M2 runs P3 only after both arrive.]

30. Second design
   - Allocate P1 → M1; P2, P3 → M2.
   - [Schedule: P1 on M1 and P2 on M2 run in parallel; only P1's output communication P1C crosses the bus; M2 then runs P3.]

31. Example: adjusting messages to reduce delay
   - Task graph: P1 and P2 send data d1 and d2 to P3.
   - Allocation: P1 → M1, P2 → M2, P3 → M3.
   - Execution times: 3, 4, and 3 (for P1, P2, and P3); transmission time for each message = 4.

32. Initial schedule
   - [Gantt chart over time 0 to 20: P1 runs on M1 and P2 on M2; the network then carries d1 and d2 as complete messages, one after the other; P3 on M3 starts only after both transfers finish. Total time = 15.]

33. New design
   - Modify P3 (sketched in code below) so that it:
     - reads one packet of d1 and one packet of d2;
     - computes a partial result;
     - continues to the next packet.
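A minimal sketch of the restructured P3, assuming hypothetical recv_packet() and combine() routines in place of the real communication and computation code: it consumes d1 and d2 one packet at a time and folds each pair into a partial result, so its computation can overlap with the remaining transfers.

#include <stddef.h>

#define PKT_WORDS 64

extern int  recv_packet(int src, int pkt[PKT_WORDS]);  /* returns 0 when the stream ends */
extern long combine(long partial, const int d1[PKT_WORDS], const int d2[PKT_WORDS]);

long p3(void)
{
    int d1[PKT_WORDS], d2[PKT_WORDS];
    long partial = 0;

    /* Each iteration overlaps with the senders producing the next packets. */
    while (recv_packet(1, d1) && recv_packet(2, d2))
        partial = combine(partial, d1, d2);

    return partial;
}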

34. New schedule
   - [Gantt chart over time 0 to 20: P1 and P2 run as before, but d1 and d2 are sent one packet at a time and P3 computes on each packet pair as it arrives, overlapping communication with computation. Total time = 12.]

35. Buffering and performance
   - Buffering may sequentialize operations.
     - The next process must wait for data to enter the buffer before it can continue.
   - The buffer policy (queue, RAM) affects the available parallelism.

36. Buffers and latency
   - Three processes separated by buffers:
   - [Figure: data flows through buffer B1 into process A, through buffer B2 into process B, and through buffer B3 into process C.]

37. Buffers and latency: schedules
   - Batch schedule: A[0] A[1] ..., then B[0] B[1] ..., then C[0] C[1] ... (must wait for all of A before getting any B).
   - Interleaved schedule: A[0] B[0] C[0], A[1] B[1] C[1], ... (each item moves through the chain as soon as it is produced; compare the two orderings in the sketch below).
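The two orderings can be written down directly. The per-item functions a(), b(), and c() are hypothetical placeholders; only the iteration structure matters: in the batch schedule the first C output appears after all of A and B have run, while in the interleaved schedule it appears after a single item has flowed through the chain.

#define N 4

extern int  a(int i);       /* produce item i           */
extern int  b(int x);       /* second stage             */
extern void c(int x);       /* final stage (consumes x) */

/* Batch schedule: all of A, then all of B, then all of C. */
void batch(void)
{
    int ab[N], bc[N];
    for (int i = 0; i < N; i++) ab[i] = a(i);
    for (int i = 0; i < N; i++) bc[i] = b(ab[i]);
    for (int i = 0; i < N; i++) c(bc[i]);
}

/* Interleaved schedule: each item flows through the whole chain immediately. */
void interleaved(void)
{
    for (int i = 0; i < N; i++)
        c(b(a(i)));
}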

