Lecture 10 Hardware Accelerators Ingo Sander


1 Lecture 10 Hardware Accelerators Ingo Sander ingo@kth.se

2 Introduction: Hardware Accelerators

3 Design Constraint Propagation
A design constraint on the system level leads to new design constraints on the subsystem level.
[Figure: a system-level constraint of t < 500 ms on the chain P1, P2, P3 is propagated into subsystem constraints t < 100 ms (P1), t < 250 ms (P2) and t < 150 ms (P3).]

4 Design Constraint Propagation
An estimation tool can give the execution time of a subsystem. What happens if a subsystem is too slow?
[Figure: for the same example, the estimated execution times are 95 ms (P1), 280 ms (P2) and 145 ms (P3). P2 exceeds its 250 ms constraint and is too slow!]

5 How to improve the performance of a microprocessor system?
- Improve your code
- Choose a faster version of your microprocessor
- Add additional computational units that perform special functions: a standard component (graphics processor), a coprocessor (floating-point processor), an additional microprocessor, or a hardware accelerator

6 Hardware Accelerators
If the overall performance of a uniprocessor system is too low, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator! The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor. © 2000 Wolf (Morgan Kaufmann)

7 Accelerated System Architecture
[Figure: CPU, accelerator, memory and I/O on a shared bus. The CPU sends a request to the accelerator (1), data is transferred (2), and the accelerator returns the result (3).] Request and result may also require access to memory. © 2000 Wolf (Morgan Kaufmann)

8 An Accelerator is not a Coprocessor
A coprocessor is connected to the CPU and executes special instructions; the instructions are dispatched by the CPU. An accelerator appears as a device on the bus. © 2000 Wolf (Morgan Kaufmann)

9 Amdahl's Law
Amdahl's law states that the performance improvement gained from an improved unit is limited by the fraction of time the unit is in use! The fraction denotes the percentage of time the enhancement can be used.
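
Written out as a formula, with F the fraction of time the enhancement can be used and S the speedup of the enhanced part, the law reads:

```latex
\mathrm{Speedup}_{\mathrm{overall}} \;=\; \frac{1}{(1 - F) + \dfrac{F}{S}}
```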

10 Example (Hennessy & Patterson)
An application uses the floating-point square root 20% of the time and floating-point operations 50% of the time. Is it better to implement a square root unit that speeds up this operation by a factor of 10, or to improve the floating-point instructions in general so that they run 2 times faster?

11 Example (Hennessy & Patterson)
Square root: Speedup = 1 / ((1 - 0.2) + 0.2/10) = 1/0.82 = 1.22
Floating point: Speedup = 1 / ((1 - 0.5) + 0.5/2) = 1/0.75 = 1.33

12 Amdahl's Law: Lessons to be learned
The maximum speedup that is possible is limited by the fraction! Assume infinite speedup of the enhanced part: Speedup = 1 / ((1 - F) + F/infinity) = 1/(1 - F)

Fraction F:    0.1   0.3   0.5   0.9
Max. speedup:  1.11  1.43  2     10

Improve the common cases!
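
As a quick cross-check of the numbers on slides 11 and 12, a small illustrative C program (not part of the lecture material) that evaluates Amdahl's law:

```c
#include <stdio.h>

/* Overall speedup according to Amdahl's law:
 * F is the fraction of time the enhancement can be used,
 * S is the speedup of the enhanced part. */
static double amdahl(double F, double S)
{
    return 1.0 / ((1.0 - F) + F / S);
}

int main(void)
{
    /* The two cases from the Hennessy & Patterson example */
    printf("Square root unit:    %.2f\n", amdahl(0.2, 10.0)); /* 1.22 */
    printf("Faster FP in general: %.2f\n", amdahl(0.5, 2.0)); /* 1.33 */

    /* Limit for infinite speedup of the enhanced part: 1 / (1 - F) */
    const double F[] = { 0.1, 0.3, 0.5, 0.9 };
    for (int i = 0; i < 4; i++)
        printf("F = %.1f -> max. speedup %.2f\n", F[i], 1.0 / (1.0 - F[i]));
    return 0;
}
```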

13 Amdahl's Law for Parallel Architectures
Amdahl's law can also be used for parallel architectures, where sequential code is parallelized and runs on identical parallel units! The fraction then denotes the percentage of the code for which parallelism can be exploited.

14 Design of a Hardware Accelerator

15 Design of a hardware accelerator
Which functions shall be implemented in hardware and which functions in software? Hardware/software co-design: the joint design of the hardware and software architectures. The hardware accelerator can be implemented as an application-specific integrated circuit (ASIC) or on a field-programmable gate array (FPGA).

16 Hardware/Software Co-Design
[Design-flow diagram: the system model (original program as concurrent processes) goes through partitioning and mapping, which decides which functions shall go to HW and which to SW, guided by an estimation library. The SW model (C/C++) is compiled into an executable program; the HW model (VHDL) is synthesized into a netlist; both branches are verified.] Good estimates are needed for good partitioning.

17 Hardware/Software Co-Design
Hardware/software co-design covers the following problems:
- Co-specification: the creation of specifications that describe both the hardware and the software of a system
- Co-synthesis: the automatic or semi-automatic design of hardware and software to meet a specification
- Co-simulation: the simultaneous simulation of hardware and software elements on different levels of abstraction

18 Co-Synthesis
Four tasks are included in co-synthesis:
- Partitioning: the functionality of the system is divided into smaller, interacting computation units
- Allocation: the decision which computational resources are used to implement the functionality of the system
- Scheduling: if several system functions have to share the same resource, the usage of the resource must be scheduled in time
- Mapping: the selection of a particular allocated computational unit for each computation unit
All these tasks depend on each other!

19 Partitioning
During partitioning the functionality of the system is partitioned into several parts (corresponding to the allocated/available components). Many possible partitions exist. Analysis is done by evaluating the costs of different partitions. [Figure: two different partitionings of a task graph with nodes A, B, C, D, E.]

20 Estimation
In order to get a good partitioning, there is a need for good figures about the performance of a function on the different components: execution time and communication time.

21 Estimation Accuracy and Fidelity
The accuracy of an estimate is a measure of how close the estimate is to the actual value of the real implementation. The fidelity of an estimation method is defined as the percentage of correctly predicted comparisons between design implementations.
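
A minimal C sketch of how fidelity can be computed for a set of designs; the example values for A, B and C are hypothetical, chosen to reproduce the 33% case on the next slide:

```c
#include <stdio.h>

/* Fidelity: the percentage of pairwise comparisons between designs
 * that the estimates predict correctly.
 * e[] are the estimates, m[] the measured values, n the number of designs. */
static double fidelity(const double e[], const double m[], int n)
{
    int correct = 0, total = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            total++;
            int est = (e[i] > e[j]) - (e[i] < e[j]);   /* sign of estimated comparison */
            int act = (m[i] > m[j]) - (m[i] < m[j]);   /* sign of measured comparison  */
            if (est == act)
                correct++;
        }
    }
    return 100.0 * correct / total;
}

int main(void)
{
    /* Hypothetical numbers for designs A, B, C, chosen so that only
     * the comparison A > C is predicted correctly (fidelity = 33%). */
    double estimate[]    = { 3.0, 1.0, 2.0 };  /* A, B, C */
    double measurement[] = { 4.0, 5.0, 1.0 };
    printf("Fidelity: %.0f %%\n", fidelity(estimate, measurement, 3));
    return 0;
}
```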

22 Fidelity
[Figure: quality metric (estimate vs. measurement) for designs A, B and C in two cases. Case 1: fidelity = 100%. Case 2: fidelity = 33%, since only the comparison A > C is predicted correctly.] Though the accuracy is much higher in (2) than in (1), the estimates are not very useful for the partitioning process because of the low fidelity! This can cause bad design decisions!

23 Hardware/Software Co-Design Strategies
1. Start with an "all-software" configuration:
   while (constraints are not satisfied)
       move the SW function that gives the best improvement to HW
   (implemented in COSYMA [Ernst, Henkel, Benner 1993])
2. Start with an "all-hardware" configuration:
   while (constraints are satisfied)
       move the most costly HW component to SW
   (implemented in Vulcan [Gupta, De Micheli 1995])
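
An illustrative C sketch of strategy 1, the "all-software start" loop. The constraint check and gain estimation are left as placeholder declarations; this is only a sketch of the idea, not COSYMA's actual algorithm:

```c
#include <stdbool.h>

#define N_FUNCS 16

enum impl { SW, HW };

/* Placeholder hooks -- in a real co-design tool these values
 * come from the estimation library. */
extern bool   constraints_satisfied(const enum impl mapping[N_FUNCS]);
extern double estimated_gain(const enum impl mapping[N_FUNCS], int func);

/* Strategy 1: start with an all-software configuration and, while the
 * constraints are not satisfied, move the function with the best
 * estimated improvement to hardware. */
void partition_sw_first(enum impl mapping[N_FUNCS])
{
    for (int f = 0; f < N_FUNCS; f++)
        mapping[f] = SW;

    while (!constraints_satisfied(mapping)) {
        int best = -1;
        double best_gain = 0.0;
        for (int f = 0; f < N_FUNCS; f++) {
            if (mapping[f] == SW && estimated_gain(mapping, f) > best_gain) {
                best_gain = estimated_gain(mapping, f);
                best = f;
            }
        }
        if (best < 0)          /* nothing left to move: constraints cannot be met */
            break;
        mapping[best] = HW;    /* move the most promising function to hardware */
    }
}
```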

24 Papers on HW/SW Co-Design
R. Ernst et al. Hardware-software cosynthesis for microcontrollers. IEEE Design & Test of Computers, December 1993.
R. K. Gupta and G. de Micheli. Hardware-software cosynthesis for digital systems. IEEE Design & Test of Computers, December 1993.
G. de Micheli and R. K. Gupta. Hardware/software co-design. Proceedings of the IEEE, March 1997.
... (and much, much more)
Electronic versions of these and other papers can be accessed via the KTH Library (www.lib.kth.se).

25 System design tasks
Design a heterogeneous multiprocessor architecture. Processing element (PE): CPU, accelerator, etc. Divide the tasks among the processing elements. Verify that the functionality of the system is correct and that the system meets the performance constraints.

26 Why accelerators?
Better cost/performance: custom logic may be able to perform an operation faster than a CPU of equivalent cost. CPU cost is a non-linear function of performance, so improving performance by choosing a faster CPU may be very expensive! [Figure: cost grows non-linearly with performance.] © 2000 Wolf (Morgan Kaufmann)

27 Accelerated system design
First, determine that the system really needs to be accelerated:
- Which core function(s) shall be accelerated? (partitioning)
- How much faster is the accelerator on the core function?
- How large is the data transfer overhead?
Design tasks: performance analysis, scheduling and allocation; design the accelerator itself; design the CPU interface to the accelerator. © 2000 Wolf (Morgan Kaufmann)

28 Performance analysis
The critical parameter is the speedup: how much faster is the system with the accelerator? The analysis must take into account:
- accelerator execution time
- data transfer time
- synchronization with the master CPU: the accelerator needs to know when it can start its computation, and the CPU needs to know when the results are ready.
© 2000 Wolf (Morgan Kaufmann)

29 Single- vs. multi-threaded
One critical factor is the available parallelism:
- single-threaded/blocking: the CPU waits for the accelerator;
- multithreaded/non-blocking: the CPU continues to execute along with the accelerator.
To multithread, the CPU must have useful work to do, but the software must also support multithreading (see the sketch below). © 2000 Wolf (Morgan Kaufmann)
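
A minimal sketch of the two styles in C, assuming a hypothetical memory-mapped accelerator with invented register addresses (start/status/argument/result):

```c
#include <stdint.h>

/* Hypothetical memory-mapped accelerator registers (addresses invented). */
#define ACC_BASE    0x80001000u
#define ACC_CTRL    (*(volatile uint32_t *)(ACC_BASE + 0x0))  /* write 1 to start */
#define ACC_STATUS  (*(volatile uint32_t *)(ACC_BASE + 0x4))  /* bit 0 = done     */
#define ACC_ARG     (*(volatile uint32_t *)(ACC_BASE + 0x8))
#define ACC_RESULT  (*(volatile uint32_t *)(ACC_BASE + 0xC))

void cpu_work(void);   /* other useful work for the CPU (placeholder) */

/* Single-threaded / blocking: the CPU starts the accelerator
 * and busy-waits until the result is ready. */
uint32_t run_blocking(uint32_t arg)
{
    ACC_ARG  = arg;
    ACC_CTRL = 1u;
    while ((ACC_STATUS & 1u) == 0)
        ;                       /* CPU idles while the accelerator runs */
    return ACC_RESULT;
}

/* Multi-threaded / non-blocking: the CPU starts the accelerator,
 * continues with other useful work, and collects the result later. */
uint32_t run_non_blocking(uint32_t arg)
{
    ACC_ARG  = arg;
    ACC_CTRL = 1u;              /* kick off the accelerator    */
    cpu_work();                 /* overlap CPU and accelerator */
    while ((ACC_STATUS & 1u) == 0)
        ;                       /* only wait if not done yet   */
    return ACC_RESULT;
}
```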

30 Sources of parallelism
- Overlap I/O and accelerator computation: perform operations in batches, read in the second batch of data while computing on the first batch.
- Find other work to do on the CPU: operations may be rescheduled to move work after the accelerator initiation.
© 2000 Wolf (Morgan Kaufmann)

31 Total execution time
[Figure: example schedules of processes P1-P4 and accelerator task A1 on the CPU and the accelerator. In the single-threaded case the CPU blocks while A1 executes, so all execution times add up; in the multi-threaded case execution splits, processes run on the CPU in parallel with A1, and the paths join again.] © 2000 Wolf (Morgan Kaufmann)

32 Communication Overhead
Data input/output times. Bus transactions include: flushing register/cache values to main memory; the time required for the CPU to set up the transaction; the overhead of data transfers by bus packets, handshaking, etc. © 2000 Wolf (Morgan Kaufmann)

33 Accelerator execution time
Total accelerator execution time: t_accel = t_in + t_x + t_out, where t_in is the data input time, t_x the accelerated computation time and t_out the data output time. © 2000 Wolf (Morgan Kaufmann)

34 Execution time analysis
Single-threaded: count the execution time of all component processes. Multi-threaded: find the longest path through the execution. [Figure: timelines for the CPU and the accelerator showing P1, A1 (with t_in, t_x, t_out as communication overhead and execution time), and P2, P3, P4.]
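
Assuming the schedule sketched in the figure (P1 before the accelerator call, P2 and P3 overlapped with A1 in the multi-threaded case, P4 after the join), the two analyses can be written as:

```latex
t_{\mathrm{single}} = t_{P1} + (t_{in} + t_x + t_{out}) + t_{P2} + t_{P3} + t_{P4}

t_{\mathrm{multi}}  = t_{P1} + \max\bigl(t_{in} + t_x + t_{out},\; t_{P2} + t_{P3}\bigr) + t_{P4}
```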

35 Example for Accelerator Architecture
[Figure: the accelerator consists of a bus interface, registers, a read unit, a write unit and the core; it is connected to the CPU, memory and a DMA controller via the system bus.] © 2000 Wolf (Morgan Kaufmann)

36 Accelerator/CPU interface
The accelerator provides control registers for the CPU. Data registers can be used for small data objects. The accelerator may include special-purpose read/write logic, which is especially valuable for large data transfers (see the sketch below). © 2000 Wolf (Morgan Kaufmann)
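
As an illustration, a hypothetical register map in C for an accelerator like the one on slide 35; all names, offsets and the base address are invented:

```c
#include <stdint.h>

/* Hypothetical register map of a memory-mapped accelerator,
 * as seen by the CPU through the bus interface. */
typedef struct {
    volatile uint32_t ctrl;      /* control: start, reset, interrupt enable */
    volatile uint32_t status;    /* status: busy, done, error               */
    volatile uint32_t data_in;   /* data register for small input operands  */
    volatile uint32_t data_out;  /* data register for small results         */
    volatile uint32_t src_addr;  /* base address of a large input buffer    */
    volatile uint32_t dst_addr;  /* base address of a large output buffer   */
    volatile uint32_t length;    /* transfer length for the read/write unit */
} acc_regs_t;

#define ACC ((acc_regs_t *)0x80001000u)    /* invented base address */

/* Large data set: the CPU only describes the buffers; the accelerator's
 * read/write units fetch and store the data themselves. */
static inline void acc_process_buffer(const uint32_t *src, uint32_t *dst, uint32_t n)
{
    ACC->src_addr = (uint32_t)(uintptr_t)src;
    ACC->dst_addr = (uint32_t)(uintptr_t)dst;
    ACC->length   = n;
    ACC->ctrl     = 1u;                    /* start */
    while ((ACC->status & 1u) == 0)
        ;                                  /* wait for done */
}
```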

37 Caching problems
Main memory provides the primary data transfer mechanism to the accelerator. Programs must ensure that caching does not lead to stale copies of main memory data (assume a cache in the CPU). © 2000 Wolf (Morgan Kaufmann)

38 Possible Problems with Caches
1. The CPU reads location S (S is loaded into the cache).
2. The accelerator writes location S in main memory.
3. The CPU reads location S again and gets the old value from its cache: wrong value!
© 2000 Wolf (Morgan Kaufmann)

39 Cache Coherence Problem
Cache coherence problems also appear in multiprocessor systems: cache and main memory do not have the same contents. The Avalon bus, like most on-chip buses, does not have a built-in mechanism to avoid these problems. [Figure: processors P1 ... Pn, each with a private cache, share main memory via a bus.]

40 Cache Coherence with Write-Through Caches
How to tackle cache coherence? Idea: the caches must be aware of the transactions on the bus! Add extra hardware (bus snooping) and define a protocol to be able to detect invalid data in the caches, and take action if a cache or the memory (in the case of write-back caches) is invalid. [Figure: snooping caches with a valid/invalid (V/I) state per line and cache-memory transitions governed by the cache coherence protocol.] More about cache coherence protocols in IL2207 SoC Architectures.

41 What to do if no cache coherence protocol exists?
The designer has to be aware of possible cache coherence problems; disciplined programming is needed. Use commands that explicitly bypass the cache if there is a risk of a cache coherence problem (see the sketch below).
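
On a Nios II system (as used in this course), the HAL offers such commands; a minimal sketch, assuming the standard HAL headers and an invented buffer address (check the exact names against the HAL documentation of your toolchain):

```c
#include <io.h>             /* IORD_32DIRECT / IOWR_32DIRECT: cache-bypassing accesses */
#include <sys/alt_cache.h>  /* alt_dcache_flush: write dirty lines back to memory      */

#define SHARED_BUF_BASE 0x02000000  /* invented address of a buffer shared with the accelerator */

void send_to_accelerator(unsigned int *buf, unsigned int n)
{
    /* Make sure the accelerator sees what the CPU wrote:
     * flush the buffer from the data cache to main memory. */
    alt_dcache_flush(buf, n * sizeof(unsigned int));
}

unsigned int read_result(void)
{
    /* Read a result the accelerator wrote to memory, bypassing
     * the data cache so that no stale cached value is returned. */
    return IORD_32DIRECT(SHARED_BUF_BASE, 0);
}
```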

42 Example Accelerator
[Data-flow graph: f(x) and g(y) feed h, which produces h(f(x), g(y)). Architecture: processor P, accelerator A and memory M connected by a bus.]

43 Execution Times
Both P and A have sufficient registers. P and A cannot access the bus simultaneously. A memory access (load or store) takes 1 time unit.

Function   P   A
f          5   2
g          5   2
h          5   -

44 Single-Processor Solution
All operations run on P: Load x (1), Load y (1), f (5), g (5), h (5), Store h(...) (1). Total: 18 time units.

45 Processor-Accelerator Solution I
The accelerator computes f and g, the processor computes h, strictly one after the other: Load x (1), Load y (1), f on A (2), g on A (2), Store f (1), Store g (1), Load f (1), Load g (1), h on P (5), Store h (1). Total: 16 time units. Still single-threaded!

46 Processor-Accelerator Solution II
P and A work in parallel: P loads y (1) and computes g (5); meanwhile A loads x (1), computes f (2) and stores f (1). P then loads f (1), computes h (5) and stores the result (1). Critical path on P: 1 + 5 + 1 + 5 + 1 = 13 time units. Exploitation of parallelism leads to a faster solution!
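
A hedged C sketch of how the schedule of slide 46 could look in software, reusing the hypothetical start/status register interface from the earlier sketch; all names and addresses are invented:

```c
#include <stdint.h>

/* Hypothetical accelerator interface for computing f(x) (addresses invented). */
#define ACC_CTRL    (*(volatile uint32_t *)0x80001000u)
#define ACC_STATUS  (*(volatile uint32_t *)0x80001004u)
#define ACC_ARG     (*(volatile uint32_t *)0x80001008u)
#define ACC_RESULT  (*(volatile uint32_t *)0x8000100Cu)

uint32_t g(uint32_t y);                 /* computed in software on P (5 time units) */
uint32_t h(uint32_t fx, uint32_t gy);   /* computed in software on P (5 time units) */

uint32_t solution_ii(uint32_t x, uint32_t y)
{
    ACC_ARG  = x;                       /* A: load x, compute f(x), store f */
    ACC_CTRL = 1u;

    uint32_t gy = g(y);                 /* P: compute g(y) in parallel with A */

    while ((ACC_STATUS & 1u) == 0)      /* with the numbers on slide 43, A is  */
        ;                               /* normally already finished here      */
    uint32_t fx = ACC_RESULT;           /* P: load f */

    return h(fx, gy);                   /* P: compute h and return/store it */
}
```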

47 System integration and debugging
Try to debug the CPU/accelerator interface separately from the accelerator core. Build equipment to test the accelerator. Hardware/software co-simulation can be useful. © 2000 Wolf (Morgan Kaufmann)

48 Summary
The use of a hardware accelerator can lead to a more efficient solution, in particular when the parallelism in the functionality can be exploited. Hardware/software co-design techniques can be used for the design of an accelerator. You have to be aware of cache coherence problems if the processor or accelerator uses a cache.

49 Configurable Processor Cores Ingo Sander ingo@kth.se

50 Motivation for Configurable Processor Cores
Observations: time-to-market is critical; the development time for software is much shorter than for hardware; hardware can be customized and has much better performance than a software solution.

51 Why Configurable Processor Cores?
Idea: combine the advantages of hardware and software in the form of a customizable processor to achieve a clearly shorter time-to-market than hardware and clearly better performance than software. Provide a processor platform with a basic architecture that can be extended by additional optimized units (MAC, floating-point unit). Custom instructions together with custom hardware can be defined for the processor.

52 Example of a configurable processor: Xtensa (Tensilica)
The Xtensa processor core targets system-on-chip applications; it is configurable, extensible and synthesizable. It has a base instruction set architecture, configurable (parametrised) functions, optional functions, and designer-defined functions and registers (for the acceleration of specific algorithms).

53 Xtensa Processor Core
[Figure: block diagram of the Xtensa processor core.]

54 Basic Xtensa Core
32-bit architecture. Base configuration: 32-bit ALU, up to 64 general-purpose registers, 6 special-purpose registers, 80 base instructions, improved 16- and 24-bit RISC instruction encoding.

55 Optional Architecture
Execution units: multipliers (16 and 32 bits), MAC unit, floating-point unit. Interface options. Memory subsystem options: memory management options, local data and instruction caches, separate RAM and ROM areas for data and instructions.

56 Tensilica Extension Language
The Tensilica Instruction Extension (TIE) language is used to describe new instructions, registers and execution units that are then automatically added to the Xtensa processor.

57 Xtensa Processor Design Process
[Figure: the Xtensa processor design flow.]

58 Design Flow
1. Choose the basic Xtensa processor
2. Specify the algorithm in C
3. Compile to the target processor
4. Profile and check whether the design constraints are met
5. If the constraints are met, everything is fine; otherwise
6. Choose optional functions (e.g. a multiplier) or design new instructions for the critical part => improved architecture
7. Adjust your code for the new architecture
8. Go back to 3.

59 Summary
The Xtensa concept provides not only a configurable architecture but also a design methodology. The idea is to take the best of both the hardware and the software world in order to get good performance and a short time-to-market. Xtensa processors can be used as parts of a system-on-chip architecture. Other extensible cores exist, such as the Nios II from Altera.

