
1 Introduction to Embedded Systems Rabie A. Ramadan rabieramadan@gmail.com http://www.rabieramadan.org/classes/2014/embedded/ 3

2 Memory 2 Memory Component Models Cache Memory Mapping

3 Memory Component Models 3

4 Multiport memories 4 Larger memory structures can be built from memory blocks. Memory Mapping is required

5 Register Files 5 The size of the register file is fixed when the CPU is predesigned. Register file size is a key parameter in CPU design that affects code performance and energy consumption as well as the area of the CPU. If the register file is too small, the program must spill values to main memory: the value is written to main memory and later read back. Spills cost both time and energy because main memory accesses are slower and more energy-intensive than register file accesses.

6 Register Files 6 If the register file is too large, then it consumes static energy as well as taking extra chip area that could be used for other purposes.

7 Caches 7 When designing an embedded system, we need to pay extra attention to the relationship between the cache configuration and the programs that use it. Too-small caches result in excessive main memory accesses; Too-large caches consume excess static power. Longer cache lines provide more prefetching bandwidth, which is useful in some algorithms but not others.

8 Caches 8 Line size affects prefetching behavior— Programs that access successive memory locations can benefit from the prefetching induced by long cache lines. Long lines can also, in some cases, provide reuse for very small sets of locations. Cache Memory Mapping is another issue
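
A minimal sketch of why line size matters (the array, its size, and the stride are illustrative, not from the slides): a sequential traversal uses every word of each fetched line, so long lines act as prefetching, while a large-stride traversal uses only one word per line and wastes the extra bandwidth.

    #include <stddef.h>

    #define N (1u << 20)
    static int a[N];                  /* illustrative array */

    /* Sequential access: with 16-word lines, one miss brings in
       15 further words the loop will soon use -- long lines help. */
    long sum_sequential(void) {
        long s = 0;
        for (size_t i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Strided access: each fetched line contributes one word,
       so long lines waste bandwidth and cache capacity. */
    long sum_strided(size_t stride) {
        long s = 0;
        for (size_t i = 0; i < N; i += stride)
            s += a[i];
        return s;
    }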

9 Wolfe and Lam Classification of the Behavior of Arrays 9

10 Caches 10 Several groups have proposed configurable caches whose configuration can be changed at runtime. Additional multiplexers and other logic allow a pool of memory cells to be used in several different cache configurations.

11 Scratch Pad Memories 11 A cache is designed to move a relatively small amount of memory close to the processor. Caches use hardwired algorithms to manage their contents: hardware determines when values are added to or removed from the cache. A scratch pad memory sits in parallel with the cache, but the scratch pad does not include hardware to manage its contents.

12 Scratch pad memory Part of the memory address space controlled by the processor Scratch pad is managed by software, not hardware. Provides predictable access time. Requires values to be allocated. Use standard read/write instructions to access scratch pad.
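
Because the scratch pad is just a region of the address space, software places data there explicitly. A hedged sketch for a GCC-style toolchain, assuming the linker script maps a section named ".scratchpad" onto the scratch pad's address range (the section name, buffer, and function are illustrative):

    #include <string.h>

    /* Assumed: the linker script places ".scratchpad" in the
       on-chip scratch pad address range. */
    static int coeffs[64] __attribute__((section(".scratchpad")));

    void load_coeffs(const int *src) {   /* src must hold >= 64 ints */
        /* Ordinary loads/stores reach the scratch pad; software,
           not hardware, decides what lives there and for how long. */
        memcpy(coeffs, src, sizeof(coeffs));
    }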

13 Memory Maps 13 A memory map for a processor defines how addresses get mapped to hardware. The total size of the address space is constrained by the address width of the processor. A 32-bit processor, for example, can address 2^32 locations, or 4 gigabytes (GB), assuming each address refers to one byte.

14 An ARM Cortex™-M3 architecture 14 Separates addresses used for program memory (labeled A) from those used for data memory (B and D). The memories are accessed via separate buses, permitting instructions and data to be fetched simultaneously. This effectively doubles the memory bandwidth. Such a separation of program memory from data memory is known as a Harvard architecture.

15 An ARM Cortex™-M3 architecture 15 Includes a number of on-chip peripherals (C): devices that are accessed by the processor using some of the memory addresses. Timers, ADCs, UARTs, and other I/O devices. Each of these devices occupies a few of the memory addresses by providing memory-mapped registers.
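
In C, a memory-mapped register is typically reached through a volatile pointer at a fixed address. The address and register layout below are hypothetical, for illustration only; they are not the actual Cortex-M3 map.

    #include <stdint.h>

    /* Hypothetical memory-mapped timer at an illustrative address. */
    #define TIMER_BASE 0x40000000u
    #define TIMER_LOAD (*(volatile uint32_t *)(TIMER_BASE + 0x0u))
    #define TIMER_CTRL (*(volatile uint32_t *)(TIMER_BASE + 0x8u))

    void timer_start(uint32_t ticks) {
        TIMER_LOAD = ticks;   /* a store to this address programs the device */
        TIMER_CTRL |= 1u;     /* set a hypothetical enable bit */
    }

The volatile qualifier keeps the compiler from caching or reordering these accesses, which matters because reads and writes here are device operations, not ordinary memory traffic.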

16 (figure: the Cortex-M3 memory map)

17 Memory Hierarchy The idea: hide the slower memory behind the fast memory. Cost and performance play major roles in selecting the memory.

18 Hit Vs. Miss Hit The requested data resides in a given level of memory. Miss The requested data is not found in the given level of memory. Hit rate The percentage of memory accesses found in a given level of memory. Miss rate The percentage of memory accesses not found in a given level of memory.

19 Hit Vs. Miss (Cont.) Hit time The time required to access the requested information in a given level of memory. Miss penalty The time required to process a miss: replacing a block in an upper level of memory, plus the additional time to deliver the requested data to the processor.

20 Miss Scenario The processor sends a request to the cache for location X. If found → cache hit. If not → try the next level. When the location is found → load the whole block into the cache, hoping that the processor will access one of the neighboring locations next. One miss may lead to multiple hits → locality. Can we compute the average access time based on this memory hierarchy?

21 Average Access Time Assume a memory hierarchy with three levels (L1, L2, and L3). What is the average memory access time? h1 → hit rate at L1; (1-h1) → miss rate at L1; t1 → L1 access time. h2 → hit rate at L2; (1-h2) → miss rate at L2; t2 → L2 access time. h3 → hit rate at L3 = 100%; (1-h3) → miss rate at L3; t3 → L3 access time.
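
One common formulation (a sketch, assuming each level is probed only after a miss in the previous one, that each ti is the full cost of an access satisfied at level i, and that h3 = 100% so every access is eventually satisfied):

    t_avg = h1*t1 + (1 - h1)*h2*t2 + (1 - h1)*(1 - h2)*t3

For example, with illustrative values h1 = 90%, h2 = 80%, t1 = 1, t2 = 10, and t3 = 100 time units: t_avg = 0.9*1 + 0.1*0.8*10 + 0.1*0.2*100 = 3.7 time units.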

22 Cache Mapping Schemes

23 Cache memory is smaller than the main memory. Only a few blocks can be loaded into the cache. The cache does not use the same memory addresses. Which block in the cache is equivalent to which block in the memory? The processor uses a Memory Management Unit (MMU) to convert the requested memory address to a cache address.

24 Direct Mapping Assigns cache mappings using a modular approach: j = i mod n, where j is the cache block number, i is the memory block number, and n is the number of cache blocks.
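
A minimal sketch of the direct-mapped lookup implied by j = i mod n (in C; the structure names and the fetch-on-miss policy are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define N_BLOCKS 128u                /* n: number of cache blocks */

    struct cache_block {
        bool     valid;
        uint32_t tag;                    /* which memory block is resident */
    };

    static struct cache_block cache[N_BLOCKS];

    /* i = memory block number; returns true on a hit. */
    bool lookup(uint32_t i) {
        uint32_t j   = i % N_BLOCKS;     /* j = i mod n: the only legal slot */
        uint32_t tag = i / N_BLOCKS;     /* remaining bits identify the block */
        if (cache[j].valid && cache[j].tag == tag)
            return true;                 /* hit */
        cache[j].valid = true;           /* miss: fetch and replace in place */
        cache[j].tag   = tag;
        return false;
    }

Note that no search and no replacement choice is needed: block i has exactly one legal slot, which is both the strength and the weakness of direct mapping.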

25 Example Given M memory blocks to be mapped to 10 cache blocks, show the direct mapping scheme. How do you know which block is currently in the cache?

26 Direct Mapping (Cont.) Bits in the main memory address are divided into three fields. Word → identifies a specific word in the block. Block → identifies a unique block in the cache. Tag → identifies which block from the main memory is currently in the cache.

27 Example Consider, for example, the case of a main memory consisting of 4K blocks, a cache memory consisting of 128 blocks, and a block size of 16 words. Show the direct mapping and the main memory address format.

28 Example (Cont.)
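
For reference, the arithmetic behind the address format (a worked sketch): block size 16 words → word field = log2(16) = 4 bits; 128 cache blocks → block field = log2(128) = 7 bits; 4K/128 = 32 memory blocks share each cache block → tag field = log2(32) = 5 bits. The full address is 4 + 7 + 5 = 16 bits, matching the 4K blocks × 16 words = 64K-word memory.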

29 Direct Mapping Advantages Easy. Does not require any search technique to find a block in the cache. Replacement is straightforward. Disadvantages Many blocks in main memory are mapped to the same cache block while other cache blocks may sit empty. Poor cache utilization.

30 Group Activity 1 Consider the case of a main memory consisting of 4K blocks, a cache memory consisting of 8 blocks, and a block size of 4 words. Show the direct mapping and the main memory address format.

31 Group Activity 2 Given the following direct mapping chart, what is the cache and memory location required by the following addresses: 311263 4202

32 Fully Associative Mapping Allows any memory block to be placed anywhere in the cache. A search technique is required to find the block number in the tag field.

33 Example We have a main memory with 2^14 words, a cache with 16 blocks, and a block size of 8 words. How many tag and word field bits? The word field requires 3 bits. The memory has 2^14 / 8 = 2^11 = 2048 blocks, so the tag field requires 11 bits.

34 Fully Associative Mapping Advantages Flexibility. Full utilization of the cache. Disadvantages Requires a tag search. Associative search → parallel search, which might require an extra hardware unit. Requires a replacement strategy if the cache is full. Expensive.

35 N-way Set Associative Mapping Combines direct and fully associative mapping. The cache is divided into sets of blocks; all sets are the same size. Main memory blocks are mapped to a specific set based on: s = i mod S, where s is the set to which memory block i maps and S is the total number of sets. An incoming block may be placed in any cache block inside its set.

36 N-way Set Associative Mapping Tag field → uniquely identifies the targeted block within the determined set. Word field → identifies the element (word) within the block that is requested by the processor. Set field → identifies the set.

37 Group Activity Compute the three parameters (Word, Set, and Tag) for a memory system having the following specification: size of the main memory is 4K blocks, size of the cache is 128 blocks, and the block size is 16 words. Assume that the system uses 4-way set-associative mapping.

38 Answer
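
The arithmetic, for reference: block size 16 words → word field = 4 bits; 128 cache blocks / 4 ways = 32 sets → set field = log2(32) = 5 bits; 4K/32 = 128 memory blocks map to each set → tag field = log2(128) = 7 bits. Total: 4 + 5 + 7 = 16 address bits, as in the direct-mapped example.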

39 N-way Set Associative Mapping Advantages Moderate cache utilization. Disadvantages Still needs a tag search inside the set.

40 If the cache is full and there is a need for block replacement, which one should be replaced?

41 Cache Replacement Policies Random Simple. Requires a random generator. First In First Out (FIFO) Replace the block that has been in the cache the longest. Requires keeping track of block lifetimes. Least Recently Used (LRU) Replace the block that has gone unused for the longest time. Requires keeping track of block history.

42 Cache Replacement Policies (Cont.) Most Recently Used (MRU) Replace the block that was used most recently. Requires keeping track of block history. Optimal Hypothetical: must know the future.

43 Example Consider the case of a 4×8 two-dimensional array of numbers, A. Assume that each number in the array occupies one word and that the array elements are stored in column-major order in the main memory from location 1000 to location 1031. The cache consists of eight blocks, each consisting of just two words. Assume also that, whenever needed, the LRU replacement policy is used. We would like to examine the changes in the cache if direct mapping is used as the following sequence of requests for the array elements is made by the processor:

44 Array elements in the main memory

45 (figure: cache contents for the sequence of requests)

46 Conclusion 16 cache misses. Not a single hit. 12 replacements. Only 4 cache blocks are used.

47 Group Activity Do the same in the case of fully associative and 4-way set-associative mappings.

48 Memory Models 48 Stacks A stack is a region of memory that is dynamically allocated to the program in a last-in, first-out (LIFO) pattern. A stack pointer (typically a register) contains the memory address of the top of the stack. Stacks are typically used to implement procedure calls.

49 Memory Models-Stacks 49 In C, the compiler produces code that pushes onto the stack: the location of the instruction to execute upon returning from the procedure, the current values of some or all of the machine registers, and the arguments to the procedure; it then sets the program counter equal to the location of the procedure code. Stack Frame The data for a procedure that is pushed onto the stack. When a procedure returns, the compiler pops its stack frame, retrieving the program location at which to resume execution.

50 Memory Models-Stacks 50 It can be disastrous if the stack pointer is incremented beyond the memory allocated for the stack - stack overflow. This results in overwriting memory that is being used for other purposes. The problem becomes particularly difficult with recursive programs, where a procedure calls itself. Embedded software designers often avoid using recursion to circumvent this difficulty.

51 Misuse or misunderstanding of the stack 51 When foo() is called, c is set to an address inside foo's stack frame; after foo() returns and its frame is popped, c ends up pointing at memory reused for other data (here, b), which causes an addressing problem.
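
The slide's code listing is not in the transcript; the following is a hedged reconstruction of this class of bug (foo, bar, and the variable names are illustrative): a pointer to a stack variable outlives the frame that holds it.

    #include <stdio.h>

    int *foo(void) {
        int x = 42;
        return &x;          /* BUG: x lives in foo's stack frame */
    }

    void bar(void) {
        int y = 7;          /* likely reuses the memory x occupied */
        (void)y;
    }

    int main(void) {
        int *c = foo();     /* c points into a popped stack frame */
        bar();              /* bar's frame overwrites that memory */
        printf("%d\n", *c); /* undefined behavior: *c is no longer 42 */
        return 0;
    }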

52 Memory Protection Units 52 A key issue in systems that support multiple simultaneous tasks is preventing one task from disrupting the execution of another. Many processors provide memory protection in hardware. Tasks are assigned their own address space, and if a task attempts to access memory outside its own address space, a segmentation fault or other exception results. This will typically result in termination of the offending application.

53 Memory Models- Dynamic Memory Allocation 53 General-purpose software applications often have indeterminate memory requirements, depending on parameters and/or user input. To support such applications, computer scientists have developed dynamic memory allocation schemes, in which a program can at any time request that the operating system allocate additional memory. The memory is allocated from a data structure known as a heap, which facilitates keeping track of which portions of memory are in use by which application.

54 Memory Models- Dynamic Memory Allocation 54 Memory allocation occurs via an operating system call (such as malloc in C). When the program no longer needs access to memory that has been so allocated, it deallocates the memory (by calling free in C). It is possible for a program to inadvertently accumulate memory that is never freed. This is known as a memory leak. For embedded applications, which typically must continue to execute for a long time, a leak can be disastrous: the program will eventually fail when physical memory is exhausted.
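
A minimal sketch of the kind of leak the slide warns about (the function and sizes are illustrative): the buffer is allocated on every call but freed on only one path, so a long-running task slowly exhausts the heap.

    #include <stdlib.h>

    int process_sample(int sample) {
        char *buf = malloc(256);   /* allocated on every call */
        if (buf == NULL)
            return -1;
        if (sample < 0)
            return -1;             /* BUG: early return skips free(buf) */
        /* ... use buf ... */
        free(buf);                 /* only this path releases the memory */
        return 0;
    }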

55 Memory Models- Dynamic Memory Allocation 55 memory fragmentation occurs when a program chaotically allocates and deallocates memory in varying sizes. A fragmented memory has allocated and free memory chunks interspersed, and often the free memory chunks become too small to use. In this case, defragmentation is required. Defragmentation and garbage collection are both very problematic for real-time systems. Straightforward implementations of these tasks require all other executing tasks to be stopped while the defragmentation or garbage collection is performed. Implementations using such “stop the world” techniques can have substantial pause times, running sometimes for many milliseconds.

56 Programs 56

57 Topics 57 Code Compression Code generation and back-end compilation. Memory-oriented software optimizations.

58 Code Compression 58 Memory is one of the key driving factors in embedded system design: a larger memory means increased chip area, more power dissipation, and higher cost. Memory also imposes constraints on the size of application programs. Code compression techniques address the problem by reducing the program size.

59 Traditional Code Compression 59 Compression is done off-line (prior to execution). The compressed program is loaded into the memory. Decompression is done during program execution (online).

60 Dictionary-based Approach 60 Takes advantage of commonly occurring instruction sequences by using a dictionary. The repeating occurrences are replaced by a codeword that points to the index of the dictionary entry that contains the pattern.

61 Improved Dictionary-based Approach 61 Improves the dictionary-based compression technique by considering mismatches. Step 1: Determine the instruction sequences that differ in a few bit positions (Hamming distance). Step 2: Store that information in the compressed program. Step 3: Update the dictionary (if necessary). The compression ratio will depend on how many bit changes are considered during compression.

62 Example 62 This example considers only a 1-bit change: the third pattern (from the top) in the original program differs from the first dictionary entry (index 0) in the sixth bit position (from the left). The compression ratio for this example is 95%.
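
A minimal sketch of the matching step, assuming 32-bit instruction words and a tolerance of one mismatched bit as in the example (the dictionary layout and encoding format are illustrative; __builtin_ctz is the GCC-style count-trailing-zeros builtin):

    #include <stdint.h>

    /* True exactly when x has a single 1 bit (Hamming weight 1). */
    static int one_bit_set(uint32_t x) {
        return x != 0 && (x & (x - 1)) == 0;
    }

    /* Returns the dictionary index for an exact or 1-bit match and
       reports the mismatched bit position via *bitpos (-1 if exact).
       Returns -1 if the word must be stored uncompressed. */
    int match(uint32_t word, const uint32_t *dict, int dict_len, int *bitpos) {
        for (int i = 0; i < dict_len; i++) {
            uint32_t diff = word ^ dict[i];
            if (diff == 0) { *bitpos = -1; return i; }   /* exact match */
            if (one_bit_set(diff)) {                     /* Hamming distance 1 */
                *bitpos = __builtin_ctz(diff);           /* which bit to flip */
                return i;
            }
        }
        return -1;                                       /* no usable match */
    }

The decompressor simply fetches dict[i] and, when bitpos is not -1, flips that one bit to recover the original instruction.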

63 Code Compression Using Bit-Masks 63 Your Reading Homework Link A presentation is required – I will randomly select one of you to explain it next time.

64 Memory Optimization Techniques 64

65 PLATFORM-INDEPENDENT CODE TRANSFORMATIONS 65 Code Rewriting Techniques for Access Locality and Regularity Consisting of loop (and sometimes also data flow) transformations. Should this algorithm be implemented directly?

66 Code Rewriting Techniques for Access Locality and Regularity 66 The original code results in high storage and bandwidth requirements (assuming that N is large): the b[] signals have to be written to an off-chip background memory in the first loop and read back in the second loop.

67 Code Rewriting Techniques for Access Locality and Regularity 67 Rewriting the code using a loop merging transformation gives the following: the b[] signals can be stored in registers up to the end of the accumulation, since they are consumed immediately after they have been produced. In the overall algorithm, this reduces memory bandwidth requirements significantly.
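
The slide's actual code is not in the transcript; the following is a hedged reconstruction of the transformation (f() and the sizes are placeholders). Before merging, every b[i] is produced in one loop and consumed in another; after merging, each b value is consumed as soon as it is produced and can live in a register.

    #define N 1024
    static int f(int x) { return x * x; }  /* placeholder computation */

    int sum_unmerged(const int *a) {
        static int b[N];      /* N values stored, then re-read: memory traffic */
        int s = 0;
        for (int i = 0; i < N; i++)
            b[i] = f(a[i]);   /* first loop: produce all of b[] */
        for (int i = 0; i < N; i++)
            s += b[i];        /* second loop: read all of b[] back */
        return s;
    }

    int sum_merged(const int *a) {
        int s = 0;
        for (int i = 0; i < N; i++) {
            int b = f(a[i]);  /* b lives in a register */
            s += b;           /* consumed immediately: no array, no traffic */
        }
        return s;
    }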

68 Code Rewriting Techniques to Improve Data Reuse 68 It is important to optimize data transfers and storage to utilize the memory hierarchy efficiently. The compiler literature has so far focused on improving data reuse by performing loop transformations. Hierarchical data-reuse copies are added to the code, exposing the different levels of reuse. This depends on knowledge of the memory hierarchy and the sizes of its levels, and it remains hard to implement as well as to understand.

69 Code Rewriting Techniques to Improve Data Reuse 69 Only part of each array is accessed in the inner loops. Make those parts ready in buffers.
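
A minimal sketch of adding a data-reuse copy, with illustrative names and sizes: the inner loops touch only one row of the array at a time, so that row is staged once into a small buffer (e.g., scratch pad or cache-resident) that the repeated inner loops then reuse.

    #include <string.h>

    #define ROWS 256
    #define COLS 64

    int process(const int a[ROWS][COLS]) {
        int buf[COLS];                      /* reuse copy for one row */
        int s = 0;
        for (int r = 0; r < ROWS; r++) {
            memcpy(buf, a[r], sizeof(buf)); /* stage the row once */
            for (int pass = 0; pass < 8; pass++)    /* inner loops reuse buf */
                for (int c = 0; c < COLS; c++)
                    s += buf[c] * (pass + 1);
        }
        return s;
    }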

70 Memory Estimation 70 One of the techniques is based on live elements (signals) and requires a dependency graph. In computer science, a dependency graph is a directed graph representing the dependencies of several instructions on one another.

71 Example 71

72 Let's build the dependency graph 72

73 (figure: the resulting dependency graph)

74 Dependences 74 Instruction Dependency The operation performed by a stage depends on the operation(s) performed by other stage(s). E.g., a conditional branch: instruction I4 cannot be executed until the branch condition in I3 is evaluated and stored. The branch takes 3 units of time.

75 Dependences 75 Data Dependency: a source operand of instruction Ii depends on the results of executing a preceding instruction Ij, i > j. E.g., Ii cannot be fetched unless the results of Ij are saved.

76 Data Dependency Write after write (WAW). Read after write (RAW). Write after read (WAR). Read after read (RAR) → does not cause a stall.
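
For concreteness, a short illustrative register-level sequence (not from the slides) showing the three hazardous cases:

    I1: R1 = R2 + R3
    I2: R4 = R1 + R5    read after write (RAW): I2 reads R1, which I1 writes
    I3: R1 = R6 + R7    write after write (WAW) with I1, and write after
                        read (WAR) with I2, both on R1

Read after read (two instructions merely reading the same register) imposes no ordering between them and causes no stall.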

77 Read after write

78 Example Consider the execution of the following sequence of instructions on a five-stage pipeline consisting of IF, ID, OF, IE, and IS. Show all types of data dependency

79 Answer

80 Memory Modeling 80 Based on the dependency and data flow graph, all variables that need to be preserved over more than one control step are stored in registers. The number of registers assigned to the variables is minimized because the register count impacts the area of the resulting design.

81 Register Allocation by Graph Coloring 81 The lifetime of each variable is computed first. Then a graph is constructed whose nodes represent variables; the existence of an edge indicates that the lifetimes overlap, i.e., the two variables cannot share the same register. A register can only be shared by variables with nonoverlapping lifetimes. Thus, the problem of minimizing the register count for a given set of variables and their lifetimes is equivalent to the graph coloring problem: assign colors to the nodes of the graph such that the total number of colors is minimal and no two adjacent nodes share the same color.
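
A minimal greedy-coloring sketch (illustrative only; optimal graph coloring is NP-hard, so practical allocators use heuristics like this one): variables are nodes, overlapping lifetimes are edges, and each color stands for a register.

    #include <stdbool.h>

    #define MAX_VARS 32

    /* overlap[i][j] is true when variables i and j are live at the
       same time (an edge in the interference graph); n <= MAX_VARS. */
    int color_registers(int n, const bool overlap[MAX_VARS][MAX_VARS],
                        int color[MAX_VARS]) {
        int used = 0;                     /* registers (colors) used so far */
        for (int i = 0; i < n; i++) {
            bool taken[MAX_VARS] = { false };
            for (int j = 0; j < i; j++)   /* colors of colored neighbors */
                if (overlap[i][j])
                    taken[color[j]] = true;
            int c = 0;
            while (taken[c])              /* smallest color no neighbor uses */
                c++;
            color[i] = c;
            if (c + 1 > used)
                used = c + 1;
        }
        return used;                      /* register count for this heuristic */
    }

Variables that get the same color have nonoverlapping lifetimes and can safely share one register; the return value is the register count this heuristic achieves.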

