
1 High Performance Embedded Computing © 2007 Elsevier
Lecture 7: Memory Systems & Code Compression
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf

2 Topics
- Branching in VLIW (from last time).
- Memory systems.
  - Memory component models.
  - Caches and alternatives.
- Code compression.

3 Branching in VLIW Processors
- Operations in the same VLIW instruction as a branch must not depend on the branch outcome.
- VLIWs may use static branch prediction.
  - Profiling is used to predict the branch direction.
  - Different branch instructions are used depending on whether the branch is predicted taken or not taken.
- VLIWs sometimes have specialized loop instructions, which execute an instruction word or set of instruction words a specified number of times.
- VLIWs sometimes have predicated instructions, which execute conditionally based on the value in a "predicate register" (see the C sketch below):
  cmpgt p1 = r1, r2 ;;
  sub (p1) r3 = r1, r2, sub (!p1) r3 = r2, r1 ;;
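The predicated pair above computes r3 = |r1 - r2| without a branch. As an illustration (not from the slides), a minimal C sketch of the same if-conversion: both subtractions execute unconditionally and the predicate selects the result, which is exactly the form a compiler exposes to predicated hardware.

#include <stdio.h>

/* Both arms execute unconditionally; the predicate p1 selects the
   result, mirroring the predicated VLIW sequence above. */
int abs_diff(int r1, int r2)
{
    int p1 = r1 > r2;      /* cmpgt p1 = r1, r2     */
    int t  = r1 - r2;      /* sub (p1)  r3 = r1, r2 */
    int f  = r2 - r1;      /* sub (!p1) r3 = r2, r1 */
    return p1 ? t : f;     /* predicate selects r3  */
}

int main(void)
{
    printf("%d\n", abs_diff(3, 7));   /* prints 4 */
    return 0;
}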

4 Branching in VLIW Processors
VLIWs often execute branches in two phases:
- Prepare the branch condition (store it in a branch register).
- Execute the branch based on the condition.

## Implements: if (a > 0 && b < 0) { [Block 1] }
## where a is in $r1, b is in $r2, and 0 is in $r0
cmpgt $r3 = $r1, $r0   ## (a > 0)
cmplt $r4 = $r2, $r0   ## (b < 0)
;;
and $b0 = $r3, $r4
;;
br $b0, L1
;;
## L1 starts Block 1

5 Generic memory block (figure not reproduced)

6 Simple memory model
The core array is n rows x m columns.
- Total area: A = A_r + A_x + A_p + A_c.
- Row decoder area: A_r = a_r n.
- Core area: A_x = a_x m n.
- Precharge circuit area: A_p = a_p m.
- Column decoder area: A_c = a_c m.
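A minimal C sketch of this area model; the a_* coefficients are technology parameters that a caller would supply, not values from the book.

/* Area model from the slide above. Real coefficient values come from
   the process technology. */
typedef struct {
    double a_r, a_x, a_p, a_c;   /* unit areas: row decoder, core cell,
                                    precharge, column decoder */
} area_coeffs;

double mem_area(int n, int m, area_coeffs k)   /* n rows x m columns */
{
    double A_r = k.a_r * n;       /* row decoder, one driver per row */
    double A_x = k.a_x * m * n;   /* core array                      */
    double A_p = k.a_p * m;       /* precharge, one circuit per column */
    double A_c = k.a_c * m;       /* column decoder                  */
    return A_r + A_x + A_p + A_c;
}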

7 Simple energy and delay models
- Delay: Δ = Δ_setup + Δ_r + Δ_x + Δ_bit + Δ_c.
- Total energy: E = E_D + E_S.
  - The static energy component E_S is a technology parameter.
  - Dynamic energy: E_D = E_r + E_x + E_p + E_c.
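The sums translate directly into code; a small sketch in which every component value is a hypothetical technology- and design-dependent input.

/* Delay and energy sums from the slide above. */
double mem_delay(double d_setup, double d_r, double d_x,
                 double d_bit, double d_c)
{
    return d_setup + d_r + d_x + d_bit + d_c;   /* total access delay */
}

double mem_energy(double E_r, double E_x, double E_p, double E_c,
                  double E_S)
{
    double E_D = E_r + E_x + E_p + E_c;         /* dynamic energy */
    return E_D + E_S;                           /* E = E_D + E_S  */
}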

8 Multiport memories - delay
[Figure: delay vs. memory size and number of ports, for a multiport memory with ports R_addr1/R_data1, R_addr2/R_data2, and W_addr/W_data.]

9 Multiport memories - area
The area of a memory often grows with the square of the number of ports: each added port needs its own word line in every row and its own bit lines in every column, so both dimensions of the array grow roughly linearly with the port count.

10 Kamble and Ghose cache power model (figure not reproduced)

11 Kamble/Ghose, cont'd.
The cache is m-way set-associative, with a capacity of D bytes, T tag bits, L-byte lines, and St status bits per line.
[Bit-line energy equation not reproduced. Note: the version of this equation printed in the book is incorrect.]

12 Kamble/Ghose, cont'd.
[Equations for word line energy, output line energy, and address input line energy not reproduced.]

13 Shiue and Chakrabarti cache energy model
- add_bs: number of transitions on the address bus per instruction.
- data_bs: number of transitions on the data bus per instruction.
- word_line_size: number of memory cells on a word line.
- bit_line_size: number of memory cells on a bit line.
- E_m: energy consumption of a main memory access.
- The remaining (Greek-letter) constants in the model are technology parameters.

14 Shiue/Chakrabarti, cont'd. (equations not reproduced)

15 Register files
- The register file is the first stage in the memory hierarchy.
- When too many values are live at once, some must be spilled to the cache or main memory and read back later (see the sketch below).
  - Spills cost time and energy.
- Register file parameters:
  - Number of words.
  - Number of ports.
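An illustrative, hypothetical C kernel: the eight accumulators below are all live across the loop, together with the pointers and loop state, so on a machine with a small register file the compiler must spill some of them to the stack and reload them each iteration.

/* Eight accumulators live across the whole loop; with too few
   architectural registers, some get spilled to memory. */
long dot8(const long *a, const long *b, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (int i = 0; i + 8 <= n; i += 8) {
        s0 += a[i]     * b[i];      s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];  s3 += a[i + 3] * b[i + 3];
        s4 += a[i + 4] * b[i + 4];  s5 += a[i + 5] * b[i + 5];
        s6 += a[i + 6] * b[i + 6];  s7 += a[i + 7] * b[i + 7];
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}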

16 Performance and energy vs. register file size [Weh01] © 2001 IEEE (figure not reproduced)

17 Cache size vs. energy [Li98] © 1998 IEEE (figure not reproduced)

18 Cache parameters
- Cache size: larger caches hold more data, but consume more static energy and take area away from other functions.
- Number of sets: more sets allow more independent references; fewer sets map more locations onto each line, causing conflicts.
- Cache line length: longer lines give more prefetching bandwidth, but often result in higher energy consumption.
What impact does each of these have on energy? (The sketch below shows how the three parameters determine the address mapping.)
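A minimal sketch of the geometry relations (field names and types are mine, not from the book): the number of sets is size / (line x associativity), and an address splits into tag, set index, and line offset.

#include <stdint.h>

typedef struct {
    uint32_t size;    /* capacity in bytes    */
    uint32_t line;    /* line length in bytes */
    uint32_t assoc;   /* ways per set         */
} cache_cfg;

uint32_t num_sets(cache_cfg c)
{
    return c.size / (c.line * c.assoc);       /* sets = size/(line*ways)  */
}

uint32_t set_index(cache_cfg c, uint32_t addr)
{
    return (addr / c.line) % num_sets(c);     /* which set a line maps to */
}

uint32_t tag_of(cache_cfg c, uint32_t addr)
{
    return addr / (c.line * num_sets(c));     /* tag stored with the line */
}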

19 Multilevel cache optimization
Gordon-Ross et al. adjust the cache parameters in order:
- Cache size.
- Line size.
- Associativity.
They tune cache size for the first level, then the second level; then line size for the first and second levels; then associativity for the first and second levels (see the sketch below).
Why vary the parameters in this order?
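A hedged sketch of this ordered search, not the authors' actual code: eval_energy() is a made-up stand-in for a trace-driven cache simulation, and the candidate value lists are placeholders.

typedef struct { int size, line, assoc; } cfg;

/* Placeholder cost: a real flow would simulate the two-level
   hierarchy on a program trace and return total memory energy. */
static double eval_energy(cfg l1, cfg l2)
{
    return (double)l1.size / 1024 + (double)l2.size / 8192
         + l1.assoc + l2.assoc + 64.0 / l1.line + 64.0 / l2.line;
}

/* Tune one parameter for L1, then the same parameter for L2,
   keeping the best-energy value at each step. */
static void pick_best(cfg *l1, cfg *l2, int *f1, int *f2,
                      const int *vals, int n)
{
    int *fields[2] = { f1, f2 };
    for (int lvl = 0; lvl < 2; lvl++) {
        int best = vals[0];
        double best_e = 1e300;
        for (int i = 0; i < n; i++) {
            *fields[lvl] = vals[i];
            double e = eval_energy(*l1, *l2);
            if (e < best_e) { best_e = e; best = vals[i]; }
        }
        *fields[lvl] = best;
    }
}

void tune_hierarchy(cfg *l1, cfg *l2)
{
    static const int sizes[]  = { 2048, 4096, 8192, 16384 };
    static const int lines[]  = { 16, 32, 64 };
    static const int assocs[] = { 1, 2, 4 };
    pick_best(l1, l2, &l1->size,  &l2->size,  sizes,  4);  /* size first */
    pick_best(l1, l2, &l1->line,  &l2->line,  lines,  3);  /* then line  */
    pick_best(l1, l2, &l1->assoc, &l2->assoc, assocs, 3);  /* then assoc */
}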

20 Scratch pad memory
- A scratch pad is managed by software, not hardware.
  - It provides predictable access times.
  - It requires values to be explicitly allocated to it.
- Standard read/write instructions access the scratch pad (see the allocation sketch below).
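As a hedged sketch, one common way to allocate values to a scratch pad is a dedicated linker section. GCC's __attribute__((section(...))) is real syntax, but ".scratchpad" is a made-up section name; the real one comes from the platform's linker script. Ordinary loads and stores then reach the scratch pad with a fixed latency.

/* ".scratchpad" is a hypothetical section name defined by the
   board's linker script. */
__attribute__((section(".scratchpad")))
static int coeffs[64];                /* placed in the SPM at link time */

int filter_tap(int x, int i)
{
    /* An ordinary load: no tags, no misses, predictable latency. */
    return x * coeffs[i & 63];
}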

21 Code compression
An extreme version of instruction encoding:
- Use variable-length instruction encodings.
- Generate the encodings using compression algorithms.
Compressed code generally takes longer to decode, yet can improve performance, energy, and code size. How?
IBM CodePack (PowerPC) used Huffman encoding.

22 Terms
Compression ratio:
- (Compressed code size / uncompressed code size) x 100%; for example, 60 KB of compressed code from 100 KB of uncompressed code gives a 60% ratio (lower is better).
- Must take all overheads (tables, padding between blocks) into account.

23 Wolfe/Chanin approach
- Object code is fed to a lossless compression algorithm.
  - Wolfe and Chanin used Huffman's algorithm.
- The compressed object code becomes the program image.
- Code is decompressed on the fly during execution.
Flow: source code → compiler → object code → compressor → compressed object code.

24 Wolfe/Chanin execution
- Instructions are decompressed as they are read from main memory; data is not compressed or decompressed.
- The cache holds uncompressed instructions.
  - Fetching instructions from memory therefore has longer latency.
- The CPU does not require significant modifications.
Datapath: memory → decompressor → cache → CPU.

25 Huffman coding
- The input stream is a sequence of symbols, each with a known probability of occurrence.
- Construct a binary tree of probabilities from the bottom up, repeatedly merging the two least probable nodes.
  - The path from the root to a symbol's leaf gives the code for that symbol (see the sketch below).
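A minimal C sketch of the construction; the four-symbol alphabet and its probabilities are made-up examples. Each iteration merges the two least probable live nodes, and the root-to-leaf path spells each symbol's code.

#include <stdio.h>

#define NSYM 4

typedef struct { double p; int left, right; } node;   /* -1 marks a leaf */

static void print_codes(const node *t, int i, char *buf, int depth,
                        const char *names)
{
    if (t[i].left < 0) {                  /* leaf: print its code */
        buf[depth] = '\0';
        printf("%c: %s\n", names[i], buf);
        return;
    }
    buf[depth] = '0';
    print_codes(t, t[i].left,  buf, depth + 1, names);
    buf[depth] = '1';
    print_codes(t, t[i].right, buf, depth + 1, names);
}

int main(void)
{
    const char names[NSYM] = { 'a', 'b', 'c', 'd' };
    const double probs[NSYM] = { 0.50, 0.25, 0.15, 0.10 };  /* assumed */
    node t[2 * NSYM];
    int alive[2 * NSYM];
    int next = NSYM;

    for (int i = 0; i < NSYM; i++) {
        t[i].p = probs[i]; t[i].left = t[i].right = -1; alive[i] = 1;
    }
    for (int count = NSYM; count > 1; count--) {
        int m1 = -1, m2 = -1;             /* two least probable nodes */
        for (int i = 0; i < next; i++) {
            if (!alive[i]) continue;
            if (m1 < 0 || t[i].p < t[m1].p)      { m2 = m1; m1 = i; }
            else if (m2 < 0 || t[i].p < t[m2].p) { m2 = i; }
        }
        t[next].p = t[m1].p + t[m2].p;    /* merge into a new node */
        t[next].left = m1; t[next].right = m2;
        alive[m1] = alive[m2] = 0; alive[next] = 1;
        next++;
    }
    char buf[2 * NSYM];
    print_codes(t, next - 1, buf, 0, names);  /* a:0 b:10 d:110 c:111 */
    return 0;
}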

26 Wolfe/Chanin results [Wol92] © 1992 IEEE (figure not reproduced)

27 Compressed vs. uncompressed code
- Branches require decompression to start from many different points, but compression algorithms are designed to decode from the start of a stream.
- Compressed code is therefore organized into blocks.
  - Decompression starts at the beginning of a block.
  - Unused bits between blocks constitute overhead.
[Figure: an uncompressed block (add r1, r2, r3; mov r1, a; bne r1, foo) beside its compressed counterpart.]

28 Block structure and compression
Trade-off:
- Compression algorithms work best on long blocks.
- Program branching works best with short blocks.
Labels in the program move during compression. Two approaches (see the table sketch below):
- Wolfe and Chanin used a branch table to translate branch targets during execution (adds code size).
- Lefurgy et al. patched the compressed code so branches refer directly to compressed locations.
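A hedged sketch in the spirit of the table-based approach, not the paper's actual format: the field widths, names, and the 32-byte block size are illustrative. A table records where each uncompressed block starts in the compressed image, so decompression can begin at any branch target.

#include <stdint.h>

#define BLOCK_BYTES 32                      /* uncompressed block size */

typedef struct {
    uint32_t comp_offset;   /* block's byte offset in compressed image */
} lat_entry;

/* Translate an uncompressed PC into the compressed address at which
   its block's decompression must start. */
uint32_t compressed_addr(const lat_entry *table, uint32_t pc)
{
    uint32_t block = pc / BLOCK_BYTES;      /* block holding the target */
    return table[block].comp_offset;
}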

29 Compression ratio vs. block size [Lek99b] © 1999 IEEE (figure not reproduced)

30 Pre-cache compression
- Decompress instructions as they come out of the cache.
  - One instruction may be decompressed many times.
  - The program has a smaller cache footprint.
- Why might a different type of decompression engine be needed?

31 Compression algorithms
- A large number of algorithms exist for compressing data, but they were designed under different constraints:
  - Large text files.
  - No real-time or power constraints.
- Existing algorithms must be evaluated under the requirements of code compression, and new algorithms developed.
  - Several such algorithms are discussed in the textbook.

32 Code and data compression
- Unlike (non-modifiable) code, data must be both compressed and decompressed dynamically.
  - Data is compressed before it is written back to main memory.
- This can substantially reduce the main memory or cache footprint.
- It involves different trade-offs than code compression.

33 Lempel-Ziv algorithms
- Dictionary-based methods.
- The decoder rebuilds the dictionary during decompression, so the dictionary need not be transmitted.
- The LZW variant uses a fixed-size dictionary (see the sketch below).
Flow: source text → coder (with dictionary) → compressed text → decoder (with dictionary) → uncompressed source.
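A minimal LZW coder sketch; the dictionary size, linear lookup, and printing codes as integers are simplifications for clarity. Codes 0-255 denote raw bytes, each new entry pairs an existing code with one extension byte, and the decoder can rebuild the same table from the code stream alone.

#include <stdio.h>
#include <string.h>

#define DICT_MAX 4096                 /* fixed-size dictionary */

typedef struct { int prefix; unsigned char ch; } entry;

void lzw_compress(const unsigned char *in, int n)
{
    entry dict[DICT_MAX];
    int next = 256;                   /* 0..255 are the raw bytes */
    int w = in[0];                    /* current match, as a code */

    for (int i = 1; i < n; i++) {
        unsigned char c = in[i];
        int found = -1;               /* is (w + c) already a code? */
        for (int j = 256; j < next; j++)
            if (dict[j].prefix == w && dict[j].ch == c) { found = j; break; }
        if (found >= 0) {
            w = found;                /* extend the current match */
        } else {
            printf("%d ", w);         /* emit code for w */
            if (next < DICT_MAX) {    /* grow until the table is full */
                dict[next].prefix = w; dict[next].ch = c; next++;
            }
            w = c;                    /* restart the match at c */
        }
    }
    printf("%d\n", w);                /* flush the final match */
}

int main(void)
{
    const char *s = "abababab";
    lzw_compress((const unsigned char *)s, (int)strlen(s));
    return 0;                         /* prints: 97 98 256 258 98 */
}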

34 Lempel-Ziv example (worked example figure not reproduced)

35 MXT
- Tremaine et al.'s MXT has a three-level cache system.
  - Level 3 is shared among several processors and connected to main memory.
  - Data and code are compressed/decompressed as they move between main memory and the level 3 cache.
- It uses a variant of the Lempel-Ziv 1977 algorithm.
  - All compression engines share the same dictionary.
  - Typically, 1 KB blocks are divided into four 256-byte blocks for compression; these can be decompressed in parallel.

