Lower Power Algorithm for Multimedia Systems

Presentation on theme: "Lower Power Algorithm for Multimedia Systems"— Presentation transcript:

1 Lower Power Algorithm for Multimedia Systems
Jun Dong Cho, SungKyunKwan Univ.

2 Contents
Algorithmic Effects on Low Power
Low Power Management
Low Power Applications:
  Low Power Video Processor
  Single Chip Video Camera
  Vector Quantization
  Data Encoding
  CDMA Searcher
  Viterbi Decoder

3 Low Power Algorithm

4 Algorithm Selection Example: 8x8 matrix DCT

5 Strength Reduction: DIGLOG multiplier
[Table: worst-case error and probability of error < 1% after the 1st, 2nd, and 3rd iterations. Only one entry survives: probability of error < 1% is 10% after the 1st iteration.]
With an 8 by 8 multiplier, the exact result is obtained in at most seven iteration steps (worst case).
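
The iterative approximation behind this slide can be sketched in a few lines. This is a minimal sketch, assuming the usual DIGLOG decomposition in which each operand is split into its leading power of two plus a residual (A = 2^j + A_r, B = 2^k + B_r), so that A·B = 2^(j+k) + 2^j·B_r + 2^k·A_r + A_r·B_r and each iteration approximates the leftover product A_r·B_r the same way. The function name `diglog_multiply` is hypothetical.

```python
def diglog_multiply(a: int, b: int, iterations: int) -> int:
    """Approximate a * b for non-negative integers using only shifts and adds."""
    result = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break  # residual product is zero, so the result is already exact
        j = a.bit_length() - 1      # position of the leading one in a
        k = b.bit_length() - 1      # position of the leading one in b
        ar = a - (1 << j)           # residual of a
        br = b - (1 << k)           # residual of b
        # 2^(j+k) + 2^k * ar + 2^j * br, all shift-and-add operations
        result += (1 << (j + k)) + (ar << k) + (br << j)
        a, b = ar, br               # the remaining error is ar * br
    return result


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, diglog_multiply(13, 11, n), "exact:", 13 * 11)
```

Each extra iteration trades a little more switching for a smaller error, which is why the accuracy figures are quoted per iteration.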

6 Logarithmic Number System
--> Significant Strength Reduction

7 Switching Activity Reduction
(a) Average activity in a multiplier as a function of the constant value
(b) Parallel and serial implementations of an adder tree

8 System-Level Solutions
System management, system partitioning, algorithm selection.
Precompute the physical capacitance of interconnect and the switching activity (number of bus accesses).
Regularity: minimizes the power in the control hardware and the interconnection network.
Modularity: exploits data locality through distributed processing units, memories, and control.
Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity.
Temporal locality: average lifetimes of variables (less temporary storage; data referenced in the recent past is likely to be accessed again).
Few memory references: references to memories are expensive in terms of power.

9 System-Level Solutions - cont.
Simulator: Instruction-level Energy Estimation
Software: Energy Efficient Algorithms
OS: Voltage Scheduling Algorithms
OS: Multiprocessing for Energy
Microprocessor: Dynamic Caches

10 Processor Systems: High Power
Thinkpad (Pentium) → 0.3 Hours/AA
InfoPad (ARM) → 0.8 Hours/AA
Toshiba Portable (486) → 0.9 Hours/AA
Newton (ARM) → 2.0 Hours/AA
Operations per battery life: minimize energy consumed per operation.
Operations per second: maximize throughput (throughput ≡ operations/second).

11 DPM vs SPM
Identify power-hungry modules and look for opportunities to reduce power.
DPM (Dynamic Power Management): stops the clock, produced by the clock generators, from switching in a specific unit.
SPM (Static Power Management): when the system remains idle for a significant period of time, it is shut down.

12 Vdd vs Delay
Use variable voltage scaling or scheduling for real-time processing.
Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining, which increase concurrency and shorten the critical path.
Scale down device sizes to compensate for delay (interconnects do not scale proportionately and can become dominant).

13 Power PC 603 Strategy
Baseline: use the right supply and the right frequency for each part of the system.
If the system has to wait for some input to occur, only a small circuit needs to wait; it wakes up the main circuit when the input occurs.
The PowerPC 603 is a 2-issue processor (2 instructions read at a time) with 5 parallel execution units.
4 modes:
Full-on mode for full speed
Doze mode, in which the execution units are not running
Nap mode, which also stops the bus clocking
Sleep mode, which stops the clock generator with or without the PLL (20-100 mW)

14 Power PC 603 Power Management

15 TI Structures
Two DSPs, the TMS320C541 and TMS320C542, reduce power, chip count, and system cost for wireless communication applications.
C54X DSPs, 2.7 V and 5 V, Low-Power Enhanced Architecture DSP (LEAD) family: with three different power-down modes, these devices are well suited to wireless communications products such as digital cellular phones, personal digital assistants, and wireless modems, and offer low-power voice coding and decoding.
The TMS320LC548 features:
15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
3.0- and 3.3-V operation
32K 16-bit words of RAM and 2K 16-bit words of boot ROM on-chip
Integrated Viterbi accelerator that reduces the Viterbi butterfly update to four instruction cycles for GSM channel decoding
Powerful single-cycle instructions (dual operand, parallel instructions, conditional instructions)

16 InfoPad Architecture, UC-Berkeley
[Figure labels: Internet, Wireless Basestation, “PadServer”, Speech Recognizer, Web Browser, InfoPad]
Example: hand-held speech-enabled web browser.
Transmit audio and raw bitmaps across the wireless link.
Maintain state in the network, not on the Pad.
Perform all computation in the network to minimize client energy dissipation.

17 InfoPad Hardware Flexibility
Main data flow is handled by custom low-power ASICs: the entire packet is routed to dedicated hardware.
Embedded software is responsible for high-level functions: only the packet header is sent to the 10 MIPS microprocessor (control, statistics, reliability, debugging).
[Figure labels: Radio, RX Packet, Packet Header, Frame-buffer update, Frame Buffer]
Use hardware/software integration to provide energy-efficient high-level functionality.

18 Multimedia I/O Terminal

19 Multimedia I/O Terminal

20 InfoPad Evolution
Total power: ~7 W. Where did the power go?
Inefficient implementation
Commercial DC/DC
Commercial radios
No local computation?
[Figure labels: Intercom, InfoPad, Energy-Efficient Processors]
High-level system design optimizes the complete solution and drives new research.

21 Power-Down Techniques

22 Low Power Memory

23 Low Power Video Processor
Uzi Zangi, Technion - VLSI Systems Research Center, 1997
Asynchronous logic to save power. Didn't work because:
Slow design (13.5 MHz) and small circuit (<100K gates): the clock load is small.
Adding asynchronous control costs more than clocking.
Gated clock. Didn't work because:
Frequency is very low (13.5 MHz).
Register activity is very high.
No need for a clock tree.

24 Minimizing bus switching
Transfer either the value or its inverse on the bus, whichever toggles fewer bits.
Add one bit that indicates the polarity of the bus.
Good for buses with:
A large number of bits (more than 10).
High capacitance (more than 2 pF).
High toggle activity (more than 1/2).
Overheads:
Routing of one more bit.
Extra logic for the decision (timing, area).
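
The scheme described on this slide is commonly known as bus-invert coding. A minimal sketch follows, assuming an 8-bit bus that starts at all zeros and a rule that inverts whenever more than half of the data lines would toggle; the function names are hypothetical and the toggle of the polarity line itself is not counted.

```python
def bus_invert_encode(words, width=8):
    """Return (value_on_bus, polarity_bit) pairs for a sequence of words."""
    mask = (1 << width) - 1
    prev = 0                        # assumed initial state of the data lines
    encoded = []
    for w in words:
        toggles = bin((w ^ prev) & mask).count("1")
        if toggles > width // 2:    # cheaper to send the inverted word
            sent, polarity = (~w) & mask, 1
        else:
            sent, polarity = w & mask, 0
        encoded.append((sent, polarity))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width=8):
    """Undo the inversion at the receiver using the polarity bit."""
    mask = (1 << width) - 1
    return [((~sent) & mask) if polarity else sent for sent, polarity in encoded]


def data_line_toggles(values, start=0):
    """Count bit transitions on the data lines for a sequence of bus values."""
    total, prev = 0, start
    for v in values:
        total += bin(v ^ prev).count("1")
        prev = v
    return total


if __name__ == "__main__":
    words = [0x00, 0xFF, 0x0F, 0xF0]
    enc = bus_invert_encode(words)
    print(data_line_toggles(words), data_line_toggles([s for s, _ in enc]))
```

The extra decision logic is exactly the overhead listed above, which is why the technique only pays off on wide, heavily loaded, high-activity buses.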

25 Minimizing bus switching (Cont.)
Didn't work because:
Largest bus is 8 bits.
Capacitance is less than 1 pF.
Toggle activity is not very high.

26 Power Reduction in InfoPad

27 Power Management by Gated Clock
Power management scheme by enabling the clock
Power management scheme by adding a clock generation block

28 Method That Works: Pixel Differentials
Pixel values show spatial locality. This is exploited most heavily in compression (to save on storage and transmission).
Most of the functions are linear and can therefore work on differences.
The entire algorithm was rewritten (interpolations, filters, matrices, etc.).
The new algorithm differs from the original by no more than 1 LSB per pixel.

29 Methodology
[Figure: verification and power-estimation flow. Labels: Algorithm, C++ Simulator, Image; RTL, Verilog Simulator, Image; Compare; Synopsys, 0.35 Lib, Netlist; Compass P&R, Cadence Opus, Spice Netlist; Epic Powermill, Currents, power]

30 Pixel Difference

31 Pixel Differentials Algorithm Results

32 Summary
Attempted to save power on a battery-operated chip with application-specific algorithmic and architectural techniques: asynchronous logic, gated clock, minimizing bus switching. All attempts failed.
These methods may still apply to very large, very fast chips and to variable-load applications.
Successfully applied an algorithmic change inspired by image compression. It may not work on non-compressible data, but it works exceptionally well on images.
Easily saved 80% of the power; potentially can save more than 90%.

33 A SINGLE-CHIP DIGITAL CAMERA
H. Teresa H. Meng, “Low-Power Wireless Video System”, IEEE Communications Magazine, June 1998
Given the recent developments in CMOS RF transceiver design, wireless transmission at a bandwidth in excess of 10 Mb/s will soon become possible using next-generation CMOS technology.
The paper presents the design of a low-power, large-scale parallel MPEG-2 encoder architecture to be used in a single-chip digital CMOS video camera.
The single-chip digital camera architecture includes a 640 x 480 array of CMOS photodiodes, embedded DRAM for storing four frames of color data, and a parallel array processor for video signal processing.
The parallel processor architecture is designed to implement highly computationally intensive image and video processing tasks such as color conversion, discrete cosine transform (DCT), and motion estimation for MPEG-2.

34 A SINGLE-CHIP DIGITAL CAMERA

35 A SINGLE-CHIP DIGITAL CAMERA
Energy per operation at a 1.5 V supply in 0.8 µm CMOS technology

36 A SINGLE-CHIP DIGITAL CAMERA
Design Consideration
The proposed architecture considers three algorithms commonly used in video coding standards: RGB (red-green-blue) to YUV (luminance-chrominance) conversion, discrete cosine transform (DCT), and motion estimation.
To reduce power consumption, as many parallel processors as practically feasible should be used to reduce the clock frequency, because a reduced clock frequency permits a lower supply voltage.
For MPEG-2 encoding, the computational demand of motion estimation (1.6 BOPS for 30 frames/s, based on the algorithm proposed by Chalidabhongse and Kuo) limits the number of columns in each processor domain to 16; otherwise the required clock speed for each processor would be too high for a low-power design.

37 A SINGLE-CHIP DIGITAL CAMERA
PERFORMANCE
To sustain this computational demand, each processor is required to run at a clock frequency equal to or higher than 40 MHz.
When implemented in a 0.2 µm CMOS technology, a 1 V supply voltage should be more than enough to support 40 MHz operation.
Under these conditions, this parallel processor architecture delivers a processing throughput of 1.6 BOPS with a power consumption of 40 mW.
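
The figures on the last two slides fit together with a little arithmetic. The sketch below is a back-of-the-envelope check, assuming one processor per 16-column domain of the 640-column sensor and one operation per processor per clock cycle; neither assumption is stated explicitly on the slides, and the variable names are illustrative.

```python
# Numbers quoted on the slides
SENSOR_COLUMNS = 640            # 640 x 480 photodiode array
COLUMNS_PER_PROCESSOR = 16      # columns in each processor domain
REQUIRED_OPS_PER_SECOND = 1.6e9 # 1.6 BOPS for motion estimation
TOTAL_POWER_WATTS = 40e-3       # 40 mW

# Assumption: one processor per domain
processors = SENSOR_COLUMNS // COLUMNS_PER_PROCESSOR

# Assumption: each processor completes one operation per clock cycle
clock_hz = REQUIRED_OPS_PER_SECOND / processors

# Average energy spent per operation across the array
energy_per_op_joules = TOTAL_POWER_WATTS / REQUIRED_OPS_PER_SECOND

if __name__ == "__main__":
    print(f"processors: {processors}")
    print(f"clock per processor: {clock_hz / 1e6:.0f} MHz")
    print(f"energy per operation: {energy_per_op_joules * 1e12:.0f} pJ")
```

Under these assumptions the array needs 40 processors at 40 MHz, matching the clock rate given on the slide, at roughly 25 pJ per operation.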

38 Vector Quantization
A lossy compression technique that exploits the correlation between neighboring samples and quantizes samples together.

39 Complexity of VQ Encoding
The distortion metric between an input vector X and a codebook vector C_i is computed as follows:
D_i = \sum_{j=1}^{16} (x_j - c_{i,j})^2
Three VQ encoding algorithms will be evaluated: full search, tree search, and differential codebook tree search.

40 Full Search
Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent to the decoder.
Each distortion computation takes 16 8-bit memory accesses (to fetch the entries of the codeword), 16 subtractions, 16 multiplications, and 15 additions.
In addition, the minimum of the 256 distortion values must be determined, which involves 255 comparison operations.
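
The operation counts above follow directly from the structure of the search. Here is a minimal sketch, assuming a squared-error distortion over 16-element vectors (which matches the 16 subtractions, 16 multiplications, and 15 additions quoted); the function names are hypothetical.

```python
def distortion(x, c):
    """Sum of squared differences between input vector x and codeword c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def full_search(x, codebook):
    """Return the index of the codeword with minimum distortion.

    For N codewords this performs N distortion computations and
    N - 1 comparisons (256 and 255 for the codebook on the slide).
    """
    best_index = 0
    best = distortion(x, codebook[0])
    for i in range(1, len(codebook)):
        d = distortion(x, codebook[i])
        if d < best:
            best_index, best = i, d
    return best_index


if __name__ == "__main__":
    # Toy codebook: codeword i is the constant vector [i, i, ..., i]
    codebook = [[i] * 16 for i in range(256)]
    print(full_search([100] * 16, codebook))
```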

41 Tree-structured Vector Quantization
If, for example, the input vector is closer to the left entry at level 1, then the right portion of the tree is never compared below level 2 and an index bit 0 is transmitted.
Here only 2 x log2 256 = 16 distortion calculations with 8 comparisons are needed.
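
The descent through the tree can be sketched as follows. This is a minimal sketch, assuming a complete binary tree stored in an array where node n has children 2n+1 and 2n+2 and every non-root node holds a code vector; that storage layout and the function names are assumptions made for illustration.

```python
def distortion(x, c):
    """Sum of squared differences between input vector x and code vector c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def tree_search(x, nodes, depth):
    """Walk from the root to a leaf, emitting one index bit per level.

    Each level costs two distortion computations and one comparison,
    so a depth-8 tree (256 leaves) needs 16 and 8 respectively.
    """
    node = 0
    index = 0
    for _ in range(depth):
        left, right = 2 * node + 1, 2 * node + 2
        go_right = distortion(x, nodes[right]) < distortion(x, nodes[left])
        index = (index << 1) | int(go_right)   # 0 = left branch, 1 = right branch
        node = right if go_right else left
    return index


if __name__ == "__main__":
    # Toy depth-2 tree over 1-element vectors; the root holds no vector.
    nodes = [None, [2.0], [6.0], [1.0], [3.0], [5.0], [7.0]]
    print(tree_search([5.2], nodes, depth=2))
```

The saving comes at a price: the leaf reached by the greedy descent is not guaranteed to be the globally closest codeword.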

42 Differential Codebook Tree-structured Vector Quantization
The distortion difference between the left and right nodes needs to be computed.
This equation can be manipulated to reduce the number of operations.
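
One plausible form of that manipulation is sketched below; it is an assumption, since the slide does not show the equation. Expanding the squared-error distortions of the left node L and right node R gives D_L - D_R = Σ (l_j² - r_j²) + Σ x_j · 2(r_j - l_j), so the constant term and the vector 2(R - L) can be stored in the codebook ahead of time. The function names are hypothetical.

```python
def precompute_node(left, right):
    """Offline step: fold both child codewords into a constant and a delta vector."""
    const = sum(l * l - r * r for l, r in zip(left, right))
    delta = [2 * (r - l) for l, r in zip(left, right)]
    return const, delta


def distortion_difference(x, const, delta):
    """Online step: D_left - D_right using one multiply-add per element.

    A negative result means the left child is closer to x.
    """
    return const + sum(xj * dj for xj, dj in zip(x, delta))


def distortion(x, c):
    """Reference squared-error distortion, used only to check the identity."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


if __name__ == "__main__":
    left, right, x = [1, 2], [3, 5], [2, 2]
    const, delta = precompute_node(left, right)
    print(distortion_difference(x, const, delta))
    print(distortion(x, left) - distortion(x, right))
```

Per tree level this replaces two full distortion computations with a single dot product plus one stored constant, which reduces both arithmetic and memory accesses.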

43 Comparisons
The number of memory access operations can be reduced. That is, by changing the contents of the codebook through computational transformations, the number of switching events (multiplications, additions/subtractions, and memory accesses) can be reduced.

44 Multiplication with Constants
Techniques and tools have been developed to scale coefficients so as to minimize the number of 1's in the coefficients, and thereby the number of shift-add operations.
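
The link between the number of 1's and the number of shift-add operations is easy to see in code. The following is a minimal sketch for non-negative integer coefficients; the function name is hypothetical and the coefficient-scaling step itself is not shown.

```python
def multiply_by_constant(x, coeff):
    """Multiply x by a non-negative constant using shifts and adds only.

    Returns (product, number_of_additions); one addition is performed
    for every 1 bit in the binary representation of coeff.
    """
    result = 0
    shift = 0
    additions = 0
    while coeff:
        if coeff & 1:
            result += x << shift
            additions += 1
        coeff >>= 1
        shift += 1
    return result, additions


if __name__ == "__main__":
    # 0b1111 needs four shift-adds; the nearby value 0b10000 needs only one.
    print(multiply_by_constant(5, 0b1111))
    print(multiply_by_constant(5, 0b10000))
```

Scaling a filter so that its coefficients land on values with few 1 bits therefore cuts the adder activity directly, at the cost of a small change in gain.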

45 Gated clocks to shut down modules when not used.

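46 Lower Power Data Encoding
S.S.Chun and J.D.Cho’97
A method that restructures the Huffman code to reduce the number of switching events while preserving the compression ratio produced by the Huffman coding algorithm.
For substreams that share many common subsequences, an encoding with few transitions, such as Gray code, is adopted.
Among RISC instruction addressing schemes, Gray-code addressing reduces power by up to 50% compared with binary-code addressing.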

47 Gray Code
Define the Hamming distance between two n-dimensional (n-bit) vectors U = u_1, u_2, … , u_n and V = v_1, v_2, … , v_n as h(U,V) = \sum_{i=1}^{n} (u_i \oplus v_i). Here (u_i \oplus v_i) is 1 if the bits u_i and v_i differ and 0 otherwise.
This can also be expressed as the distance traveled along the edges of an n-dimensional hypercube G.
Gray code = shortest path in G
Huffman codewords can have different lengths and must remain prefix-free, so an exact conversion to Gray code is impossible; a compression code that minimizes the number of bit transitions is needed instead.
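
The effect of Gray coding on sequential addresses can be checked with a short experiment. This is a minimal sketch using the standard reflected binary Gray code, `n ^ (n >> 1)`; the 4-bit address range and the function names are chosen for illustration only.

```python
def to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def hamming(u, v):
    """Hamming distance h(U, V): number of bit positions where u and v differ."""
    return bin(u ^ v).count("1")


def transitions(sequence):
    """Total number of bit transitions when the values are sent in order."""
    return sum(hamming(a, b) for a, b in zip(sequence, sequence[1:]))


if __name__ == "__main__":
    binary_addresses = list(range(16))
    gray_addresses = [to_gray(n) for n in binary_addresses]
    print("binary:", transitions(binary_addresses))
    print("gray:  ", transitions(gray_addresses))
```

Consecutive Gray codewords always differ in exactly one bit, whereas a binary counter averages close to two toggles per increment, which is consistent with the savings of up to 50% quoted for sequential instruction addressing.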

48 2-D Traveling Salesman Problem
The proposed problem assigns code pairs with a small Hamming distance to character pairs that are frequently adjacent, so it is formulated as a new problem in which two or more TSPs are handled simultaneously.
Using a heuristic: 10% reduction in switching activity for random uncorrelated data

49 Lower Power CDMA Searcher
S. Kim and J.D.Cho, SungKyunKwan Univ.

50 Searcher (Using a Common Double Dwell Method)
The initial synchronization acquisition process that achieves accurate PN code synchronization between transmitter and receiver in a CDMA system.

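51 Operation Flow
The pilot channel transmitted by the base station is despread with the PN code sequence generated in the mobile station.
The despread result is accumulated Nc times (coherent accumulation) and then passed through the energy computation (squaring).
The energy value is compared with the first threshold (θ1); if it exceeds the threshold, non-coherent accumulation (Nn) is performed in the following stage.
Otherwise, the PN code sequence is advanced by one chip and the above process is repeated on the incoming signal.
The result of the non-coherent accumulation is compared with the second threshold (θ2).
If it exceeds θ2, the search ends; otherwise, the PN code sequence is advanced by one chip and the above process is repeated.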
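
The control flow of this double-dwell search can be sketched as follows. It is a minimal sketch under stated assumptions: `energy_at` is a hypothetical callback that despreads at a given PN offset, accumulates coherently over Nc samples, and returns the energy; `theta1` and `theta2` stand for the two thresholds; and the real searcher works on a continuous sample stream rather than a callable.

```python
def coherent_energy(i_samples, q_samples):
    """Coherent accumulation of despread I/Q samples followed by squaring."""
    i_sum = sum(i_samples)
    q_sum = sum(q_samples)
    return i_sum * i_sum + q_sum * q_sum


def double_dwell_search(energy_at, num_offsets, theta1, theta2, nn):
    """Return the first PN offset that passes both dwells, or None.

    First dwell:  one energy measurement compared with theta1.
    Second dwell: nn energy measurements summed (non-coherent
                  accumulation) and compared with theta2.
    """
    for offset in range(num_offsets):
        if energy_at(offset) <= theta1:
            continue                      # reject quickly, move to the next chip offset
        total = sum(energy_at(offset) for _ in range(nn))
        if total > theta2:
            return offset                 # acquisition declared
    return None


if __name__ == "__main__":
    energies = {0: 1.0, 1: 5.0, 2: 9.0}   # toy, noise-free energy per offset
    print(double_dwell_search(lambda o: energies[o], 3, theta1=4.0, theta2=20.0, nn=3))
```

The cheap first dwell rejects most wrong offsets, so the more expensive second dwell runs only on promising candidates.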

52 Data Flow Graph of Searcher Operation
Coherent accumulation stage: 4 additions
Energy computation stage: 2 multiplications

53 Rescheduled Data Flow Graph
Coherent accumulation stage: use a carry-save adder (or a 3-input ALU)
Threshold comparison: apply pre-computation
Energy computation stage: reorder the data flow to reduce the number of multiplications

54 Pre-computation
Power saving:
Reduces power dissipation of combinational logic
Reduces internal power to precomputed registers
Cost:
Increases area
Impacts circuit timing
Increases design complexity (number of bits to precompute)
Testability: may generate redundant logic

55 Pre-computation
Precomputation for external idleness: M. Alidina, 1994
A comparator example: Srinivas Devadas, 1994

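56 Low Power Comparator
The MSBs of YI and YQ are the sign bits of absolute values, so both are '0'.
Pre-computation is performed on the upper 2 bits, excluding the MSB.
The larger of |YI| and |YQ| is selected according to the pre-computation result.
For the comparison with the threshold θ1, a multiplexer is used instead of a comparator.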
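
The idea of deciding from a few upper bits can be illustrated in software. This is a behavioural sketch only, assuming 8-bit magnitudes whose MSB is always 0; the word width and the function name are assumptions, and the real design gates the low-order logic in hardware rather than branching.

```python
def select_larger(yi_abs, yq_abs, width=8):
    """Pick the larger magnitude, looking at the low bits only when needed.

    Returns (larger_value, full_compare_used). Because the MSB of both
    inputs is 0, the top three bits carry the same information as the
    two bits below the MSB.
    """
    shift = width - 3                       # keep MSB plus the next two bits
    hi_i = yi_abs >> shift
    hi_q = yq_abs >> shift
    if hi_i != hi_q:
        # Pre-computation decides; the low-order comparator can stay idle.
        return (yi_abs if hi_i > hi_q else yq_abs), False
    # Upper bits tie, so the remaining bits must be compared as well.
    return max(yi_abs, yq_abs), True


if __name__ == "__main__":
    print(select_larger(0b01100000, 0b00100001))
    print(select_larger(0b00100101, 0b00100011))
```

Whenever the upper bits differ, the inputs to the low-order comparison logic do not need to switch, which is where the power saving comes from.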

57 Three Input ALU (Ovadia Bat-Sheva, 1998)
The three-input ALU (3IALU) consumes much less power than an ALU and an ASU.
A drawback of using a 3IALU is the added complexity in calculating the carry and overflow.

58 Experimental Results and Conclusion
RTL-level low-power design of the searcher engine of an MSM (Mobile Station Modem) chip for mobile stations in IS-95-based DS/CDMA systems.
Operating frequency: 12.5 MHz
Applying rescheduling, pre-computation, and strength reduction on the data flow graph reduced area and power by up to 67.68% and 41.35%, respectively.

59 Lower Power Viterbi Decoder
J.H. Ryu and J.D.Cho, SungKyunKwan Univ.

60 Viterbi Decoder
Convolutional Encoder
K = 3 (Constraint Length)
R = 1/2 (Rate)

61 Viterbi Decoder
Information sequence: U = (0,0,1,0,1,0,...)
Output codeword: V = (00,00,11,10,00,10,...)
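
The example sequence can be reproduced with a small encoder model. This is a minimal sketch, assuming the generator polynomials 1 + D + D² and 1 + D² (octal 7 and 5); the slides do not state the generators, but this common K = 3, R = 1/2 pair is consistent with the information sequence and codeword shown. The function name is hypothetical.

```python
def conv_encode(bits):
    """Rate-1/2, constraint-length-3 convolutional encoder.

    Assumed generators: v1 = u + s1 + s2 and v2 = u + s2 (mod 2),
    where s1 is the previous input bit and s2 the one before that.
    """
    s1 = s2 = 0                     # shift register starts in the all-zero state
    codeword = []
    for u in bits:
        v1 = u ^ s1 ^ s2
        v2 = u ^ s2
        codeword.append(f"{v1}{v2}")
        s1, s2 = u, s1              # shift the register
    return codeword


if __name__ == "__main__":
    print(conv_encode([0, 0, 1, 0, 1, 0]))
```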

62 Viterbi Decoder

63 Viterbi Decoder
Branch Metric Unit (BMU): the branch metrics measure the difference between the received symbol and the symbol that causes the transition between states in the trellis.
Add-Compare-Select Unit (ACSU): to find the survivor path entering each state, the branch metric of a given transition is added to its corresponding partial path metric (PM) stored in the path metric memory (PMM). This new partial path metric is compared with the new partial path metrics of all the other transitions entering that state. The transition with the minimum partial path metric is chosen as the survivor path of the state. The path metric of the survivor path of each state is updated and stored back into the PMM.
Survivor Memory Unit (SMU): the survivor paths are stored in the SMU. A traceback mechanism is applied to the SMU during the decoding stage to output the decoded data.
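
The add-compare-select step for a single state can be sketched as follows. This is a minimal sketch of the operation described above, with hypothetical names and a list standing in for the path metric memory; ties are resolved in favour of the first transition, which is an arbitrary choice.

```python
def add_compare_select(path_metrics, predecessors, branch_metrics):
    """One ACS operation for a single trellis state.

    path_metrics   : current partial path metric of every state (the PMM)
    predecessors   : states that have a transition into this state
    branch_metrics : branch metric of each of those transitions
    Returns (new_path_metric, decision), where decision is the position
    in `predecessors` of the surviving transition.
    """
    # Add: extend every incoming path by its branch metric.
    candidates = [path_metrics[p] + bm for p, bm in zip(predecessors, branch_metrics)]
    # Compare and select: keep the path with the minimum metric.
    decision = min(range(len(candidates)), key=candidates.__getitem__)
    return candidates[decision], decision


if __name__ == "__main__":
    pmm = [3, 1, 4, 2]
    print(add_compare_select(pmm, predecessors=(0, 2), branch_metrics=(2, 0)))
```

The returned decision bit is what the survivor memory unit stores for the later traceback.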

64 Viterbi Decoder
Low power ACSU VLSI architecture
Conventional ACSU VLSI architecture: butterfly structure

65 Viterbi Decoder
Architecture of conventional ACSU

66 Viterbi Decoder [SKKU. Solution]
Algorithm
The area and power of the lower power ACSU design are reduced by 20% and 30%, respectively, compared with the conventional ACSU design.

67 Viterbi Decoder [SKKU. Solution]
Low power ACSU VLSI architecture [C-Y Tsui, ISLPED’99]

68 Viterbi Decoder [SKKU. Solution]
Glitch minimization [Raghunathan, DAC’96]
(a) Lower power ACSU architecture
(b) Conventional ACSU architecture
The power consumption of architecture (a) is larger than that of architecture (b) by more than 17% because of glitch power dissipation.

69 Viterbi Decoder [SKKU. Solution]
Glitches in control logic

70 Viterbi Decoder
Low power traceback VLSI architecture
Systolic Viterbi traceback decoder [J. Sparso’91]

71 Viterbi Decoder
Received codeword: V = (00,00,11,10,00,10,...)

72 Viterbi Decoder

73 Viterbi Decoder

74 Viterbi Decoder

75 Viterbi Decoder
Problems with the systolic array decoder
The systolic array Viterbi decoder takes the decision vector and the smallest path metric from the ACSU as input and outputs the decoded bit by shifting every register on every cycle.
This system has high dynamic power consumption because of the switching activity of the registers, which accounts for almost 80% of the total power consumption, since all data in the TBU shifts on every cycle.

76 Viterbi Decoder [SKKU. Solution]
Our low power trace-back unit

77 Viterbi Decoder [SKKU. Solution]

78 Viterbi Decoder [SKKU. Solution]

79 Viterbi Decoder [SKKU. Solution]
After the decision vector and the smallest path metric generated by the ACSU are transferred to the Control Block (CB), the CB outputs them in the right cycle using a counter and a multiplexer.
The register array, which stores the trace-back values from the CB, produces the final decoded bit not by shifting the whole upper 4-bit decision vector as in the classical TBU, but by shifting only the lower 2 bits, which hold the smallest path metric, to the left.

80 Viterbi Decoder [SKKU. Solution]
Experimental Result (area 11%, power 40%)

81 Viterbi Decoder [Stanford Solution]
Low Power Asynchronous Viterbi Decoder [Y.h.Lee, Stanford]
Algorithm

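82 Viterbi Decoder [Stanford Solution]
Initialization: trace back through a trellis five times the constraint length and store that path.
Loop:
A. Trace and compare: select an arbitrary initial state and start the trace back. At the same time, while following the route, compare it with the stored route at each node.
B. If the compared values are equal, stop tracing and discard the stored route. If they are not equal, repeat step A.
Repeat step ② for each input signal.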

83 Viterbi Decoder [Stanford Solution]
Implementation
Self-timed TBU block diagram

84 Viterbi Decoder
① The self-timed TBU consumes no power while it waits for a request signal.
② The ACS issues a request signal in order to dump the state decision data.
③ The TBU reads the previous surviving path memory and the previous path memory and compares them.
④ If they are not equal, the TBU updates the previous path memory, performs self-precharging and self-requesting, and then repeats step ③. If they are equal, it goes to step ⑤.
⑤ The TBU sends an acknowledgement signal to the ACS and self-precharges for the next ACS request signal.


