1
Software optimization for embedded systems
Prof. Koen De Bosschere
Ghent University
2
Optimizations for power Fifth Lecture
3
April 30, 2004, Francqui Chair lecture 5: optimizations for power
Power vs. Energy
[Figure: power P(t) plotted against time t; the energy E is the area under the curve]
$E = \int P(t)\,dt$
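The relation between power and energy can be sketched numerically. The fragment below is a minimal illustration, assuming power samples taken at a fixed interval; the function name `energy_joules` and the use of the trapezoidal rule are choices made here, not part of the lecture.

```c
#include <stddef.h>

/* Approximate E = integral of P(t) dt with the trapezoidal rule.
 * power_w: n power samples in watts, taken every dt_s seconds (assumed uniform).
 * Returns the energy in joules; 0 if fewer than two samples are given. */
double energy_joules(const double *power_w, size_t n, double dt_s)
{
    double energy = 0.0;
    for (size_t i = 1; i < n; i++) {
        /* area of the trapezoid between two consecutive samples */
        energy += 0.5 * (power_w[i - 1] + power_w[i]) * dt_s;
    }
    return energy;
}
```

A constant 1 W sampled three times at 2 s intervals covers 4 s and therefore gives 4 J.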
4
Low Power <> Low Energy
Minimizing the power consumption is important for
– the design of the power supply
– the design of voltage regulators
– the dimensioning of interconnects
– the design of the cooling system
Minimizing the energy consumption is important because of
– restricted availability of energy (mobile systems)
  limited battery capacities (only slowly improving)
  very high costs of energy (solar panels, space equipment)
– cooling
  high costs
  limited space
– dependability: long lifetimes, low temperatures
– electricity bill
5
Battery capacity growth limited to 2-3% per year
[Figure: improvement compared to year 0 (1x to 16x) over time (0 to 6 years) for processor performance, hard disk capacity, memory capacity, and battery energy stored. Source: J. Rabaey]
6
Battery Characteristics
Type            Wh/litre   Rechargeable
Alkaline MnO2   347        No
Silver Oxide    500        No
Li/MnO2         550        No
Zinc Air        1150       No
NiCd            125        Yes
Li-Polymer      300-415    Yes
7
Power dissipation
[Figure: an electronic device supplied from V_dd, giving off heat]
8
Power dissipation in digital logic
Three main components to power dissipation
– Dynamic power consumption
  charging and discharging capacitors
– Short circuit currents
  short circuit path between supply rails during switching
– Leakage current from reverse biased junctions
[Figure: CMOS inverter with input V_in, output V_out, supply V_dd and load capacitance C_L]
9
An example
10
Power dissipation in digital logic
$P = \tfrac{1}{2} C V^2 f A + I_{sc} V + I_{leak} V$
where
A = activity factor (probability of a 0→1 transition)
C = total chip capacitance
V = supply voltage
f = clock frequency
I_sc = short circuit current when the logic level changes
I_leak = leakage current in diodes and transistors
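The power formula translates directly into code. The sketch below only restates the equation from the slide; the function and parameter names are hypothetical, and SI units are assumed.

```c
/* Dynamic (switching) power: 1/2 * C * V^2 * f * A, in watts. */
double dynamic_power(double c_farad, double v_volt, double f_hz, double activity)
{
    return 0.5 * c_farad * v_volt * v_volt * f_hz * activity;
}

/* Total power: dynamic part plus the short-circuit and leakage terms. */
double total_power(double c_farad, double v_volt, double f_hz, double activity,
                   double i_sc_amp, double i_leak_amp)
{
    return dynamic_power(c_farad, v_volt, f_hz, activity)
         + i_sc_amp * v_volt
         + i_leak_amp * v_volt;
}
```

Halving `v_volt` while keeping the other arguments fixed divides the dynamic term by four, which is the quadratic dependence on the supply voltage.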
11
Leakage power
[Figure: two panels, 180 nm at 1.4 V and 90 nm at 0.7 V, showing active power and leakage power (Watt, 0 to 90) against temperature (30 to 110)]
12
How to reduce power dissipation?
$P = \tfrac{1}{2} C V^2 f A + I_{sc} V + I_{leak} V \approx \tfrac{1}{2} C V^2 f A$
V ↓ : reduce the supply voltage
f ↓ : reduce the frequency
A ↓ : reduce the activity factor
C ↓ : depends on the technology
13
Reduce the supply voltage
Reduces the power consumption quadratically
But: also increases gate delay, and hence reduces frequency
$P = \tfrac{1}{2} C V^2 f A + I_{sc} V + I_{leak} V$
When the frequency scales with V, $P \sim V^3$, so performance $\sim \sqrt[3]{P}$ (“cube root formula”)
14
Reduce the frequency
Reduces the power consumption linearly
But also reduces the performance linearly!
15
Reduce the activity factor
Reduce the number of times the transistor switches
Not an easy task, savings are limited
16
Metrics for Power
Absolute power (W)
– average energy consumed per time unit
– sets battery life in hours
– problem: power ~ frequency
Power can be lowered by lowering the clock frequency
17
Energy per operation
Joule/op = (Joule/s) / (op/s) = Watt/ips
More familiar: Watt/MIPS = power x time
1990-2004: 100 mW/MIPS → 1 mW/MIPS
Most optimistic projections stop at 60 μW/MIPS
18
Other power metrics
PDP (power delay product) = power x time = Watt/MIPS ~ Watt/Spec [for power constrained systems]
PD²P = EDP (energy delay product) = Watt/MIPS² ~ Watt/Spec²
PD³P = ED²P = Watt/MIPS³ ~ Watt/Spec³ [for high-end systems]
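The three metrics differ only in how heavily they weight the delay. A minimal sketch, assuming the power (in watts) and the execution time (in seconds) of one run are known; the type and function names are hypothetical:

```c
typedef struct {
    double pdp;   /* power x delay = energy */
    double edp;   /* energy x delay = PD^2P */
    double ed2p;  /* energy x delay^2 = PD^3P */
} power_metrics;

/* Compute the three metrics for one run of a program. */
power_metrics compute_metrics(double power_w, double delay_s)
{
    power_metrics m;
    m.pdp  = power_w * delay_s;
    m.edp  = m.pdp * delay_s;
    m.ed2p = m.edp * delay_s;
    return m;
}
```

Comparing two designs on `ed2p` rather than `pdp` favours the faster one more strongly, which matches the slide's split between power constrained and high-end systems.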
19
Example: Power Consumption for a Computer with Wireless NIC
Display        36%
CPU/Memory     21%
Wireless LAN   18%
Hard Drive     18%
Other           7%
20
Where does the power go?
[Figure: power breakdown of the Hitachi SH2]
21
How to measure power?
Simulation
– Wattch (Princeton University)
– SimplePower (Penn State University)
– PowerTimer (IBM)
Models
– e.g. CACTI (caches)
Actual measurements
22
Overview
Hardware methods
– Clock gating
– Hardware reconfiguration
– Dynamic voltage and frequency scaling
– Transmeta’s Crusoe
– Application specific hardware
Software methods
23
Clock Gating
Gate off clock to idle functional units
– e.g., floating point units
– needs logic to generate disable signal
  increases complexity of control logic
  consumes power
E.g. Pentium, Alpha
[Figure: functional unit between two registers (Reg) whose clock is gated by a disable signal]
24
Hardware reconfiguration
When the hardware resources can be reduced (e.g. half the number of registers, half of the cache), we could reconfigure the hardware
Switch off the branch predictor when performance is less critical (less speculative execution)
Pipeline gating: stop fetching instructions if there are too many low-confidence speculative instructions in the pipeline
25
Dynamic frequency and voltage scaling
Reduce frequency/voltage to bring down the power until the desired performance level is reached.
[Figure: energy factor (0.0 to 1.0) plotted against performance (0 to 12), ranging from low voltage/low frequency to high voltage/high frequency]
26
Typical Processor Workload
[Figure: processor utilization (%) over time (s) for a dialup server, a workstation and a fileserver]
27
Real utilization trace
[Figure: processor utilization (up to 100%) over time]
Interactive (Acrobat Reader), Producer (MP3 playback), and Consumer (esd sound daemon) episodes.
28
Episode classification for power reduction
Faster is not always better.
Fundamental limit to what is perceptible to humans.
– Movies: 20-30 frames per second.
– Perceptual causality: 50ms-100ms.
– Dragging objects on screen: 200ms.
– Non-continuous operation: 1-2sec.
Periodic activity determines necessary performance for real-time tasks.
Response time: the time it takes for the computer to respond to user initiated events.
The goal is to run fast enough to meet the perception threshold; there is no point in running any faster.
29
Shutdown or slowdown?
Fixed supply: active, then idle for the rest of the frame T_frame: $E_{fixed} \sim C V_{dd}^2$
Variable supply: active during the whole frame T_frame: $E_{var} \sim C (V_{dd}/2)^2 = \tfrac{1}{4} E_{fixed}$
[Figure: normalized power (0 to 1.0) against normalized workload (0 to 1.0) for shutdown and for slowdown]
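The comparison between shutting down and slowing down can be sketched as follows. This is an idealized model, not the measured curve from the slide: it assumes that idle power is zero, that leakage is ignored, and that the supply voltage can be scaled in proportion to the frequency. All function names are hypothetical.

```c
/* Normalized average power when running at full speed and shutting down
 * for the idle part of the frame (idle power assumed to be zero). */
double shutdown_power(double workload)
{
    return workload;
}

/* Normalized average power when slowing down so the work just fills the
 * frame. Assumes frequency and voltage both scale with the workload,
 * so power ~ V^2 * f ~ workload^3. */
double slowdown_power(double workload)
{
    return workload * workload * workload;
}

/* Energy per operation relative to full voltage: ~ V^2. */
double energy_per_op(double voltage_scale)
{
    return voltage_scale * voltage_scale;
}
```

At half the supply voltage, `energy_per_op(0.5)` gives 0.25, the factor of one quarter stated on the slide.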
30
Small performance reduction = big energy savings
20% performance reduction = 32% energy reduction
40% performance reduction = 55% energy reduction
[Source: ARM]
31
Power Consumption for Compaq WRL’s Itsy Computer
System power < 1W
– doing nothing (processor 95% idle)
  107 mW @ 206 MHz
  77 mW @ 59 MHz
  62 mW @ 59 MHz, low voltage
– MPEG-1 with audio
  850 mW @ 206 MHz (16% idle)
– dictation
  775 mW @ 206 MHz (< 0.5% idle)
– text-to-speech
  420 mW @ 206 MHz (53% idle)
  365 mW @ 74 MHz, low voltage (< 0.5% idle)
Processor: 200 mW
– 42-50% of typical total
LCD: 30-38 mW
– 15% of typical total
Itsy v1
– StrongARM 1100, 59–206 MHz (300 μs to switch)
– 2 core voltages (1.5V, 1.23V)
– 64M DRAM / 32M FLASH
– touchscreen & 320x200 LCD
– codec, microphone & speaker
– serial, IrDA
32
Dynamic Frequency and Voltage Scaling
Intel’s SpeedStep
Transmeta’s Longrun
AMD’s PowerNow!
ARM
33
Intel’s SpeedStep
Maximum Performance Mode             | Battery Optimized Mode
Frequency  Voltage  Max. Power       | Frequency  Voltage  Max. Power
600 MHz*   1.35 V   14.4 W           | 500 MHz    1.10 V   8.1 W
600 MHz    1.6 V    20.0 W           | 500 MHz    1.35 V   12.2 W
650 MHz    1.6 V    21.5 W           | 500 MHz    1.35 V   12.2 W
700 MHz    1.6 V    23.0 W           | 550 MHz    1.35 V   13.2 W
750 MHz    1.6 V    24.6 W           | 550 MHz    1.35 V   13.2 W
*Low Voltage Version
Just 2 steps
Support by BIOS, OS, chip set, and voltage regulator
CPU stalls during SpeedStep adjustment
34
Transmeta’s Longrun
DVS + DFS
– from 1.1V to 1.6V in 32 levels
– from 200MHz to 700MHz in steps of 33MHz
Triggered when a CPU load change is detected by the code morphing software
– heavier load → ramp up V_DD, when stable speed up clock
– lighter load → slow down clock, when stable, ramp down V_DD
CPU stalls for only 20 μs
35
AMD’s PowerNow!
8 frequency steps, external VRM
One frequency step is 100 MHz
Claims to extend battery life by up to 30%
Three modes
– high performance mode: the CPU runs at maximum frequency and voltage.
– battery saver mode: the CPU runs at lowest frequency and voltage to maximize system battery life.
– automatic mode: the system monitors application usage and continuously varies operating frequency and voltage to deliver performance on demand while optimizing battery life.
36
ARM
StrongARM series (SA-110, SA-1100, etc.)
– 16 frequency steps, requires external VRM
– Usually only 11 (59-206 MHz, 0.8-1.5 V) used
37
Dynamic Thermal Management
On chip temperature sensors
Choose trigger level to start intervention, check e.g. every 100 000 cycles
Cost reduction because the processor can now be designed for average heat production instead of for peak heat production
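The periodic check against a trigger level can be sketched as below. This is an illustration only: the `dtm_state` type, the `dtm_tick` function and the way the temperature is passed in are assumptions, since the lecture does not describe a concrete interface. The check interval is the example value from the slide.

```c
#define DTM_CHECK_INTERVAL 100000UL  /* cycles between checks (slide example) */

typedef struct {
    double trigger_c;  /* trigger level in degrees Celsius (assumed unit) */
    int throttled;     /* 1 while the intervention (e.g. DVS/DFS) is engaged */
} dtm_state;

/* Called with the current cycle count and sensor reading. The decision is
 * only re-evaluated on multiples of the check interval; in between, the
 * previous decision stays in force. */
void dtm_tick(dtm_state *s, unsigned long cycle, double temp_c)
{
    if (cycle % DTM_CHECK_INTERVAL != 0) {
        return;
    }
    s->throttled = (temp_c >= s->trigger_c);
}
```

A real controller would likely add hysteresis so that the intervention does not toggle on every check; that refinement is left out here.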
38
DTM Savings Benefits
[Figure: temperature over time with DTM disabled and with the DTM response (DVS/DFS) engaged at the DTM trigger level; the cooling capacity needed with DTM is lower than the one needed without DTM, and the difference is the system cost saving]
39
Multiple Voltage Domains
Some parts of the processor are more critical than others
– Use high V_dd for the critical parts
– Use low V_dd for the less critical parts
40
Microsoft’s OnNow
Win32 API extension allows applications to
– affect the power management decision making
– adapt to power state
  find out if running on batteries so as to reduce processing
  discover disk state & postpone low priority I/O, e.g. paging
Based on the Advanced Configuration and Power Interface (ACPI), responsible for thermal management, battery management, system events (e.g. connected to power grid), processor power management, etc.
41
Transmeta’s Crusoe
FPU (Floating Point Unit)
– Has a 10-stage floating point pipeline
– Uses conventional x86 80-bit register format
– 32 FP registers
2 Integer ALUs (Arithmetic-Logic Units)
– Has a 7-stage integer pipeline
– 64 32-bit registers dedicated to it
LSU (Load/Store Unit)
Branch Unit
Efficeon = 8 atoms per molecule, max 1.2 GHz
42
Code morphing example
IA32 code:
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5
After the FRONTEND:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
After the OPTIMIZER:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
After the SCHEDULER:
  ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
43
Thermal comparison
[Figure: thermal images of an Intel Pentium III and a Crusoe TM5400 processor]
Power density not uniform
44
Comparison of power consumption
[Figure: average power (scale 0 to 1.2) for Office 2000, Web Browser, MP3 and DVD workloads on a Mobile Pentium III 500 MHz and a TM5400 500 MHz]
45
Conclusion Transmeta
Pros
– Power efficient (3-4 times better than Pentium III)
– Smaller (only 25% of the transistors of the Pentium III)
– Low power idle mode: 8mW
Cons
– Slower: 40%-50% slower than Pentium III
– Memory consumption (Code Morphing Software = 2 MiB, translation cache = 4-12 MiB)
– Code morphing requires sixfold memory consumption
46
ASIC-FPGA-ASIP-GPP
[Figure: ASIC, FPGA, ASIP and GPP positioned along two axes: flexibility, and speed/efficiency]
ASIC: Application Specific Integrated Circuit
FPGA: Field Programmable Gate Array
ASIP: Application Specific Instruction-set Processor
GPP: General-purpose Processor
Design parameters shown:
– the data-path width
– the number of general registers
– the instruction set
– the data/instruction memory space
47
An Example
Example:
– HDTV chromakey algorithm
– 22,000 lines of C code
Partitioning:
– approx. 15 lines of code synthesized as ASIC
Result here:
– yields 77% energy savings
Code fragment (source line numbers shown on the slide: 21687, 21698, 21715):
if (cb > vtab[cr]) {
  ...
  for (I = cr1; I <= cr2; I++) {
    ihilf = abs(cr - I) + abs(cb - vtab[I]);
    if (ihilf < iabsv) { iabsv = ihilf; }
  }
  iabsv = 512;
  for (I = cr1; I <= cr2; I++) {
    ihilf = abs(cr - I) + abs(cb - htab[I]);
    if (ihilf < iabsh) { iabsh = ihilf; }
  }
  ...
48
Valen-C
Valen-C program: int12 x; int20 y; int24 z;
12-bit processor: x in 1 word, y in 2 words, z in 2 words; total: 60 bits, unused: 4 bits
20-bit processor: x in 1 word, y in 1 word, z in 2 words; total: 80 bits, unused: 24 bits
[Based on slide by & © : H. Yasuura, 2000]
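The totals on this slide follow from rounding each variable up to a whole number of machine words. The sketch below reproduces that arithmetic; it assumes each variable starts on a word boundary, and the function names are hypothetical (this is not the Valen-C compiler's actual allocation code).

```c
/* Number of machine words needed for a variable of var_bits bits. */
unsigned words_needed(unsigned var_bits, unsigned word_bits)
{
    return (var_bits + word_bits - 1) / word_bits;  /* round up */
}

/* Total storage in bits for n variables on a machine with word_bits per
 * word; *unused receives the number of padding bits. */
unsigned storage_bits(const unsigned *var_bits, unsigned n,
                      unsigned word_bits, unsigned *unused)
{
    unsigned total = 0, used = 0;
    for (unsigned i = 0; i < n; i++) {
        total += words_needed(var_bits[i], word_bits) * word_bits;
        used += var_bits[i];
    }
    *unused = total - used;
    return total;
}
```

For the variables of 12, 20 and 24 bits this gives 60 bits (4 unused) on the 12-bit processor and 80 bits (24 unused) on the 20-bit processor, as on the slide.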
49
ASIP Optimization
[Figure: design flow combining software and EDA technology. Software side: application program in Valen-C, Valen-C compiler, machine code. Hardware side: initial ASIP design (1 word = 32 bits), system optimization (optimization of cost, power and performance; modification of design), logic and layout synthesis, optimized ASIP processor design (1 word = 23 bits)]
50
Application: Decimal 12 bit Calculator
Valen-C (400 lines)
[Table: variable sizes and their counts in the program]
[Figure: area, cycles and power (scale 0 to 180) against datapath bitwidth (10 to 34). Source: Yasuura, 2000]
51
Overview
Hardware techniques
Software techniques
– Focus on the most power hungry subsystems
  The processor
  The memory hierarchy
– How
  Make the program faster
  Optimize the code
  Optimize the memory hierarchy
52
Make the program faster
[Figure: power plotted against time t]
$E = \int P(t)\,dt$
Power might increase, but energy consumption will often decrease
53
Optimize the code
Exploit characteristics of the target architecture
Examples of instruction optimizations
– Low-energy instructions
– Conditional execution
– Switch statement versus table lookup
– Register allocation
– Inline assembly
54
Low-energy instructions
Energy as a cost function in instruction selection
Only ~ 10% improvement
[Figure: current (mA, about 38 to 60) per instruction class: ADD & LOGICAL, SUB & COMPARE, SHIFT, MOV, BRANCH, MULTIPLY, LOAD, STORE, PUSH, POP]
55
Data representation
Signal processing algorithms often use floating point data
CPUs usually are much more efficient with integer computation
– e.g. StrongARM emulates floating point in software
– implement a fixed-precision library
Can result in large energy savings, and even performance improvement
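A fixed-precision library replaces floating point operations with integer ones. The following is a minimal sketch using the common Q15 format (16-bit value with 15 fractional bits); the choice of Q15 and all names are assumptions made for illustration, not something prescribed by the lecture.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* value = raw / 32768, representable range [-1, 1) */

/* Convert a real number in [-1, 1) to Q15 (truncating toward zero). */
q15_t q15_from_double(double x)
{
    return (q15_t)(x * 32768.0);
}

/* Convert back, for inspection or testing only. */
double q15_to_double(q15_t x)
{
    return x / 32768.0;
}

/* Multiply two Q15 numbers using integer arithmetic only.
 * Note: -1 x -1 does not fit in Q15; saturation is not handled here. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* Q30 intermediate */
    return (q15_t)(product / 32768);            /* back to Q15, truncating */
}
```

On a processor without a floating point unit, `q15_mul` compiles to an integer multiply and a shift or divide, instead of a call into a software floating point routine.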
56
Reduction of bit switching on address lines
Instruction reordering
Operation/Operand swapping
Choice of don’t care bits
Order i1, i2, i3 (total 13 bit changes):
  i1  00110010100111010100001100100000
  i2  00111010100011011101111100100100   Hamming distance to i1 = 7
  i3  00110010100101010101001100100100   Hamming distance to i2 = 6
Order i1, i3, i2 (total 9 bit changes):
  i1  00110010100111010100001100100000
  i3  00110010100101010101001100100100   Hamming distance to i1 = 3
  i2  00111010100011011101111100100100   Hamming distance to i3 = 6
Highest reduction ever reported: 15% for DSP
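The cost of an ordering is the sum of the Hamming distances between consecutive words. A minimal sketch, assuming 32-bit words; the function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
unsigned hamming32(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;  /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Total number of bit changes when the n words are sent in the given order. */
unsigned bus_transitions(const uint32_t *words, size_t n)
{
    unsigned total = 0;
    for (size_t i = 1; i < n; i++) {
        total += hamming32(words[i - 1], words[i]);
    }
    return total;
}
```

With the three words from the slide written in hexadecimal (i1 = 0x329D4320, i2 = 0x3A8DDF24, i3 = 0x32955324), the order i1, i2, i3 costs 13 bit changes and the order i1, i3, i2 costs 9.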
57
Optimize the memory hierarchy
Current per access on an i486 (more energy required further down the list):
– registers: 300 mA/cycle
– on-chip L1 cache (SRAM): 430 mA/cycle
– main memory (DRAM): 530 mA/cycle
Try to put the data (and code) as close as possible to the processor. The closer it gets, the less power is needed to access it.
58
On-chip vs. off-chip current
Example: Atmel ARM evaluation board (processor, on-chip memory, off-chip memory)
32-bit load instruction (Thumb)
Configuration                    Core + on-chip memory (mA)   Off-chip memory (mA)
Prog off-chip / data off-chip    48.2                         116
Prog off-chip / data on-chip     50.9                         77.2
Prog on-chip / data off-chip     44.4                         82.2
Prog on-chip / data on-chip      53.1                         1.16
59
On-chip vs. off-chip energy
Off-chip access takes more cycles, so the savings (86%) are larger than for the current.
32-bit load instruction (Thumb), energy (chart unit: 10 nJ)
Configuration                    Energy
Prog off-chip / data off-chip    115.8
Prog off-chip / data on-chip     51.6
Prog on-chip / data off-chip     76.5
Prog on-chip / data on-chip      16.4
[Figure: the current chart of the previous slide is repeated for comparison]
60
Exploitation of on-chip memory
Which segment (array, loop, etc.) to be stored in on-chip memory?
Gain $g_i$ and size $s_i$ for each segment i.
Maximise gain $G = \sum g_i$, respecting constraint $K \ge \sum s_i$ (both sums over the segments placed on-chip).
Static memory allocation: solution: knapsack algorithm.
Dynamic reloading: paging theory.
[Figure: processor with on-chip memory of capacity K and off-chip memory; the candidate segments are loops (for, while, repeat), calls and arrays]
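The static allocation problem is a 0/1 knapsack problem. The sketch below uses the textbook dynamic programming solution; it is not the implementation behind the results on the next slide, and the names, the integer size units and the capacity bound `MAX_CAPACITY` are assumptions.

```c
#include <string.h>

#define MAX_CAPACITY 4096  /* assumed upper bound on the on-chip capacity K */

/* Returns the maximum total gain of a set of segments whose sizes sum to
 * at most capacity, or -1 if capacity is out of range.
 * gain[i] and size[i] describe segment i; sizes must be non-negative. */
long best_total_gain(const long *gain, const int *size, int n, int capacity)
{
    long best[MAX_CAPACITY + 1];  /* best[c] = best gain within c units */
    if (capacity < 0 || capacity > MAX_CAPACITY) {
        return -1;
    }
    memset(best, 0, sizeof best);
    for (int i = 0; i < n; i++) {
        /* iterate downwards so each segment is placed at most once */
        for (int c = capacity; c >= size[i]; c--) {
            long candidate = best[c - size[i]] + gain[i];
            if (candidate > best[c]) {
                best[c] = candidate;
            }
        }
    }
    return best[capacity];
}
```

For example, with gains {60, 100, 120}, sizes {10, 20, 30} and K = 50, the best choice is the second and third segment, for a total gain of 220.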
61
Results for knapsack algorithm [Steinke, 2001]
[Figure: Onchip/MemSize and energy saving (scale 0% to 50%) for the benchmarks lattice_init, me_ivlin, heap_sort, selection_sort, insertion_sort, bubble_sort, matrix_mult and biquad_N_sections]
62
Why not just use a cache?
Energy consumption in tags, comparators and muxes is significant.
[Figure: energy per access (nJ, 0 to 9) against memory size (256 to 16384) for a scratch pad and for 2-way caches with 4 GB, 16 MB and 1 MB address space]
63
Let’s help the compiler...
DTSE: data transfer and storage exploration
DTSE is a methodology to explore data-transfer and data-storage in multimedia applications
– Transforms C-code of the application
– Focuses on multi-dimensional signals (arrays)
– Tries to exploit platform capabilities
Supported by tools
Developed at IMEC (Leuven)
[Based on slides by H. Corporaal]
64
DTSE Optimizations
C-in
– Dataflow transformations
– Loop transformations
– Data reuse
– Memory hierarchy layer assignment
– Cycle budget distribution
– Memory allocation and assignment
– Data layout
C-out
65
Data-flow trafo - example
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    gauss_x_image[x][y] = 0;

for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; ++k) {
      gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
    }
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }
}
#accesses: N * M + (N-2) * (M-2)
[Figure: N x M image with its (N-2) x (M-2) interior]
66
Data-flow trafo - example
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k) {
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      gauss_x_image[x][y] = 0;
    }
#accesses: N * M
gain is ± 50 %
[Figure: N x M image with its (N-2) x (M-2) interior]
67
Loop transformations
– improve regularity of accesses
– improve temporal locality: production → consumption
Expected influence
– reduce temporary storage and (anticipated) background storage
Before (storage size N):
for (j=1; j<=M; j++)
  for (i=1; i<=N; i++)
    A[i] = foo(A[i]);
for (i=1; i<=N; i++)
  out[i] = A[i];
After (storage size 1):
for (i=1; i<=N; i++) {
  for (j=1; j<=M; j++) {
    A[i] = foo(A[i]);
  }
  out[i] = A[i];
}
68
Data re-use & memory hierarchy
Original: processor (data paths, register file) connected to main memory M with P = 1 per access; #A = 100 accesses
After: two smaller memories between processor and M: M’ with P = 0.1 and M’’ with P = 0.01; 100 accesses go to M’’, 10 to M’, 1 to M
P (original) = # access x power/access = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
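The comparison on this slide is a weighted sum over the memory levels. A minimal sketch of that calculation; the function name and the array-based interface are assumptions:

```c
#include <stddef.h>

/* Sum of (number of accesses x cost per access) over all memory levels.
 * accesses[i] and cost[i] describe level i; the cost is in the same
 * normalized unit as P on the slide (main memory = 1). */
double access_cost(const unsigned *accesses, const double *cost, size_t levels)
{
    double total = 0.0;
    for (size_t i = 0; i < levels; i++) {
        total += accesses[i] * cost[i];
    }
    return total;
}
```

Called with the accesses {100, 10, 1} and the costs {0.01, 0.1, 1.0}, it returns approximately 3, against 100 for the original single-level organization.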
69
Data re-use
Data flow transformations can introduce extra copies of heavily accessed signals
– Step 1: figure out data re-use possibilities
– Step 2: calculate possible gain
– Step 3: decide on data assignment to memory hierarchy
int A[2][6];
for (h=0; h<N; h++)
  for (i=0; i<2; i++)
    for (j=0; j<3; j++)
      for (k=1; k<7; k++)
        B[j] = A[i][k];
[Figure: array index (6 * i + k) plotted against iterations]
70
Memory allocation and assignment
[Figure: assignment of the arrays image_in, gauss_x, gauss_xy, comp_edge and image_out to three memory layers: L1 = 128 B register file, L2 = 16 KB cache, L3 = 1 MB SDRAM. The figure is annotated with copy sizes (1*1, 3*1, 3*3, M*3, N*M) and access counts (N*M, N*M*3, N*M*8) per layer]
71
Data layout
[Figure: addresses occupied by the arrays A, B, C, D and E over time, illustrating inter in-place and intra in-place mapping]
72
ADOPT principles
Example: Full-search Motion Estimation
Algebraic transformations at word-level
Original:
for (i=-8; i<=8; i++) {
  for (j=-4; j<=3; j++)
    for (k=-4; k<=3; k++) {
      A[((208+i)*257+8+j)*257+16+i+k] = B[(8+j)*257+16+i+k];
    }
  dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];
}
Optimized:
for (i=-8; i<=8; i++) {
  for (j=-4; j<=3; j++) {
    for (k=-4; k<=3; k++) {
      Ad[cse5+cse1] = A[cse5+cse3];
    }}
  dist += A[3096] - Ad[cse1];
}
with
cse1 = (33025*i+6869616)*2;
cse3 = 1040+i;
cse4 = j*257+1032;
cse5 = k+cse4;
73
Cavity detection on Pentium-MMX
[Figure: main memory accesses, local memory accesses and execution time (sec)]
74
Questions?