Presentation is loading. Please wait.

Presentation is loading. Please wait.

Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed Finding the Limits of Hardware Optimization through Software De-optimization Presented By:

Similar presentations


Presentation on theme: "Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed Finding the Limits of Hardware Optimization through Software De-optimization Presented By:"— Presentation transcript:

1 Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed Finding the Limits of Hardware Optimization through Software De-optimization Presented By:

2  Introduction  Project Structure  Judging de-optimizations  What does a de-op look like?  General Areas of Focus  Instruction Fetching and Decoding  Instruction Scheduling  Instruction Type Usage (e.g. Integer vs. FP)  Branch Prediction  Conclusion

3  De-optimization? That's crazy! Why???  In the world of hardware development, when optimizations are compared, the comparisons often concern just how fast a piece of hardware can run an algorithm  Yet, in the world of software development, the hardware is often a distant afterthought  Given this dichotomy, how relevant are these standard analyses and comparisons?

4  So, why not find out how bad it can get?  By de-optimizing software, we can see how bad algorithmic performance can be if hardware isn't considered  At a minimum, we want to be able to answer two questions:  How good of a compiler writer must someone be?  How good of a programmer must someone be?

5  For our research project:  We have been studying instruction fetching/ decoding/ scheduling and branch optimization  We have been using knowledge of optimizations to design and predict de-optimizations  We have been studying the Opteron in detail

6  For our implementation project:  We will choose de-optimizations to implement  We will choose algorithms that may best reflect our de- optimizations  We will implement the de-optimizations  We will report the results

7  We need to decide on an overall metric for comparison  Whether the de-op affects scheduling, caching, branching, etc, its impact will be felt in the clocks needed to execute an algorithm.  So, our metric of choice will be CPU clock cycles

8  With our metric, we can compare de-ops, but should we?  Inevitably, we will ask which de-ops had greater impact, i.e. caused the greatest jump in clocks. So, yes, we should  But this has to be done very carefully since an intended de- op may not be the actual or full cause of a bump in clocks. It could be a side effect caused by the new code combination  Of course, this would be still be some kind of a de-op, just not the intended de-op

9  Definition: A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question  Is an infinite loop a de-op? -- NO Why not? It tells us nothing about the hardware  Is a loop that executes more cycles than necessary a de-op? -- NO Again, it tells us nothing about the CPU  Is a combination of instructions that causes increased branch mispredictions a de-op? -- YES

10  Given some CPU, what aspects can we optimize code for? These aspects will be our focus for de-optimization.  In general, when optimizing software, the following are the areas to focus on:  Instruction Fetching and Decoding  Instruction Scheduling  Instruction Type Usage (e.g. Integer vs. FP)  Branch Prediction  These will be our areas for de-optimization

11  In class, when we discussed dynamic scheduling, for example, our team was not sanguine about being able to truly de-optimize code  In fact, we even imagined that our result may be that CPUs are now generally so good that true de-optimization is very difficult to achieve. In principle, we still believe this  In retrospect, we should have been more wise. Just like Plato’s Forms, there is a significant, if not absolute, difference between something imagined in the abstract and its worldly representation. There can be no perfect circles in the real world  Thus, in practice, as Gita has stressed, CPU designers made choices in their designs that were driven by cost, energy consumption, aesthetics, etc.

12  These choices, when it comes time to write software for a CPU, become idiosyncrasies that must be accounted for when optimizing  For those writing optimal code, they are hassles that one must pay attention to  For our project team, these idiosyncrasies are potential "gold mines" for de-optimization  In fact, the AMD Opteron (K10 architecture) exhibits a number of idiosyncrasies. You will see some these today

13  AMD Opetron (K10)  The dynamic scheduling pick window is 32 bytes length while instructions can be bytes in length. So, scheduling can be adversely affected by instruction length  The branch target buffer (BTB) can only maintain 3 branch history entries per 16 bytes  Branch indicators are aligned at odd numbered positions within 16 byte code blocks. So, 1-byte branches like return instructions, if misaligned will be miss predicted

14  Intel i7 (Nehalem)  The number of read ports for the register file is too small. This can result in stalls when reading registers  Instruction fetch/decode bandwidth is limited to 16 bytes per cycle. Instruction density can overwhelm the predecoder, which can only manage 6 instructions (per 16 bytes) per cycle

15  In the upcoming discussion of de-optimization techniques, we will present... ...an area of the CPU that it derives from ...some, hopefully, illuminating title ...a general characterization of the de-op. This characterization may apply to many different CPU architectures. Generally, each of these represents a choice that may be made by a hardware designer ...a specific characterization of the de-op on the AMD Opteron. This characterization will apply only to the Opterons on Hydra

16 So, without further adieu...

17  Decoding Bandwidth  Execution Latency Instruction Fetching and Decoding

18  De-optimization #1 - Decrease Decoding Bandwidth [AMD05] Scenario #1  Many CISC architectures offer combined load and execute instructions as well as the typical discrete versions  Often, using the discrete versions can decrease the instruction decoding bandwidth Example: add rax, QWORD PTR [foo]

19  De-optimization #1 - Decrease Decoding Bandwidth (cont'd) In Practice #1 - The Opteron  The Opteron can decode 3 combined load-execute (LE) instructions per cycle  Using discrete LE instruction will allow us to decrease the decode rate Example: mov rbx, QWORD PTR [foo] add rax, rbx

20  De-optimization #1 - Decrease Decoding Bandwidth (cont'd) Scenario #2  Use of instruction with longer encoding rather than those with shorter encoding to decrease the average decode rate by decreasing the number of instruction that can fit into the L1 instruction cache  This also effectively “shrinks” the scheduling pick window  For example, use 32-bit displacements instead of 8-bit displacements and 2-byte opcode form instead of 1-byte opcode form of simple integer instructions

21  De-optimization #1 - Decrease Decoding Bandwidth (cont'd) In Practice #2 - The Opteron  The Opteron has short and long variants of a number of its instructions, like indirect add, for example. We can use the long variants of these instructions in order to drive down the decode rate  This will also have the affect of “shrinking” the Opteron’s 32-byte pick window for instruction scheduling. Example of long variant: 81 C add eax, h ;2-byte opcode form 83 C3 FB FF FF FF add ebx, -5 ;32-bit immediate value 0F jz label1 ;2-byte opcode, 32-bit immediate ;value

22  De-optimization #1 - Decrease Decoding Bandwidth (cont'd) A balancing act  The scenarios for this de-optimization have flip sides that could make them difficult to implement  For example, scenario #1 describes using discrete load-execute instructions in order to decrease the average decode rate. However, sometimes discrete load-execute instructions are called for:  The discrete load-execute instructions can provide the scheduler with more flexibility when scheduling  In addition, on the Opteron, they consume less of the 32-byte pick window, thereby giving the scheduler more options

23  De-optimization #1 - Decrease Decoding Bandwidth (cont'd) When could this happen?  This de-optimization could occur naturally when:  A compiler does a very poor job  The memory model forces long version encodings of instructions, e.g. 32-bit displacements Our prediction for implementation  We predict mixed results when trying to implement this de- optimization

24  De-optimization #2 - Increase execution latency [AMD05] Scenario  CPUs often have instructions that can perform almost the same operation  Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low latency version would suffice, code can be de-optimized

25  De-optimization #2 - Increase execution latency In Practice - The Opteron  We can use 16-bit LEA instruction, which is a VectorPath instruction to reduce the decode bandwidth and increase execution latency  The LOOP instruction on the Opteron has a latency of 8 cycles, while a test (like DEC) and jump (like JNZ) has a latency of less than 4 cycles  Therefore, substituting LOOP instructions for DEC/JNZ combinations will be a de-optimization.

26  De-optimization #2 - Increase execution latency (cont'd) When could this happen?  This de-optimization could occur if the user simply does the following: float a, b; b = a / 100.0; instead of: float a, b; b = a * 0.01 Our prediction for implementation  We expect this de-op to be clearly reflected in an increase in clock cycles

27  Address Generation interlocks  Register Pressure  Loop Re-rolling Instruction Scheduling

28  De-optimization #1 - Address-generation interlocks [AMD05] Scenario  Scheduling loads and stores whose addresses cannot be calculated quickly ahead of loads and stores that require the declaration of a long dependency chain  In order to generate their addresses can create address- generation interlocks. Example: add ebx, ecx; Instruction 1 mov eax, DWORD PTR [10h]; Instruction 2 mov edx, DWORD PTR [24h]; Place lode above ; instruction 3 to avoid AGI stall mov ecx, DWORD PTR [eax+ebx]; Instruction 3

29  De-optimization #1 - Address-generation interlocks (cont'd) In Practice - The Opteron  The processor schedules instructions that access the data cache (loads and stores) in program order.  By randomly choosing the order of loads and stores, we can seek address-generation interlocks. Example: add ebx, ecx; Instruction 1 mov eax, DWORD PTR [10h]; Instruction 2 (fast address calc.) mov ecx, DWORD PTR [eax+ebx]; Instruction 3 (slow address calc.) mov edx, DWORD PTR [24h] ; This load is stalled from accessing ; the data cache due to the long ; latency caused by generating the ; address for instruction 3

30  De-optimization #1 - Address-generation interlocks (cont'd) When could this happen?  This happen when we have a long chain dependency of loads and stores addresses a head of one that can be calculated quickly. Our prediction for implementation:  We expect an increasing in the number of clock cycles by using this de-optimization technique.

31  De-optimization #2 - Increase register pressure [AMD05] Scenario  Avoid pushing memory data directly onto the stack and instead load it into a register to increase register pressure and create data dependencies. Example: In Practice - The Opteron  Permit code that first loads the memory data into a register and then pushes it into the stack to increase register pressure and allows data dependencies. Example: push mem mov rax, mem push rax

32  De-optimization #2 - Increase register pressure When could this happen?  This could take place by different usage of instruction load and store, when we have a register and we load an instruction into a register and we push it into a stack Our prediction for implementation:  We expect the performance will be affected by increasing the register pressure

33  De-optimization #3 - Loop Re-rolling Scenario  Loops not only affect branch prediction. They can also affect dynamic scheduling  How ?  Let instructions 1 and 2 be within loops A and B, respectively. 1 and 2 could be part of a unified loop. If they were, then they could be scheduled together. Yet, they are separate and cannot be In Practice - The Opteron  Given that the Opteron is 3-way scalar, this de-optimization could significantly reduce IPC

34  De-optimization #3 - Loop Re-rolling When could this happen?  Easily, in C, this would be two consecutive loops each containing one or more many instructions such that the loops could be combined Our prediction for implementation  We expect this de-op to be clearly reflected in an increase in clock cycles Example: --- Version for( i = 0; i < n; i++ ) { quadratic_array[i] = i * i; cubic_array[i] = i * i * i; } --- Version for( i = 0; i < n; i++ ) { quadratic_array[i] = i * i; } for( i = 0; i < n; i++ ) { cubic_array[i] = i * i * i; }

35  Store-to-load dependency  Costly Instruction Instruction Type Usage

36  De-optimization #1 – Store-to-load dependency Scenario  Store-to-load dependency takes place when stored data needs to be used shortly.  This is commonly used.  This type of dependency increases the pressure on the load and store unit and might cause the CPU to stall especially when this type of dependency occurs frequently. Example: for (k=1;k<1000;k++) { x[k]=temp+y[k]; temp=x[k]; } for (k=1;k<1000;k++) { x[k]=x[k-1]+y[k]; }

37  De-optimization #1 – Store-to-load dependency When could this happen?  In many instructions, when we load the data which is stored shortly. Our prediction for implementation:  We expect this de-optimization results in lower performance to the load store unit.

38  De-optimization #2 – Using equivalent more costly instruction Scenario  Some instructions can do the same job but with more cost in term of number of cycles In Practice- The Opetron  Integer division for Opetron costs cycles for signed, and unsigned  While it takes only 3-8 cycles for both signed and unsigned multiplication

39  De-optimization #2 – Using equivalent more costly instruction (cont'd) When could this happen?  Integer division and multiplication for many codes Example: int i, j, k, m; (a)- m = i / j / k; (b)- m = i / (j * k); Our prediction for implementation:  This de-optimization significantly will increase number of cycles

40  Branch Density  Branch Patterns  Non-predictable Instructions Branch Prediction

41  De-optimization #1 - Branch Density Scenario Compare R1, 10 Jump-if-equal handle_10 Jump-if-less-than handle_lt_10 Call set_up_for_gt10  There are 3 consecutive branch instructions that must be predicted  Whether or not a bubble is created is dependent upon the hardware  However, at some point, the hardware can only predict so much and pre-load so much code  This de-optimization attempts to overwhelm the CPUs ability to predict a branch code

42  De-optimization #1 - Branch Density (cont'd) In Practice - The Opteron Scenario DEC R1 JZ handle_n1 DEC R1 JZ handle_n2 DEC R1 JZ handle_n3 DEC R1 JZ handle_n4  Most branch instruction are two bytes long.  These 8 instructions can take up as little as 16 bytes on an Opteron

43 401399: 8b mov 0x10(%esp),%eax 40139d: 48 dec %eax 40139e: 74 7a je 40141a 4013a0: 8b 0f mov (%edi),%ecx 4013a2: 74 1b je 4013bf 4013a4: 49 dec %ecx 4013a5: 74 1f je 4013c6 4013a7: 49 dec %ecx 4013a8: je 4013cf 4013aa: 49 dec %ecx 4013ab: 74 2b je 4013d8 4013ad: 49 dec %ecx 4013ae: je 4013e1 4013b0: 49 dec %ecx 4013b1: je 4013ea 4013b3: 49 dec %ecx 4013b4: 74 3d je 4013f3 4013b6: 49 dec %ecx 4013b7: je 4013fc 4013b9: 49 dec %ecx 4013ba: je bc: 49 dec %ecx 4013bd: 74 4f je 40140e

44  De-optimization #1 - Branch Density (cont'd) In Practice - The Opteron  However, the Opteron's BTB (Branch Target Buffer) can only maintain 3 (used) branch entries per (aligned) 16 bytes of code [AMD05]  Thus, the Opteron cannot successfully maintain predictions for all of the branches within previous sequence of instructions  Why? 9 branch indicators associated with byte# 0,1,3,5,7,9,11, 13, & 15  Only 3 branch selectors

45  De-optimization #1 - Branch Density (cont'd) When could this happen?  Having dense branches is not that unusual. Most compilers translate case/switch statement to a comparison chain, which is implemented as dec/jz instruction. Our prediction for implementation  By seeking dense branches we expect the branch prediction unit to have more mispredictions.

46  De-optimization #2 - Branch Patterns Scenario Consider the following algorithm: Algorithm Even-Number-Sieve  Input: An array of random numbers  Output: An array of numbers where the odd numbers have been replaced with zero  Even-Number-Sieve must have a branch within that depends upon whether the current array entry is even or odd  Given an even probability distribution, there will be no pattern that can be selected that will yield better than 50% success

47  De-optimization #2 - Branch Patterns  Let the word “parity” refer to a branch that has an even chance of being taken as not taken  The Odd/Even branch within Even-Number-Sieve has parity  Furthermore, it has no simple pattern that can be predicted  Yet, data need not be random. All we need is a pattern whose repetition outstrips the hardware bits used to predict it  In fact, given the right pattern, branch prediction can be forced to perform with a success rate well less than 50%

48  De-optimization #3 - Unpredictable Instructions Scenario  Some CPUs restricts only one branch instruction be within a certain number bytes  If this exceeded or if branch instructions are not aligned properly, then branches cannot be predicted  Misprediction can take place with undesirable usage of recursion [AMD05]  Far control transfer usually mispredicted for different types of architecture

49  De-optimization #3 - Unpredictable Instructions In practice - The Opetron  RET instruction may only take up one byte  If a branch instruction immediately precedes a one byte RET instruction, then RET cannot be predicted. Example:  One byte RET instruction can cause a misprediction even if we one branch instruction per 16 bytes. --- Case Jmp Label1 ……… Label1: RET --- Case Jz Label1 RET ……… Label1:

50  De-optimization #3 - Unpredictable Instructions When could this happen?  When branch instructions are not aligned properly, then branches cannot be predicted  Having a branch ended at an even address followed by single- byte return instruction, will cause a conflicting of using the branch selector and will cause a miss prediction most of the time.

51  De-optimization #3 - Unpredictable Instructions  Return address would be mispredicted for recursive function if there is any other call with in the recursive function (except for itself) Example: long fac (long a) { if (a==0) { return (1);} else {add (a); // this case the returns to be mispredicted. return (a*fac(a-1)) } } When could this happen?  A function call which take place before the return of the recursive function.

52  De-optimization #3 - Unpredictable Instructions Our prediction for implementation  We expect this de-op technique to have a real affect on the performance in term of increasing the number of misprediction  Minimum penalty of misprediction for Opetron is 10 cycles [AMD11]

53  Whether based on cost or some other reason, the decisions made by CPU designers introduced many idiosyncrasies into various CPU designs  As we’ve shown, these idiosyncrasies offer manifest opportunities to de-optimize software. All of this is in spite of how far abstract hardware design techniques, like dynamic scheduling, have come  So, like Plato’s circle, the difference between the abstract and the concrete is significant  By examining these idiosyncrasies as de-optimizations, you’ve seen that ignoring hardware is to be done at your peril…or you’d better have a fantastic compiler

54  In the next phase, we will show the practical effects of ignoring hardware  We will show that some of these de- optimizations have teeth

55 [AMD05] AMD64 Technology. Software Optimization Guide for AMD64 Processors, 2005 [AMD11] AMD64 Technology. AMD64 Architecture Programmers Manual, Volume 1: Application Programming [AMD11] AMD64 Technology. AMD64 Architecture Programmers Manual, Volume 2: System Programming [AMD11] AMD64 Technology. AMD64 Architecture Programmers Manual, Volume 3: General Purpose and System Instructions. 2011

56


Download ppt "Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed Finding the Limits of Hardware Optimization through Software De-optimization Presented By:"

Similar presentations


Ads by Google