Pipeline Enhancements for the Y86 Architecture

Kelly Carothers

Enhancements Done
Hardware:
* BTFNT branch jumping
* Load-forwarding for variables
Software:
* Use of IADDL
* Rearrangement of code
* Loop unrolling
The enhancements are presented in the order they were made, and the Avg CPE values are cumulative.

Load-forwarding
The passing of values backwards from further along the pipe before they are written to a register or memory.
* Avg CPE: 17.15
* CPE decrease of 1.00
* Used to prevent stalling by handing values that have yet to be written back to an earlier stage.
* The pipe only stalls for POPL and MRMOVL instructions in the Execute stage.

Load-forwarding from Memory stage to Execute Stage
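The stall rule on this slide can be sketched as a small Python model. This is an illustration only, not the author's HCL: the `Instr` record, its field names, and `must_stall` are hypothetical. It also assumes the usual load/use hazard condition, in which a stall is needed when the instruction in Execute is a load (MRMOVL or POPL) whose destination is a source of the instruction in Decode. With load-forwarding, the stall is assumed to be skipped when the loaded value is only the data being stored by an RMMOVL or PUSHL.

```python
from dataclasses import dataclass
from typing import Optional

LOADS = {"mrmovl", "popl"}     # instructions that read a value from memory
STORES = {"rmmovl", "pushl"}   # instructions that only store the src_a value


@dataclass
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    icode: str
    src_a: Optional[str] = None   # register read as operand A
    src_b: Optional[str] = None   # register read as operand B
    dst_m: Optional[str] = None   # register written from memory, if any


def must_stall(in_execute: Instr, in_decode: Instr, load_forwarding: bool) -> bool:
    """Return True if the decoding instruction has to wait one cycle."""
    if in_execute.icode not in LOADS or in_execute.dst_m is None:
        return False
    uses_a = in_decode.src_a == in_execute.dst_m
    uses_b = in_decode.src_b == in_execute.dst_m
    if load_forwarding and in_decode.icode in STORES:
        # Assumption: the stored data can be forwarded from the Memory
        # stage to the Execute stage, so this dependency no longer stalls.
        uses_a = False
    return uses_a or uses_b


# The copy pair from ncopy: mrmovl (%ebx), %eax ; rmmovl %eax, (%ecx)
load = Instr("mrmovl", src_b="%ebx", dst_m="%eax")
store = Instr("rmmovl", src_a="%eax", src_b="%ecx")
print(must_stall(load, store, load_forwarding=False))
print(must_stall(load, store, load_forwarding=True))
```

Under these assumptions, the load/store pair at the top of the ncopy loop no longer costs a stall cycle per element, which is in line with the reported CPE decrease of 1.00.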

IADDL
A single instruction replaces the IRMOVL and ADDL instruction pair for an immediate add.
* Avg CPE: 14.22
* CPE decrease of 2.93
* Cuts out one instruction each time it replaces an IRMOVL and ADDL pair.
* Frees a register for other purposes.
* The CPE decrease covers both adding the new instruction in hardware and replacing the two instructions in ncopy where applicable.
* Most useful enhancement, with the biggest CPE decrease (likely because it is the most used instruction).

IADDL Implementation
Very simple: the implementation is a mix of the IRMOVL and ADDL instructions without the intermediate storing and loading steps.
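To make the equivalence concrete, here is a minimal Python sketch of the two ways of adding an immediate. It models only register contents and condition codes. The function names and the dictionary-based register file are assumptions made for illustration, not part of the original design.

```python
MASK = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Wrap an integer to a signed 32-bit value."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def add32(a: int, b: int):
    """32-bit add of two signed values, returning (result, condition codes)."""
    result = to_signed(a + b)
    cc = {
        "ZF": result == 0,
        "SF": result < 0,
        "OF": (a < 0) == (b < 0) and (result < 0) != (a < 0),
    }
    return result, cc


def irmovl_addl(regs: dict, imm: int, dst: str, scratch: str = "%edi"):
    """Two-instruction form: irmovl $imm, scratch ; addl scratch, dst."""
    regs = dict(regs)
    regs[scratch] = to_signed(imm)                     # clobbers the scratch register
    regs[dst], cc = add32(regs[scratch], regs[dst])
    return regs, cc


def iaddl(regs: dict, imm: int, dst: str):
    """One-instruction form: iaddl $imm, dst (no scratch register needed)."""
    regs = dict(regs)
    regs[dst], cc = add32(to_signed(imm), regs[dst])
    return regs, cc


start = {"%esi": 7, "%edi": 99}
old_regs, old_cc = irmovl_addl(start, 1, "%esi")
new_regs, new_cc = iaddl(start, 1, "%esi")
print(old_regs, new_regs, old_cc == new_cc)
```

Both forms leave the same value in the destination and set the same condition codes. The only difference in the final state is that the two-instruction form overwrites the scratch register.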

IADDL Code Comparison: Original vs. Modified

Original:
        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
Loop:   mrmovl (%ebx), %eax     # src...
        rmmovl %eax, (%ecx)     # ...and store it to dst
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        irmovl $1, %edi
        addl %edi, %esi         # count++
Npos:   irmovl $1, %edi
        subl %edi, %edx         # len--
        irmovl $4, %edi
        addl %edi, %ebx         # src++
        addl %edi, %ecx         # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:

Modified (with IADDL):
        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
Loop:   mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
Npos:   iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:

BTFNT Branch Jumping
BTFNT: Backwards Taken, Forwards Not Taken.
* Always take the smaller address.
* Avg CPE: 12.37
* CPE decrease of 1.85
* 65% success rate for BTFNT, versus 60% for the default always-taken prediction.
* 2nd most useful enhancement.
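The "always take the smaller address" rule can be sketched in a few lines of Python. The function names and the example addresses are hypothetical. The sketch assumes the predictor chooses between the branch target and the fall-through address (the address of the instruction after the jump).

```python
def next_pc_always_taken(fall_through: int, target: int) -> int:
    """Default prediction: every conditional jump is assumed taken."""
    return target


def next_pc_btfnt(fall_through: int, target: int) -> int:
    """BTFNT: a backward target is the smaller address, so it is taken;
    a forward target is the larger address, so the fall-through is chosen."""
    return min(fall_through, target)


# Hypothetical addresses: a jump at 0x40 whose next instruction is at 0x45.
print(hex(next_pc_btfnt(0x45, 0x20)))   # backward branch, e.g. a loop-closing jump
print(hex(next_pc_btfnt(0x45, 0x60)))   # forward branch, e.g. skipping an update
```

Because loop-closing jumps point backwards and are taken on most iterations, this rule tends to guess them correctly, while forward jumps that skip a few instructions are predicted to fall through.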

Code Rearrangement
* Code was arranged specifically for BTFNT
* Many unnecessary checks removed
* Avg CPE: 11.71 (no loop unrolling)
* CPE decrease of .66

Code Rearrangement: IADDL Mod vs. End Result

IADDL mod:
        # Loop header
        xorl %esi,%esi          # count = 0;
        andl %edx,%edx          # len <= 0?
        jle Done                # if so, goto Done:
        # Loop body.
Loop:   mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
        andl %eax, %eax         # val <= 0?
        jle Npos                # if so, goto Npos:
        iaddl $1, %esi          # count++
Npos:   iaddl $-1, %edx         # len--
        iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %edx,%edx          # len > 0?
        jg Loop                 # if so, goto Loop:

End result:
        rrmovl %edx, %esi       # count = len
        iaddl $1, %edx          # len++ (offsets the first decrement below)
Loop:   iaddl $-1, %edx         # len--
        jle Done                # if len <= 0, goto Done:
Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax         # val <= 0?
        jle decEsi              # if so, goto decEsi:
        jmp Loop
decEsi: iaddl $-1, %esi         # count--
        jg Loop                 # if count > 0, goto Loop:
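One change visible in the rearranged listing is that the counter starts at len and is decremented for each non-positive value, rather than starting at 0 and being incremented for each positive one. The Python sketch below shows that the two counting schemes agree. The function names are hypothetical and the lists stand in for the src and dst memory regions.

```python
def ncopy_count_up(src):
    """Original scheme: count starts at 0 and goes up for each positive value."""
    dst, count = [], 0
    for val in src:
        dst.append(val)
        if val > 0:
            count += 1
    return dst, count


def ncopy_count_down(src):
    """Rearranged scheme: count starts at len and goes down for each
    non-positive value, so the update sits on the forward-branch path."""
    dst, count = [], len(src)
    for val in src:
        dst.append(val)
        if val <= 0:
            count -= 1
    return dst, count


data = [3, -1, 0, 7, 2]
print(ncopy_count_up(data))
print(ncopy_count_down(data))
```

Both functions return the same copy and the same count for any input list, because every element is either positive or non-positive.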

Loop Unrolling
* Increases code size
* Decreases CPE
* More unrolling = faster code because of less looping, but much larger size due to repeated code.
* A cheap way to decrease the CPE.

Loop Unrolling: No Unrolling vs. 1 Unroll

No unrolling:
Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        jmp Loop

1 unroll:
Loop1:  mrmovl (%ebx), %eax
        rmmovl %eax, (%ecx)
Npos:   iaddl $4, %ebx          # src++
        iaddl $4, %ecx          # dst++
        andl %eax, %eax
        jle decEsi
        iaddl $-1, %edx
        jle Done
        mrmovl (%ebx), %eax
        iaddl $4, %ebx
        iaddl $4, %ecx
        jmp Loop
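As a language-neutral illustration of the idea, the following Python sketch unrolls the copy-and-count loop by two. It is not a translation of the Y86 listing above. The function names are hypothetical, and the leftover-element handling is one common way to deal with odd lengths.

```python
def ncopy_simple(src):
    """Reference version: one element per loop iteration."""
    dst = [0] * len(src)
    count = 0
    for i, val in enumerate(src):
        dst[i] = val
        if val > 0:
            count += 1
    return dst, count


def ncopy_unrolled_by_two(src):
    """Two elements per iteration: half as many loop tests and jumps,
    at the cost of a duplicated loop body."""
    n = len(src)
    dst = [0] * n
    count = 0
    i = 0
    while i + 1 < n:
        a, b = src[i], src[i + 1]
        dst[i], dst[i + 1] = a, b
        count += (a > 0) + (b > 0)
        i += 2
    if i < n:                      # one leftover element when n is odd
        dst[i] = src[i]
        count += src[i] > 0
    return dst, count


print(ncopy_unrolled_by_two([4, -2, 9, 0, 1]))
```

The duplicated body is what makes the code larger, and the reduced number of loop-control instructions per element is where the CPE gain comes from.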

Loop Unrolling Results
* No unrolling, base Avg CPE: 11.64
* 1 unroll, Avg CPE: 11.16
* 2 unrolls, Avg CPE: 11.00
More unrolling = faster code because of less looping, but much larger size due to repeated code.
* No unrolling is the same CPE as after code rearrangement.
* 1 unroll is .48 less than no unrolling.
* 2 unrolls is .64 less than no unrolling.

Total Results
* Initial Avg CPE: 18.15
* Final Avg CPE: 11.00
* Total decrease in CPE: 7.15
The final Avg CPE is based on 2 loop unrolls, which seemed to be the best choice performance-wise because the gain from further unrolling was small and seemed to shrink exponentially.

Final Results

Enhancement          Avg CPE    CPE Decrease
None                 18.15      -------
Load-Forwarding      17.15      1.00
IADDL                14.22      2.93
BTFNT                12.37      1.85
Code Rearranging     11.64      .73
1 Loop Unrolled      11.16      .48
2 Loops Unrolled     11.00      .16
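Since the Avg CPE values are cumulative, each reported decrease should equal the difference between consecutive rows, and the decreases should sum to the overall drop from 18.15 to 11.00. A short Python check of that arithmetic follows. Values are held in hundredths to avoid floating-point rounding, and the names are hypothetical.

```python
INITIAL = 1815  # initial Avg CPE of 18.15, in hundredths

# (enhancement, cumulative Avg CPE, reported decrease), all in hundredths
STEPS = [
    ("Load-Forwarding", 1715, 100),
    ("IADDL", 1422, 293),
    ("BTFNT", 1237, 185),
    ("Code Rearranging", 1164, 73),
    ("1 Loop Unrolled", 1116, 48),
    ("2 Loops Unrolled", 1100, 16),
]


def total_decrease(initial, steps):
    """Verify each row against the previous one and return the summed decrease."""
    previous = initial
    total = 0
    for name, cpe, decrease in steps:
        assert previous - cpe == decrease, name
        total += decrease
        previous = cpe
    return total


total = total_decrease(INITIAL, STEPS)
print(f"total decrease: {total / 100:.2f}")
```

The per-step decreases add up to 7.15, which matches 18.15 minus 11.00.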
