Real-time Signal Processing on Embedded Systems Advanced Cutting-edge Research Seminar I&III.


1 Real-time Signal Processing on Embedded Systems Advanced Cutting-edge Research Seminar I&III

2 Advances in Microprocessor Technology

3 Architectural improvements of microprocessors: pipelining; parallel processing exploiting ILP (superscalar, VLIW); SIMD

4 Procedure of instruction execution on a processor. Instruction Fetch (IF): fetches an instruction from main memory. Instruction Decode (ID): decodes the fetched instruction. Execution (EX): executes the decoded instruction. Memory Access (MA): accesses main memory. Write Back (WB): writes data back to registers.

5 Operation cycles on a processor. Single-cycle machine: this kind of machine executes all steps from IF to WB in one cycle. The clock speed is determined by the slowest instruction, because every instruction must complete in one cycle. Multi-cycle machine: this kind of machine executes an instruction over several cycles (IF, ID, EX, MA, WB).

6 Pipelined operation can improve instruction throughput. Each instruction passes through IF, ID, EX, MA, WB, and successive instructions overlap instead of running one after another. To realize pipelined operation, several techniques are required.
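
The throughput gain can be made concrete with a small calculation. The following is a minimal sketch, assuming an ideal pipeline with no hazards and one instruction entering per cycle; the function names are illustrative and do not come from the slides.

```c
/* Cycle counts for n instructions on a machine with k stages.
 * Assumption: no hazards, so the pipeline never stalls. */

/* Multi-cycle machine: each instruction occupies all k stages alone. */
unsigned long multicycle_cycles(unsigned long n, unsigned long k)
{
    return n * k;
}

/* Pipelined machine: the first instruction needs k cycles,
 * and each further instruction completes one cycle later. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k)
{
    return n == 0 ? 0 : k + (n - 1);
}
```

For the 5-stage pipeline of the slides, 4 instructions take 20 cycles on the multi-cycle machine and 8 cycles when pipelined.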

7 Causes of pipeline hazards. Structural hazard: the hardware cannot support the combination of instructions in flight. Data hazard: a later instruction must wait for an earlier instruction to complete because it uses that instruction's result. Control hazard: whether an instruction is executed depends on the result of an earlier instruction.

8 Structural hazard. Pipeline: four overlapped instructions, each IF ID EX MA WB. CPU: PC, memory, instruction decoder, instruction register, ALU, registers.

12 Structural hazard. Pipeline: four overlapped instructions, each IF ID EX MA WB. CPU: PC, memory, instruction decoder, instruction register, ALU, registers. The MA stage of the first instruction and the IF stage of the fourth instruction conflict, because both access memory in the same cycle.

13 Structural hazard. Resolve 1: stall the next instruction.

14 Structural hazard. Resolve 1: stall the next instruction.

15 Structural hazard. MA/IF conflict. Resolve 2: add another bus for accessing the instruction memory.

16 Structural hazard. Resolve 2: add another bus for accessing the instruction memory, so that the CPU has a separate instruction memory (Inst Mem) and data memory (Data Mem). This is the Harvard architecture.

17 Data hazard. Pipeline: two overlapped instructions, each IF ID EX MA WB. CPU: PC, memory, instruction decoder, instruction register, ALU, registers (t0 to t4, s0 to s4). Code: add $s0,$t0,$t1 ($s0=$t0+$t1); sub $t2,$s0,$t3 ($t2=$s0-$t3).

18 Data hazard. The add computes $s0=$t0+$t1.

19 Data hazard. The sub computes $t2=$s0-$t3 before the add has written $s0 back to the registers.

20 Data hazard. The sub therefore reads the old value of $s0 and produces a wrong result: -2=0-2.

21 Data hazard. Waiting by stalls: 3 cycles are consumed until $s0 is available.

22 Data hazard. Resolve: forwarding.

23 Data hazard. Resolve: forwarding. The result of the add is forwarded to the ALU.

24 Data hazard. Resolve: forwarding. The result of the add is forwarded to the ALU, so the sub computes $t2=9-$t3, that is, 7=9-2.
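
The difference between the stale read and the forwarded read can be sketched as follows. This is an illustrative model, not a pipeline simulator: the `Regs` struct and `run_add_sub` are hypothetical names, and the initial values $t0=4 and $t1=5 are assumptions chosen only so that the sum is 9 as on the slide.

```c
#include <stdbool.h>

/* Register values used by the two-instruction example. */
typedef struct { int s0, t0, t1, t2, t3; } Regs;

/* Models add $s0,$t0,$t1 followed immediately by sub $t2,$s0,$t3.
 * Without forwarding (and without stalls) the sub reads $s0 before
 * the add has written it back, so it sees the old value. */
int run_add_sub(Regs *r, bool forwarding)
{
    int add_result = r->t0 + r->t1;                       /* EX of add */
    int s0_seen_by_sub = forwarding ? add_result : r->s0; /* r->s0 is still old */
    int sub_result = s0_seen_by_sub - r->t3;              /* EX of sub */

    r->s0 = add_result;  /* WB of add */
    r->t2 = sub_result;  /* WB of sub */
    return r->t2;
}
```

With an old $s0 of 0 and $t3=2, the call without forwarding yields -2 and the call with forwarding yields 7, matching the two slide results.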

25 Control hazard. An instruction sequence including a branch: add $s0,$t0,$t1 ($s0=$t0+$t1); beq $s1,$s2,40 (if($s1==$s2){goto 40}); or $s3,$s4,$t2 ($s3=$s4|$t2). Pipeline: three overlapped instructions, each IF ID EX MA WB. CPU: PC, instruction decoder, instruction register, ALU, registers. PC: 10. ※ In this explanation, the PC holds a word address for simplicity.

26 Control hazard. An instruction sequence including a branch.

27 Control hazard. An instruction sequence including a branch. PC: 11.

28 Control hazard. An instruction sequence including a branch. PC: 12. The PC value for the next instruction depends on the branch condition. Branch taken: PC=40. Not taken: PC=12.

29 Control hazard. Resolve 1: stall. A 2-cycle stall is inserted after the branch. The number of required stall cycles is determined by the architecture.

30 Control hazard. Resolve 1: stall. A 1-cycle stall is enough if the processor can calculate the branch target address at the ID stage.

31 Control hazard. Resolve 2: branch prediction. PC: 10. In this example, the next PC is predicted as if the branch is never taken.

32 Control hazard. Resolve 2: branch prediction. PC: 11. In this example, the next PC is predicted as if the branch is never taken.

33 Control hazard. Resolve 2: branch prediction. PC: 12. In this example, the next PC is predicted as if the branch is never taken.

34 Control hazard. Resolve 2: branch prediction. PC: 40. If the prediction misses, in other words if the branch is taken, a stall occurs.

35 Control hazard. A more practical scheme: dynamic branch prediction. n-bit counter-based prediction: the lower i bits of the address of a branch instruction index a Branch History Table, and each entry is an n-bit saturating up/down counter.

36 1-bit counter-based prediction. State 1: predict the branch will be taken. State 0: predict the branch will not be taken. A taken branch moves the counter to 1; a not-taken branch moves it to 0.

37 2-bit counter-based prediction. States 11 and 10: predict the branch will be taken. States 01 and 00: predict the branch will not be taken. A taken branch increments the counter and a not-taken branch decrements it, saturating at 11 and 00. This scheme is adopted in the Intel Pentium, Sun UltraSPARC, MIPS R10000, etc.
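
A 2-bit counter-based predictor can be sketched in a few lines. This is a minimal sketch under stated assumptions: the table width `BHT_BITS` (the "i" of the slide) is set to 4 arbitrarily, counters start at 00, and the names `Bht`, `bht_predict`, and `bht_update` are illustrative rather than taken from any real processor.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 4                    /* assumed index width i */
#define BHT_SIZE (1u << BHT_BITS)

/* Branch History Table: one 2-bit saturating counter (0..3) per entry. */
typedef struct { uint8_t counter[BHT_SIZE]; } Bht;

/* The lower i bits of the branch address select the entry. */
static unsigned bht_index(uint32_t branch_addr)
{
    return branch_addr & (BHT_SIZE - 1u);
}

/* Counter values 2 and 3 (binary 10, 11) predict taken. */
bool bht_predict(const Bht *t, uint32_t branch_addr)
{
    return t->counter[bht_index(branch_addr)] >= 2;
}

/* Count up on a taken branch, down on a not-taken one, saturating at 3 and 0. */
void bht_update(Bht *t, uint32_t branch_addr, bool taken)
{
    uint8_t *c = &t->counter[bht_index(branch_addr)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Because the counter must change twice before the prediction flips, a single mispredicted iteration (for example a loop exit) does not change the prediction for the next execution of the loop, which is the advantage over the 1-bit scheme.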

38 Control hazard. Resolve 3: delayed branch. An instruction that has no dependency is inserted after the branch: add $s0,$t0,$t1 ($s0=$t0+$t1); beq $s1,$s2,40 (if($s1==$s2){goto 40}); inserted instruction; or $s3,$s4,$t2 ($s3=$s4|$t2). Pipeline: four overlapped instructions, each IF ID EX MA WB. PC: 11.

39 Control hazard. Resolve 3: delayed branch. An instruction that has no dependency is inserted after the branch. PC: 12.

40 Control hazard. Resolve 3: delayed branch. An instruction that has no dependency is inserted after the branch. PC: 13 or 40. After the inserted instruction, the instruction at the determined address is executed.

41 Exploiting ILP (Instruction Level Parallelism). Superscalar: issues multiple instructions per cycle with hardware support. Advantage: binary compatibility. VLIW: issues multiple instructions per cycle with compiler support. Advantage: simple hardware.

42 Types of data dependence. True data dependence (RAW: Read After Write), e.g. i1: r2=r1+r3; i2: r4=r2+1. It is difficult to remove. Anti-dependence (WAR: Write After Read) and output dependence (WAW: Write After Write), e.g. i1: r1=r2+r3; i2: r2=r4+1; i3: r1=r4+2, where i2 has an anti-dependence on i1 (r2) and i3 has an output dependence on i1 (r1). These can be removed by register renaming and are called artificial dependences.
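
The three kinds of dependence can be detected by comparing register numbers. The following is a minimal sketch; the `Instr` struct, the `DEP_*` flags, and `classify` are hypothetical names, and an unused source register is encoded as -1 by assumption.

```c
/* Register numbers of one instruction; -1 marks an unused source.
 * Assumption: dst is always a valid register number (>= 0). */
typedef struct { int dst, src1, src2; } Instr;

enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* Returns a bit mask of the dependences of `later` on `earlier`. */
int classify(Instr earlier, Instr later)
{
    int deps = DEP_NONE;
    /* True dependence: later reads what earlier writes. */
    if (later.src1 == earlier.dst || later.src2 == earlier.dst)
        deps |= DEP_RAW;
    /* Anti-dependence: later overwrites what earlier reads. */
    if (later.dst == earlier.src1 || later.dst == earlier.src2)
        deps |= DEP_WAR;
    /* Output dependence: both write the same register. */
    if (later.dst == earlier.dst)
        deps |= DEP_WAW;
    return deps;
}
```

Applied to the slide's examples, `classify` reports RAW for the first pair, WAR between i1 and i2 of the second example, and WAW between its i1 and i3.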

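43 Basic architecture of a superscalar processor. Frontend: instruction cache, instruction decode, register renaming, branch prediction. Ex-core: instruction window, function units, registers, data cache. Backend: reorder buffer. Instructions pass from the frontend to the instruction window (dispatch), from the instruction window to a function unit (issue), and leave through the backend (commit).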

44 Basic function of the frontend. It provides enough instructions; predicts the next instruction address when a branch instruction appears; resolves artificial dependences by register renaming; analyzes true data dependences after register renaming; and transfers instructions after the above operations. This transfer is called “dispatch”.

45 Basic function of the ex-core. It finds as many independent instructions as possible among those stored in the “instruction window”. In this operation, dynamic scheduling is performed to satisfy several restrictions: data dependences, resources, predefined priorities, etc. It then executes the independent instructions in parallel. Transferring an instruction to a function unit is called “issue”.

46 Basic function of the backend. It updates the processor state. Results obtained out of order are reordered into program order, so that the processor state is updated precisely. Updating the processor state based on an execution result is called “commit”. Removal of an instruction from the processor is called “retire”.

47 Dynamic instruction scheduling. Instruction scheduling means determining the order in which instructions are issued and when each instruction is issued. In superscalar processors, dynamic instruction scheduling is performed on the instructions stored in the instruction buffer. In the following slides, dynamic scheduling is explained using several types of processors: a 1-way in-order processor, an i-way in-order processor, and an i-way out-of-order processor.

48 1-way in-order issue. At most 1 instruction is issued per cycle. The size of the instruction window is 1, because if an instruction cannot be issued, no subsequent instruction can be issued either. Only true and output dependences need to be checked, because anti-dependences are always resolved.

49 Control by R flag. The R flag is used to check true and output dependences. An instruction has the fields op, dst, src1, and src2; each register, selected by register number, holds an R flag and a value. An instruction is issued only when R(dst) == true && R(src1) == true && R(src2) == true. (This condition is called “ready”.) R == false means the register is reserved but the result has not been stored yet. In this case, the operand is not available.

50 Update sequence of the R flag. The R bit of the destination becomes false when an instruction is issued. The R bit of the destination becomes true when the result is stored in the destination. With these updates, instructions using unavailable registers as source registers are not issued, so true dependences are resolved; instructions using an unavailable register as the destination register are not issued, so output dependences are resolved. In practice, resource restrictions must also be satisfied to issue an instruction, in addition to the dependence check. In this lecture, only the restriction on function units is considered, to simplify the discussion.
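
The R-flag rules of the last two slides can be sketched as a small register-file model. This is an illustration under assumptions: the register count `NUM_REGS` is arbitrary, every instruction has two sources, function-unit availability is ignored, and all names (`RegFile`, `issue`, `write_back`) are hypothetical.

```c
#include <stdbool.h>

#define NUM_REGS 8   /* assumed number of registers */

/* Each register holds an R flag and a value. */
typedef struct { bool ready[NUM_REGS]; int value[NUM_REGS]; } RegFile;

/* Register numbers of the instruction's operands. */
typedef struct { int dst, src1, src2; } Instr;

/* Initially no register is reserved. */
void regfile_init(RegFile *rf)
{
    for (int r = 0; r < NUM_REGS; r++) {
        rf->ready[r] = true;
        rf->value[r] = 0;
    }
}

/* "Ready" condition: R(dst) && R(src1) && R(src2). */
bool is_ready(const RegFile *rf, Instr in)
{
    return rf->ready[in.dst] && rf->ready[in.src1] && rf->ready[in.src2];
}

/* Issue: only a ready instruction is issued; its destination is reserved. */
bool issue(RegFile *rf, Instr in)
{
    if (!is_ready(rf, in))
        return false;
    rf->ready[in.dst] = false;
    return true;
}

/* Storing the result makes the destination available again. */
void write_back(RegFile *rf, int dst, int result)
{
    rf->value[dst] = result;
    rf->ready[dst] = true;
}
```

An instruction that reads a reserved register fails the source checks (true dependence), and one that writes a reserved register fails the destination check (output dependence); both simply retry in a later cycle.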

51 i-way in-order issue. Consider how the following 4 instructions are executed on this processor. i1: r1 = r5; i2: r2 = r1 + 1; i3: r3 = r6; i4: r4 = r3 + 1. In-order scheduling:
    Cycle   Function Unit 0    Function Unit 1
    0       i1: r1 = r5
    1       i2: r2 = r1 + 1    i3: r3 = r6
    2       i4: r4 = r3 + 1
IPC becomes 1.3 (4 instructions / 3 cycles).

52 How to check the dependences of instructions? True and output dependences must be checked. Figure: an instruction window holding instructions 0 to i-1 is checked against the registers (R flag and value per register, selected by register number) through 3 × i ports.

53 How to allocate resources (function units)? Allocation of a function unit is performed as follows. Check whether any preceding ready instruction refers to the function unit. If no instruction refers to it, the function unit is available. Repeat the above procedure for every function unit.

54 Complexity of i-way in-order issue Ready detection ports are required. comparators are required for check of operand dependency. Resource allocation input NOR gate is required. Complexity increases by

55 i-way out-of-order issue. Out-of-order scheduling of the same code used in the previous i-way in-order case. i1: r1 = r5; i2: r2 = r1 + 1; i3: r3 = r6; i4: r4 = r3 + 1. Out-of-order scheduling:
    Cycle   Function Unit 0    Function Unit 1
    0       i1: r1 = r5        i3: r3 = r6
    1       i2: r2 = r1 + 1    i4: r4 = r3 + 1
IPC becomes 2.0 (4 instructions / 2 cycles).
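
The two schedules can be reproduced with a small issue simulator. This is a sketch under several assumptions that are not stated on the slides: every operation has a latency of one cycle, each instruction has a single source register, only true dependences are checked (artificial dependences are assumed to be removed by renaming), and any function unit can run any operation. The names `Op` and `schedule_cycles` are illustrative.

```c
#include <stdbool.h>

#define MAX_OPS 16

/* One instruction: destination and single source register number. */
typedef struct { int dst, src; } Op;

/* Returns the number of cycles needed to issue all n ops (n <= MAX_OPS)
 * on `units` function units (units >= 1). */
int schedule_cycles(const Op *ops, int n, int units, bool out_of_order)
{
    int issue_cycle[MAX_OPS];   /* cycle in which op i was issued, -1 if not yet */
    int issued = 0, cycle = 0;

    for (int i = 0; i < n; i++)
        issue_cycle[i] = -1;

    while (issued < n) {
        int used = 0;
        for (int i = 0; i < n && used < units; i++) {
            if (issue_cycle[i] >= 0)
                continue;                       /* already issued */
            bool ready = true;
            for (int j = 0; j < i; j++) {
                /* True dependence on an earlier op whose result is not yet
                 * available (not issued, or issued in this same cycle). */
                if (ops[j].dst == ops[i].src &&
                    (issue_cycle[j] < 0 || issue_cycle[j] >= cycle))
                    ready = false;
            }
            if (ready) {
                issue_cycle[i] = cycle;
                used++;
                issued++;
            } else if (!out_of_order) {
                break;                          /* in-order: younger ops must wait */
            }
        }
        cycle++;
    }
    return cycle;
}
```

For the four instructions above on two function units, the in-order mode needs 3 cycles and the out-of-order mode needs 2, which gives the IPC values of 4/3 and 4/2 quoted on the slides.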

56 Architectural requirements for out-of-order execution The depth of instruction window should be increased to. The number of registers’ ports must be for check of dependence. Anti-dependence must be checked, in addition to the i-way in-order case. Resource allocation can be performed in the same way as the i-way in-order case.

57 Complexity of i-way out-of-order issue Ready detection ports are required. comparators are required for check of operand dependency. Resource allocation input NOR gate is required. Complexity increases by Increase of hardware complexity is more significant than the in-order case because n>>i in general.

58 Tomasulo’s Algorithm. It was proposed by R. M. Tomasulo and was originally adopted in the floating-point unit of the IBM 360/91. Performance was drastically improved. Similar algorithms are used in the latest microprocessors.

59 Superscalar arch using Tomasulo Instruction cache Instruction decode Tag allocation Branch prediction Function unit Registers ・・・・・ Data cache ・・・・・ Frontend Ex-core dispatch issue ・・・・・ Reservation Station

60 Contents of reservation station and register. Register: R, tag, value. The tag is used for register renaming. Reservation station: op (opcode); dtag (destination tag); and, for each of source 1 and source 2, R (ready flag), stag (source tag), and value (operand's value).

61 Operation on the arch Dispatch Issue Execution Finalization

62 Operation on the arch: Dispatch. A dtag is assigned to the destination operand from the tag pool, which holds unassigned tags. Source operands are obtained by reading the registers using each register number: if R is true, the value is read; otherwise the tag is read from the register. Then the instruction is stored in the reservation station corresponding to the function unit that the instruction uses.

63 Operation on the arch: Issue. A ready instruction in a reservation station is sent to the corresponding function unit if that function unit is available. The issued instruction is deleted from the reservation station. Execution: issued instructions are executed on the corresponding function units.

64 Operation on the arch: Finalize. When execution finishes, the dtag and the result value are broadcast on the result bus. If an instruction holds the broadcast dtag as an stag, its R flag is set to true and its value is replaced by the broadcast result. The broadcast result is stored in a register only when that register holds a tag matching the broadcast dtag. Finally, the broadcast tag is returned to the tag pool.
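
The finalize step can be sketched as a broadcast over the reservation stations and registers. This is a minimal model, not a full Tomasulo implementation: the structs mirror the fields named on the slides (R, tag, value, dtag), while the type and function names (`RsEntry`, `Reg`, `broadcast`, `rs_ready`) are hypothetical, and returning the tag to the tag pool is left out.

```c
#include <stdbool.h>

/* Source operand of a reservation-station entry: R flag, stag, value. */
typedef struct { bool ready; int tag; int value; } Operand;

/* Reservation-station entry with destination tag and two sources. */
typedef struct { bool busy; int dtag; Operand src[2]; } RsEntry;

/* Register: R flag, tag of the instruction that will write it, value. */
typedef struct { bool ready; int tag; int value; } Reg;

/* Finalize: broadcast (dtag, result) on the result bus. */
void broadcast(RsEntry *rs, int n_rs, Reg *regs, int n_regs,
               int dtag, int result)
{
    /* Waiting instructions that hold dtag as an stag capture the value. */
    for (int i = 0; i < n_rs; i++) {
        if (!rs[i].busy)
            continue;
        for (int s = 0; s < 2; s++) {
            Operand *op = &rs[i].src[s];
            if (!op->ready && op->tag == dtag) {
                op->value = result;
                op->ready = true;
            }
        }
    }
    /* A register is written only if it still holds the matching tag. */
    for (int r = 0; r < n_regs; r++) {
        if (!regs[r].ready && regs[r].tag == dtag) {
            regs[r].value = result;
            regs[r].ready = true;
        }
    }
}

/* An entry can be issued once both sources are ready. */
bool rs_ready(const RsEntry *e)
{
    return e->busy && e->src[0].ready && e->src[1].ready;
}
```

The tag comparison on the register side is what removes output dependences: if a later instruction has already been given the same destination register, the register holds the later tag and an older result is not written to it.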

65 An example of Tomasulo. The superscalar processor used in this example has the following 5-stage pipeline and is 2-way. IF: fetches 2 instructions. ID: decodes, allocates tags, and dispatches. RS: waits for operands until the instruction becomes ready. EX: executes the instruction. WB: writes the result. Code: i1: r1 = load A; i2: r2 = r1 + r3; i3: r4 = r2 + 1; i4: r2 = load B (A and B are constants).

66 Cycle 0 opDestinationSource 1Source 2 R dtagval RstagvalRstagval Instruction i1: r1 = load A i2: r2 = r1 +r3 i3: r4 = r2 + 1 i4: r2 = load B Stage State of instructions #RtagVal 11X2 21X4 31X7 41X9 Registers 30 ・・・・・・ Tag pool

67 Cycle 1 opDestinationSource 1Source 2 R dtagval RstagvalRstagval Instruction i1: r1 = load A i2: r2 = r1 +r3 i3: r4 = r2 + 1 i4: r2 = load B Stage IF State of instructions #RtagVal 11X2 21X4 31X7 41X9 Registers 30 ・・・・・・ Tag pool

68 Cycle 2 opDestinationSource 1Source 2 R dtagval RstagvalRstagval load050X1XA1X0 add051X050X1X7 Instruction i1: r1 = load A i2: r2 = r1 +r3 i3: r4 = r2 + 1 i4: r2 = load B Stage ID IF State of instructions #RtagVal 1050X 2051X 31X7 41X9 Registers 30 ・・・・・・ Tag pool

69 Cycle 3 opDestinationSource 1Source 2 R dtagval RstagvalRstagval load150151XA1X0 add051X050X1X7 add052X load053X Instruction i1: r1 = load A i2: r2 = r1 +r3 i3: r4 = r2 + 1 i4: r2 = load B Stage EX RS ID State of instructions #RtagVal 1050X 2053X 31X7 4052X Registers 30 ・・・・・・ 54 Tag pool

70 Cycle 4 opDestinationSource 1Source 2 R dtagval RstagvalRstagval load150151XA1X0 add X7 add052X051X1X1 load153161XB1X0 Instruction i1: r1 = load A i2: r2 = r1 +r3 i3: r4 = r2 + 1 i4: r2 = load B Stage WB EX RS EX State of instructions #RtagVal 11X X 31X7 4052X Registers 5030 ・・・・・・ 54 Tag pool

71 Cycle 5 opDestinationSource 1Source 2 R dtagval RstagvalRstagval add X7 add X1 load153161XB1X0 Instruction i1: r1 = load A i2: r2 = r1 +r3 i3: r4 = r2 + 1 i4: r2 = load B Stage WB EX WB State of instructions #RtagVal 11X15 21X16 31X7 4052X Registers ・・・・・・ 54 Tag pool

72 Cycle 6 opDestinationSource 1Source 2 R dtagval RstagvalRstagval add X1 Instruction i1: r1 = load A i2: r2 = r1 +r3 i3: r4 = r2 + 1 i4: r2 = load B Stage WB State of instructions #RtagVal 11X15 21X16 31X7 41X23 Registers ・・・・・・ 54 Tag pool

73 Problem of out-of-order execution. It is difficult to update the processor state precisely if an exception occurs. Code: i0: ・・・・; i1: ・・・・; i2: r1=load r1; i3: r2=load r3; i4: ・・・・; i5: r3 = r4 << r2; i6: ・・・・. In-order execution: i0, i1, and i2 have finished (Fin) and i3 raises an exception (E). Out-of-order execution: i0, i2, and i5 have finished (Fin), i1 has not, and i3 raises an exception (E).

74 Flow of exception handling. Unfinished instructions, including the instruction that caused the exception, are invalidated. Control is transferred to the OS, which saves the current state to main memory and handles the exception. After the exception has been handled, the CPU restarts execution from the instruction that caused the exception.

75 Problem of out-of-order execution. It is difficult to update the processor state precisely if an exception occurs. In-order execution: i0, i1, and i2 have finished (Fin) and i3: r2=load r3 raises an exception (E). The current state is saved. The OS handles the exception. The CPU restarts from i3.

76 Problem of out-of-order execution. It is difficult to update the processor state precisely if an exception occurs. Out-of-order execution: i0, i2, and i5 have finished (Fin), i1 has not, and i3: r2=load r3 raises an exception (E). When the current state is saved, i5: r3 = r4 << r2 has finished before i3 and i1 has not finished, so the data of r3 has been lost. The OS handles the exception, but the CPU cannot restart from i3. A reorder buffer is used for precise exception handling.

77 Reorder buffer. Updates the CPU's state in the original program order by reordering results. Handles exceptions at the state update. Results and information about exceptions enter the reorder buffer; at commit, results are stored to the registers in the original program order and exceptions are detected.

78 Superscalar arch using Tomasulo and reorder buffer Instruction cache Instruction decode Tag allocation Branch prediction Function unit Registers ・・・・・ Data cache ・・・・・ Frontend Ex-core dispatch issue ・・・・・ Reservation Station Reorder Buffer Backend commit

79 Behaviour of reorder buffer. If there is a result without an exception, it is stored to a register and the entry corresponding to it is removed. If there is a result with an exception, the pipeline and the reorder buffer are cleared. If a result has not been stored yet, the reorder buffer waits until the result is obtained.

80 Contents of reorder buffer. Each entry holds PC, R, dreg, dtag, E, and result. PC: instruction address. R: ready flag. dreg: register number of destination. dtag: operand tag of destination. E: exception flag. result: result.
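
The commit behaviour can be sketched with a circular buffer whose entries carry the fields listed above. This is a minimal sketch under assumptions: the buffer and register counts are arbitrary, commit always examines the oldest entry (the head), and the names `Rob`, `RobEntry`, and `rob_commit` are hypothetical.

```c
#include <stdbool.h>

#define ROB_SIZE 8   /* assumed number of reorder-buffer entries */
#define NUM_REGS 8   /* assumed number of registers */

/* One entry: PC, R, dreg, dtag, E, result. */
typedef struct {
    unsigned pc;
    bool ready;
    int dreg;
    int dtag;
    bool exception;
    int result;
} RobEntry;

/* Circular buffer; `head` is the oldest entry, `count` the occupancy. */
typedef struct { RobEntry e[ROB_SIZE]; int head, count; } Rob;

typedef enum { COMMIT_OK, COMMIT_WAIT, COMMIT_EXCEPTION, COMMIT_EMPTY } CommitStatus;

/* Tries to commit the oldest entry, in program order. */
CommitStatus rob_commit(Rob *rob, int regs[NUM_REGS], unsigned *exception_pc)
{
    if (rob->count == 0)
        return COMMIT_EMPTY;

    RobEntry *h = &rob->e[rob->head];
    if (!h->ready)
        return COMMIT_WAIT;              /* result not obtained yet */

    if (h->exception) {
        *exception_pc = h->pc;           /* restart address for the handler */
        rob->count = 0;                  /* clear all younger entries */
        return COMMIT_EXCEPTION;
    }

    regs[h->dreg] = h->result;           /* update the processor state */
    rob->head = (rob->head + 1) % ROB_SIZE;
    rob->count--;
    return COMMIT_OK;
}
```

Because only the head may commit, an instruction that finished early out of order cannot change the registers before every older instruction has committed, so the state seen by the exception handler is precise.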

81 Operand bypass and supply of source operand tag. Tomasulo: operand values are obtained from the registers, which hold the latest values. Reorder buffer: the latest values are stored in the reorder buffer, not in the registers. Procedure for obtaining operands: (1) Check for dependences on instructions decoded concurrently. If there is a dependence, the stag becomes the dtag of the instruction depended on. (2) Otherwise, the reorder buffer is searched by source register number to obtain the value (when R=1) or the tag (when R=0). (3) If the reorder buffer has neither a value nor a tag corresponding to the register number, the value is obtained from the registers.

82 An example of reorder buffer. The superscalar processor used in this example has the following 6-stage pipeline and is 2-way. IF: fetches 2 instructions. ID: decodes, allocates tags, and dispatches. RS: waits for operands until the instruction becomes ready. EX: executes the instruction. WB: writes results to the reorder buffer. RT: writes results to registers.

83 A code used in the example (the second column is the address of the instruction). i1: 0x40: r1 = load A (r0); i2: 0x44: r2 = r1 + r3; i3: 0x48: r2 = r2 + 16; i4: 0x4C: r5 = load 0 (r1); i5: 0x50: r1 = r1 + 1; i6: 0x54: r2 = load 0 (r2).

84 Cycle 0 opDestinationSource 1Source 2 E dtagval RstagvalRstagval Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage State of instructions pointerenrtyPCRdregdtagEresult h/t Reorder buffer

85 Cycle 1 opDestinationSource 1Source 2 E dtagval RstagvalRstagval Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage IF State of instructions pointerenrtyPCRdregdtagEresult H/T Reorder buffer

86 Cycle 2 opDestinationSource 1Source 2 E dtagval RstagvalRstagval loadX20X1XA1X0 addX21X020X1X7 Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage ID IF State of instructions pointerenrtyPCRdregdtagEresult Head XX XX Tail Reorder buffer

87 Cycle 3 opDestinationSource 1Source 2 E dtagval RstagvalRstagval load020151XA1X0 addX21X020X1X7 addX22X021X1X16 loadX23X1X0020X Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage EX RS ID IF State of instructions pointerenrtyPCRdregdtagEresult Head XX XX XX 234C0523XX Tail24 25 Reorder buffer

88 Cycle 4 opDestinationSource 1Source 2 E dtagval RstagvalRstagval load020151XA1X0 add X7 addX22X021X1X16 load123?1X addX24X1X151X1 loadX25X1X0022X Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage WB EX RS EX ID State of instructions pointerenrtyPCRdregdtagEresult Head XX XX 234C0523XX XX XX Reorder buffer Tail

89 Cycle 5 opDestinationSource 1Source 2 E dtagval RstagvalRstagval add X7 add X16 load123?1X add024161X151X1 loadX25X1X0022X Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage RT WB EX WB EX RS State of instructions pointerenrtyPCRdregdtagEresult Head XX 234C15231? XX XX Reorder buffer Tail

90 Cycle 6 opDestinationSource 1Source 2 E dtagval RstagvalRstagval add X16 add024161X151X1 load02541X Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage RT WB RT WB EX State of instructions pointerenrtyPCRdregdtagEresult Head C15231? XX Reorder buffer Tail

91 Cycle 7 opDestinationSource 1Source 2 E dtagval RstagvalRstagval load02541X Instruction i1:r1=load A(r0) i2:r2= r1+r3 i3:r2=r2+16 i4:r5=load 0(r1) i5:r1=r1+1 i6:r2=load 0(r2) Stage RT WB State of instructions pointerenrtyPCRdregdtagEresult H/T234C15231? XX Reorder buffer Exception is detected.

92 VLIW (Very Long Instruction Word). In a VLIW processor, the compiler extracts the parallelism in the code. Therefore, the special hardware support used in a superscalar processor becomes unnecessary. Superscalar: dynamic scheduling by hardware support. VLIW: static scheduling by compiler.

93 Overview of VLIW compiler processor main(){ ・・・ ・・・ ・・・ ・・・ } add sub ・・・ code gen scheduling execution main(){ ・・・ ・・・ ・・・ ・・・ } add sub ・・・ code gen add sub load add mul load ・・・ scheduling Superscalar VLIW

94 VLIW code i1: r3=r4+1 i2: r1=load(r2) i3: r1=r1<

95 ・・・・・ Hardware organization of VLIW ALU MEM Branch Registers ・・・・・ Instruction cache Data cache

96 VLIW vs Superscalar
                            Superscalar    VLIW
    Hardware size           Large          Small
    Hardware complexity     Large          Small
    Scheduling algorithm    Poor           Rich
    Instruction window      Small          Large
    Binary compatibility    Compatible     Not compatible

97 Dynamic vs Static schedluing i1: r1=load A i2: r2=load(r1) i3: r3=load B i4: r4=r3<

98 Advantages of dynamic scheduling. Scheduling can use information that is available only at run time; for example, the latency of a cache miss can be hidden. Scheduling can use accurate memory dependences: data addresses, which are known only at run time, improve scheduling performance.

99 Taxonomy of scheduling algorithm Local scheduling Global scheduling Cyclic scheduling Acyclic scheduling Trace-based scheduling DAG-based (Directed acyclic graph) scheduling

100 VLIW-based commercial processors. Transmeta Crusoe: aimed at mobile computing. Texas Instruments TMS320C6x series: embedded applications. Intel Itanium.

101 Parallel operation by SIMD. What is SIMD? SIMD (Single Instruction Multiple Data) means that the same operation is applied to several operands. Ex: addition. int i; int a[4]={1,2,3,4}; int b[4]={5,6,7,8}; int c[4]; for (i=0;i<4;i++){ c[i]=a[i]+b[i]; } Sequential operation computes c[0]=a[0]+b[0], c[1]=a[1]+b[1], c[2]=a[2]+b[2], c[3]=a[3]+b[3] one after another; SIMD computes the four additions with one instruction.

102 SIMD data types (Cell/B.E.)
    vector unsigned char        16 unsigned 8-bit values
    vector signed char          16 signed 8-bit values
    vector unsigned short       8 unsigned 16-bit values
    vector signed short         8 signed 16-bit values
    vector unsigned int         4 unsigned 32-bit values
    vector signed int           4 signed 32-bit values
    vector unsigned long long   2 unsigned 64-bit values
    vector signed long long     2 signed 64-bit values
    vector float                4 32-bit floating-point values
    vector double               2 64-bit double (floating-point) values

103 Allocation of vector values Vector values are allocated to memory in the big-endian style as shown in the following figure. *This figure is adapted from cell.fixstars.com

104 How to access vector type via normal pointer vector signed int va = (vector signed int) { 1, 2, 3, 4 }; int *a = (int *) &va; *This figure is adapted from cell.fixstars.com

105 How to access a normal array from vector type int a[8] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8 }; vector signed int *va = (vector signed int *) a; *This figure is adapted from cell.fixstars.com __attribute__((aligned(16))) forces scalar data to be 16 byte-aligned

106 SIMD operation on PPE int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 }; int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 }; int c[4] __attribute__((aligned(16))); vector signed int *va = (vector signed int *) a; vector signed int *vb = (vector signed int *) b; vector signed int *vc = (vector signed int *) c; *vc = vec_add(*va, *vb); vec_add is a SIMD function provided by VMX (Vector Multimedia Extension), proposed by IBM and Motorola.

107 Entire code for vector addition
    #include <stdio.h>
    #include <altivec.h>   /* VMX functions such as vec_add */
    int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
    int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };
    int c[4] __attribute__((aligned(16)));
    int main(int argc, char **argv)
    {
        /* View the 16-byte-aligned int arrays as vectors of 4 ints. */
        vector signed int *va = (vector signed int *) a;
        vector signed int *vb = (vector signed int *) b;
        vector signed int *vc = (vector signed int *) c;
        /* One SIMD instruction adds all 4 elements. */
        *vc = vec_add(*va, *vb);
        printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

108 A part of the VMX functions
    Arithmetic operation   vec_add(a,b)       a+b
                           vec_sub(a,b)       a-b
                           vec_madd(a,b,c)    a*b+c
    Logical operation      vec_and(a,b)       logical and
                           vec_or(a,b)        logical or
    Bit operation          vec_perm(a,b,c)    creates a new vector from a[i] and b[i] based on c[i]
                           vec_sel(a,b,c)     selects a[i] or b[i] based on c[i]
    Branch                 vec_cmpeq(a,b)     a[i]==b[i]
                           vec_cmpgt(a,b)     a[i]>b[i]
    Type conversion        vec_ctf(a,b)       (float)a[i]/(2^b)
                           vec_ctu(a,b)       (unsigned int)(a[i]*(2^b))
    Generating constant    vec_splat(a,b)     a[b] in every element
                           vec_splat_s32(a)   signed constant a in every element

109 How to create dense vector data. In general, vector data is not densely stored. Therefore, dense vector data must be created before a vector operation. vc = vec_perm(va, vb, vpat); *This figure is adapted from cell.fixstars.com

110 Ex of vec_perm: Transpose *These figures are adapted from cell.fixstars.com

111 Branch on SIMD *These figures are adapted from cell.fixstars.com

112 Procedure of SIMD Branch *These figures are adapted from cell.fixstars.com

113 Detail of SIMD Branch vec_cmpgt() vec_sel() *These figures are adapted from cell.fixstars.com

114 Ex of SIMD Branch
    int a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
    int b[16] = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    int c[16];
    int i;
    for (i = 0; i < 16; i++) {
        if (a[i] > b[i]) {
            c[i] = a[i] - b[i];
        } else {
            c[i] = b[i] - a[i];
        }
    }

115 Ex of SIMD Branch
    int a[16] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };
    int b[16] __attribute__((aligned(16))) = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    int c[16] __attribute__((aligned(16)));
    vector signed int *va = (vector signed int *) a;
    vector signed int *vb = (vector signed int *) b;
    vector signed int *vc = (vector signed int *) c;
    vector signed int vc_true, vc_false;
    vector unsigned int vpat;
    int i;
    for (i = 0; i < 4; i++) {
        vpat = vec_cmpgt(va[i], vb[i]);            /* mask: a > b per element */
        vc_true = vec_sub(va[i], vb[i]);           /* result of the "if" side */
        vc_false = vec_sub(vb[i], va[i]);          /* result of the "else" side */
        vc[i] = vec_sel(vc_false, vc_true, vpat);  /* pick per element by mask */
    }
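
The compare-and-select pattern can also be followed in plain C, without VMX hardware. The sketch below emulates 4-lane vectors with structs; `Vec4`, `Mask4`, `cmpgt`, `sub`, `sel`, and `abs_diff` are hypothetical stand-ins that mimic the behaviour of vec_cmpgt, vec_sub, and vec_sel and are not part of any VMX header.

```c
#include <stdint.h>

#define LANES 4

typedef struct { int32_t v[LANES]; } Vec4;    /* emulated vector signed int */
typedef struct { uint32_t v[LANES]; } Mask4;  /* emulated comparison mask */

/* Like vec_cmpgt: all bits set in a lane where a > b, zero otherwise. */
Mask4 cmpgt(Vec4 a, Vec4 b)
{
    Mask4 m;
    for (int i = 0; i < LANES; i++)
        m.v[i] = (a.v[i] > b.v[i]) ? 0xFFFFFFFFu : 0u;
    return m;
}

/* Like vec_sub: lane-wise a - b. */
Vec4 sub(Vec4 a, Vec4 b)
{
    Vec4 r;
    for (int i = 0; i < LANES; i++)
        r.v[i] = a.v[i] - b.v[i];
    return r;
}

/* Like vec_sel(a, b, m): bits of b where m is 1, bits of a where m is 0. */
Vec4 sel(Vec4 a, Vec4 b, Mask4 m)
{
    Vec4 r;
    for (int i = 0; i < LANES; i++)
        r.v[i] = (int32_t)(((uint32_t)a.v[i] & ~m.v[i]) |
                           ((uint32_t)b.v[i] & m.v[i]));
    return r;
}

/* Branch-free |a - b|: compute both sides, then select per lane. */
Vec4 abs_diff(Vec4 a, Vec4 b)
{
    Mask4 m = cmpgt(a, b);
    return sel(sub(b, a), sub(a, b), m);
}
```

Both sides of the branch are always computed and the mask decides which result survives in each lane, so no lane ever takes a conditional jump; this is the trade made by SIMD branching.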

