Achieving over 50% system speedup with custom instructions and multi-threading
Kaiming Ho, Fraunhofer IIS (kaiming.ho@iis.fraunhofer.de)
June 3rd, 2014

Overview
Introduction and system description
Motivation for work
Optimization approach
  – using user defined instructions (UDI)
  – using multi-threading (MT)
Results
Concluding remarks

Video Encoder System (Overview)
[Block diagram: video in (1080p30); dedicated hardware; DDR memory; memory; MIPS processor running s/w; ethernet out (1000 Mbps)]
Ethernet output:
  – encoded byte stream (IP/UDP/RTP)
  – statistics (IP/UDP)
Byte dump shown on the slide: ff 4c ff 51 00 2f 00 00 07 80 00 04 38 00 ff 93 f3 b6...

Overview of software
Main software is partitioned into three parts.
  – Each part must finish before the next starts.
  – from h/w -> PART1 (rate optimization) -> PART2 (codestream formation) -> PART3 (output to network) -> DONE
Timestamps are added to measure how long each part takes.
Add up the time for all three parts to get the performance metric.
  – convert absolute time to frames/sec. (33.33ms -> 30fps)
s/w is also instrumented to count instructions.
  – can calculate instr./cycle (IPC)
h/w delivers input at 30 fps. Analyze the rate at which s/w is done.
  – visualize in GUI
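The conversion from measured time to frame rate described on this slide can be sketched in C as follows. The function names are illustrative and do not come from the slides. The sketch assumes only that the three part times are given in milliseconds and summed.

```c
#include <assert.h>

/* Sum the measured times of the three parts (all in milliseconds). */
static double frame_time_ms(double part1_ms, double part2_ms, double part3_ms)
{
    return part1_ms + part2_ms + part3_ms;
}

/* Convert a per-frame processing time into the frame rate the s/w can sustain. */
static double frames_per_second(double frame_ms)
{
    assert(frame_ms > 0.0);
    return 1000.0 / frame_ms;
}
```

A total of 33.33 ms corresponds to about 30 fps, as on the slide. Any total above 33.33 ms means the s/w cannot keep up with the 30 fps input.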

Visualization GUI
Performance before all optimizations

Optimization approach
1. Identify functional hot-spots which can be replaced by user-defined custom instructions (UDI).
  – Base instruction set is extended.
  – One custom instruction replaces many instructions from the base ISA.
  – Highest impact when the number of instructions replaced is high and the function is called often.
2. Use multi-threading (MT) to run all three parts simultaneously.
  – Stalls in the execution pipeline reduce instructions/cycle (IPC).
  – When one thread stalls, attempt to schedule an instruction from another thread.
  – Increases effective IPC.

Using User-defined instructions (UDI)
MIPS UDI allows complex functions to be implemented in a single custom instruction.
  – ISA is extended to include new custom instructions.
  – Fully supported in compiler tool-chain.
Instructions take the form:
    reg_result = custom_udi(reg_src1, reg_src2);
  – Two 32-bit source operands (both optional) and one 32-bit result (also optional).
  – Typical RISC style.
  – Instructions can be pure (no side-effects), or can update internal state.
Instructions are likely domain specific.

UDI Examples (1)
Bit accumulation, with zero-stuffing.
  – hard for 32-bit processor to do.
Bits are pushed into an accumulator. When eight 1’s in a row occur, an extra “0” is added.
Data is popped out 16/32 bits at a time.
Instructions shown on the slide:
    bitwr_push 0x1f2, 10
    bitwr_push 0xfd, 8
    bitwr_getlen r10        (r10 <= 19)
    bitwr_pop16 r11         (r11 <= 0xecff)
    bitwr_push 0x17ffd, 18
[Figure: accumulator state after each instruction]
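A minimal software model of such a bit accumulator might look like the sketch below. The slide text does not specify the exact hardware behavior, so several details are assumptions: bits are pushed most-significant first, a 0 is stuffed after any run of eight 1 bits regardless of byte alignment, and the accumulator is 64 bits wide. The function names mirror the UDI mnemonics only for readability. Because of these assumptions, the model is not expected to reproduce the register values shown on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of a zero-stuffing bit accumulator. */
typedef struct {
    uint64_t acc;   /* pending bits, oldest bit in the highest used position */
    unsigned len;   /* number of valid bits in acc */
    unsigned ones;  /* length of the current run of consecutive 1 bits */
} bitwr_t;

/* Push the low nbits of value, MSB first (assumed order), stuffing a 0
 * after every run of eight 1 bits. */
static void bitwr_push(bitwr_t *w, uint32_t value, unsigned nbits)
{
    assert(nbits <= 32);
    assert(w->len + nbits + nbits / 8 + 1 <= 64);  /* room incl. stuffed bits */
    for (int i = (int)nbits - 1; i >= 0; i--) {
        unsigned bit = (value >> i) & 1u;
        w->acc = (w->acc << 1) | bit;
        w->len++;
        w->ones = bit ? w->ones + 1 : 0;
        if (w->ones == 8) {        /* eight 1's in a row: add an extra 0 */
            w->acc <<= 1;
            w->len++;
            w->ones = 0;
        }
    }
}

/* Number of bits currently held, including stuffed bits. */
static unsigned bitwr_getlen(const bitwr_t *w)
{
    return w->len;
}

/* Remove and return the 16 oldest bits. */
static uint16_t bitwr_pop16(bitwr_t *w)
{
    assert(w->len >= 16);
    w->len -= 16;
    uint16_t out = (uint16_t)(w->acc >> w->len);
    w->acc &= (UINT64_C(1) << w->len) - 1;  /* keep only the remaining bits */
    return out;
}
```

Under these assumptions, pushing ten 1 bits yields a length of 11, because one 0 is stuffed after the eighth 1. The base ISA needs a loop with shifts, masks and a run counter for this, which is the work a single custom instruction would replace.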

UDI Examples (2)
FIFO pointer management.
  – not domain specific. Could find use in multiple applications.
s/w writes one word at a time
  – check for buffer full
  – handle wraparound
[Figure: ring buffer with ring_start, ring_end, rd_ptr, wr_ptr]

Internal state:
    #include <stddef.h>   /* NULL */

    struct {
        unsigned *ring_start;
        unsigned *ring_end;
        unsigned *wr_ptr;
        unsigned *rd_ptr;
    } FIFO_PTR;

    /* Advance the write pointer. Returns the slot to write to,
       or NULL if the FIFO is full. */
    unsigned *FIFO_PTR_INC_WP(void)
    {
        unsigned *retval, *next_wp;
        next_wp = retval = FIFO_PTR.wr_ptr;
        // increment and wrap
        next_wp += 1;
        if (next_wp == FIFO_PTR.ring_end)
            next_wp = FIFO_PTR.ring_start;
        // check for full
        if (next_wp == FIFO_PTR.rd_ptr)
            return NULL;
        FIFO_PTR.wr_ptr = next_wp;
        return retval;
    }

Usage:
    ptr = FIFO_PTR_INC_WP();
    if (ptr) *ptr = data;

FIFO_PTR_INC_WP() reduced to one atomic UDI:
    PC: bfc059fc  UDI r3                // inc_wp
    PC: bfc05a00  BEQZ r3, 0xbfc05ac8
    PC: bfc05a04  NOP
    PC: bfc05a08  SW r3, 0(r3)
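The slide shows only the write side. The sketch below adds a possible read-side counterpart so the ring buffer can be exercised end to end. The names `fifo_ptr_t`, `fifo_init`, `fifo_inc_wp` and `fifo_inc_rp` are hypothetical, and the read-side logic is an assumption made by symmetry with the write side, not the author's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Same four pointers as the slide's FIFO_PTR, wrapped in a type. */
typedef struct {
    unsigned *ring_start;
    unsigned *ring_end;    /* one past the last word of the ring */
    unsigned *wr_ptr;
    unsigned *rd_ptr;
} fifo_ptr_t;

static void fifo_init(fifo_ptr_t *f, unsigned *buf, size_t nwords)
{
    assert(nwords >= 2);   /* one slot always stays empty */
    f->ring_start = buf;
    f->ring_end = buf + nwords;
    f->wr_ptr = f->rd_ptr = buf;
}

/* Write side, same logic as the slide: returns the slot to write to,
 * or NULL if the FIFO is full. */
static unsigned *fifo_inc_wp(fifo_ptr_t *f)
{
    unsigned *retval = f->wr_ptr;
    unsigned *next_wp = retval + 1;
    if (next_wp == f->ring_end)
        next_wp = f->ring_start;   /* wrap */
    if (next_wp == f->rd_ptr)
        return NULL;               /* full */
    f->wr_ptr = next_wp;
    return retval;
}

/* Read side (assumed): returns the slot to read from,
 * or NULL if the FIFO is empty. */
static unsigned *fifo_inc_rp(fifo_ptr_t *f)
{
    unsigned *retval = f->rd_ptr;
    if (retval == f->wr_ptr)
        return NULL;               /* empty */
    unsigned *next_rp = retval + 1;
    if (next_rp == f->ring_end)
        next_rp = f->ring_start;   /* wrap */
    f->rd_ptr = next_rp;
    return retval;
}
```

With this full/empty convention a ring of N words holds at most N-1 entries, since `wr_ptr == rd_ptr` is reserved to mean empty.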

UDI savings

UDI name              cycles saved   instr. saved   freq. of use   overall
                      (per use)      (per use)      (per frame)    speedup
BIT WRITE (push)      46-161         29-77          20889          22%
BIT WRITE (get_len)   46-108         24-48          4185
BIT WRITE (pop)       31-82          16-42          3288
FIFO PTR (inc wp)     39-101         22-46          3288           1.9%
FIFO PTR (inc rp)                                   1689

[Figure: cycle count and instr. count for one example, with segments of 13 instr, 38 cyc. and 34 instr, 57 cyc.]
Two UDI replace 47 standard instructions, taking 95 cycles. UDI does not stall.
Amount saved is dependent on input.
  – # standard instructions is variable.
  – With UDI, always 2 instructions.
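As a rough cross-check of the table, per-frame cycle savings can be bounded by multiplying each row's per-use range by its frequency of use. The sketch below does this for the three BIT WRITE rows. It assumes every use saves somewhere within the listed range and that the rows simply add up; the type and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t cyc_min, cyc_max;   /* cycles saved per use (range from the table) */
    uint32_t uses_per_frame;     /* frequency of use per frame */
} udi_row_t;

/* Values copied from the BIT WRITE rows of the table. */
static const udi_row_t BIT_WRITE_ROWS[] = {
    { "BIT WRITE (push)",    46, 161, 20889 },
    { "BIT WRITE (get_len)", 46, 108,  4185 },
    { "BIT WRITE (pop)",     31,  82,  3288 },
};

/* Lower and upper bound on cycles saved per frame over n rows. */
static void cycles_saved_per_frame(const udi_row_t *rows, int n,
                                   uint64_t *lo, uint64_t *hi)
{
    assert(rows && lo && hi && n >= 0);
    *lo = 0;
    *hi = 0;
    for (int i = 0; i < n; i++) {
        *lo += (uint64_t)rows[i].cyc_min * rows[i].uses_per_frame;
        *hi += (uint64_t)rows[i].cyc_max * rows[i].uses_per_frame;
    }
}
```

For the three rows this gives roughly 1.26 to 4.08 million cycles per frame. The push instruction dominates because it is used far more often than the others.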

Performance gain from UDI
Before: 83.72ms
After: 62.76ms
Savings: 20.96ms (25%)

multi-threading (1)
Instructions/cycle (IPC) is a measure of efficiency in the CPU execution pipeline.
  – stalls due to cache misses, multi-cycle instructions, branch penalties, etc. decrease IPC.
A CPU working in multi-threaded mode attempts to schedule instructions from a different thread when one stalls.
  – increases effective IPC
Programs with low IPC in single-threaded mode benefit most from multi-threading.
Representative execution statistics of our program gathered in the lab:
    part1:    3056 cyc,    1587 instr.
    part2: 4597034 cyc, 1954337 instr.
    part3: 2454570 cyc,  816940 instr.
    total: 7054660 cyc, 2772864 instr.
avg. IPC is 0.393
avg. IPC is low!! Expect MT to have significant impact.
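The average IPC quoted on this slide follows directly from the cycle and instruction counts. A minimal sketch of the calculation, with illustrative type and function names:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t cycles;
    uint64_t instrs;
} part_stats_t;

/* Counts for part1..part3 as listed on the slide. */
static const part_stats_t PARTS[] = {
    {    3056,    1587 },
    { 4597034, 1954337 },
    { 2454570,  816940 },
};

/* Add up cycles and instructions over n parts. */
static part_stats_t stats_total(const part_stats_t *p, int n)
{
    part_stats_t t = { 0, 0 };
    for (int i = 0; i < n; i++) {
        t.cycles += p[i].cycles;
        t.instrs += p[i].instrs;
    }
    return t;
}

/* Instructions per cycle. */
static double ipc(part_stats_t s)
{
    assert(s.cycles > 0);
    return (double)s.instrs / (double)s.cycles;
}
```

The totals come out to 7054660 cycles and 2772864 instructions, giving an IPC of about 0.393 as stated. Note that the average is the ratio of the totals, not the mean of the per-part IPC values, so the long-running parts dominate it.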

multi-threading (2)
Execution of our program (in ST) over time:
[Timeline: for each frame (frame1, frame2, frame3), part1, part2 and part3 run back to back, measured against the 30fps frame period]
  – Too slow. The 30fps time budget is overrun.
With MT, each part runs in its own thread, and the threads are interleaved:
[Timeline: thread1, thread2 and thread3 run part1, part2 and part3 overlapped in time]
  – When overlapped, the threads run slightly slower. Lower individual IPC.
  – Overall effect is still better performance. Higher effective IPC.

Multi-threading and IRQ handling
Traditional ST programs get interrupted when external IRQs are asserted.
  – The ‘normal’ program is suspended while the IRQ handler runs.
When MT programs are architected the same way, ALL threads are interrupted when an IRQ occurs.
  – On IRQ, CPU goes to exception level and MT is effectively turned off.
  – Very inefficient. When the IRQ handler stalls, cycles are wasted.
Our program takes many interrupts. (175k / sec.)
Different approach:
  – IRQ handler is given its own thread.
  – Assertion of an IRQ does not cause a CPU interrupt. It wakes up the thread with the IRQ handler.
  – When the IRQ handler runs, it is scheduled simultaneously with other threads in the system.
  – No IRQ overhead.
  – CPU never goes to exception level.

Performance gain from MT
Original performance: 83.72ms
With UDI and MT: 43.37ms
[Chart: ST vs. MT, marked 45%]

Discussion of Results
Adding UDI decreases #instr. and IPC.
  – custom instructions are part of multiplier pipeline.
When MT is used, same # instr. takes longer.
  – IPC of individual threads is lower.
  – Overall IPC (performance) is higher.
Lower IPC in ST means greater gain from ST->MT.
Frequency of CPU does not matter.
  – Our application is not I/O or memory bound.

ST/noUDI (111MHz): 86.6ms. IPC 42.42%
        cyc.         instr.     IPC
    p1: 2*1126       1130       50.17%
    p2: 2*3301726    2967154    44.95%
    p3: 2*1508140    1114213    37.01%

ST/UDI (111MHz): 65.4ms. IPC 39.39%
        cyc.         instr.     IPC
    p1: 2*1126       1130       50.17%
    p2: 2*2118058    1741554    41.17%
    p3: 2*1508130    1114291    37.01%

MT/noUDI (111MHz): 68.6ms. MT gain: 26%
        cyc.         instr.     IPC
    p1: 2*1458       1125       38.58%
    p2: 2*3745384    2967201    39.61%
    p3: 2*1508443    1080524    35.89%

MT/UDI (111MHz): 43.8ms. MT gain: 49%
        cyc.         instr.     IPC
    p1: 2*1973       1125       28.50%
    p2: 2*2435277    1741563    35.76%
    p3: 2*1508548    1078515    35.83%

ST/UDI/rate_alloc (111MHz): 89.5ms. IPC 35.22%
        cyc.         instr.     IPC
    p1: 2*1339904    639013     23.84%
    p2: 2*2115672    1741554    41.17%
    p3: 2*1508090    1113907    37.00%

MT/UDI/rate_alloc (111MHz): 57.3ms. (34/30/32) MT gain: 56%
        cyc.         instr.     IPC
    p1: 2*1531915    639041     19.25%
    p2: 2*3187194    1741574    27.34%
    p3: 2*2249536    1057951    23.56%

Adding extra processing with memory accesses and FPU decreases IPC. Effect of MT is enhanced.
Also marked on the slide: 98%
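The slide does not say how its percentages are defined. One reading that reproduces the marked values from the listed frame times is the relative throughput gain, t_before / t_after - 1. This is an inference from the numbers, not something the slides state, and the helper names below are illustrative.

```c
#include <assert.h>

/* Relative throughput gain in percent when the frame time drops
 * from before_ms to after_ms (assumed definition, see lead-in). */
static double gain_percent(double before_ms, double after_ms)
{
    assert(before_ms > 0.0 && after_ms > 0.0);
    return (before_ms / after_ms - 1.0) * 100.0;
}

/* Round a non-negative value to the nearest integer without <math.h>. */
static int round_nonneg(double x)
{
    assert(x >= 0.0);
    return (int)(x + 0.5);
}
```

With the frame times from this slide, 86.6 ms to 68.6 ms gives 26%, 65.4 ms to 43.8 ms gives 49%, and 89.5 ms to 57.3 ms gives 56%. Under the same definition, going from ST/noUDI (86.6 ms) to MT/UDI (43.8 ms) gives 98%, which would account for the remaining figure. By contrast, a time saving expressed as a share of the original time, (t_before - t_after) / t_before, yields smaller numbers for the same measurements.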

Concluding Remarks
Over 50% improvement in performance was obtained by using two simple techniques:
  – Use of custom user-defined instructions (UDI).
  – Use of multi-threading (MT) technology.
UDI reduces the number of instructions executed. Consistently saves 20-25%.
  – Easy to implement compared to dedicated h/w design.
  – Man-weeks of work vs. man-years.
Benefit of MT is more variable.
  – Between 26% and 49% has been measured.
  – Depends on operating point: image complexity, IPC of application.
  – Heavily loaded systems benefit more.
  – Memory or I/O bound applications benefit more.

THANK YOU!!!
