WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.


1 WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9

2 ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES
INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)
64-bit datapath: 64-bit addressing and high precision computing
[Figure: 64-bit adder; labels: 64 bit, 64 bit]

3 ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES
INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)
64-bit datapath: 64-bit addressing and high precision computing
[Figure: 16-bit adder; labels: 64 bit, 64 bit]

4 ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES
INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)
16-bit integer datapath
64-bit addressing and high precision computing
40% of computations need only a 16-bit datapath
Caveat: a 64-bit computation becomes 8 * 16-bit computations (DBT)
[Figure: 16-bit adder]
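The caveat on this slide can be illustrated with a minimal sketch of a 64-bit addition built from 16-bit additions. The function names are hypothetical and the code is not taken from the talk. It uses four limb additions with a carry chain; the slide counts 8 narrow operations, and the transcript does not give the breakdown behind that number.

```python
MASK16 = 0xFFFF  # one 16-bit limb


def split_limbs(value, limbs=4):
    """Split a 64-bit unsigned integer into 16-bit limbs, least significant first."""
    return [(value >> (16 * i)) & MASK16 for i in range(limbs)]


def add64_with_16bit_adder(a, b):
    """Add two 64-bit unsigned values using only 16-bit additions and a carry bit."""
    result = 0
    carry = 0
    for i, (x, y) in enumerate(zip(split_limbs(a), split_limbs(b))):
        s = x + y + carry          # one narrow addition, including the carry-in
        carry = s >> 16            # carry-out feeds the next limb
        result |= (s & MASK16) << (16 * i)
    return result                  # the final carry is dropped: wraps modulo 2**64
```

A value that already fits in 16 bits only needs the first limb, which is the case the 40% figure on the slide refers to.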

5 ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES
INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)
What does non-productive mean?
[Figure: 0 x x x]



8 ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES
INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)
Contributions and conclusions
1. Narrow ISA offers more opportunities to remove non-productive memory operations
2. 50% of dynamic narrow operations are non-productive
3. Memory Productiveness Pruning: profile-guided, dynamic optimization

9 ENERGY EFFICIENT CODE GENERATION FOR PROCESSORS WITH EXPOSED DATAPATH
DONGRUI SHE, YIFAN HE, BART MESMAN, HENK CORPORAAL (TUE)
Exposed datapath: software controls every movement in the data path
Example: transport-triggered architecture (Henk Corporaal)
Register file access reduction

10 REGISTER REUSE SCHEDULING
GERGÖ BARANY
Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation
Motivation: Spill code generated by the compiler has a crucial effect on program performance
Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)
Results: 8.9% less spilling, 3.4% smaller static spill costs
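The method on this slide (constraining the schedule by adding arcs to the DDG) can be sketched as follows. This is a minimal illustration, not the paper's algorithm. The graph representation and the names `has_path` and `add_reuse_arc` are assumptions. The idea shown is that an arc from the last use of one value to the definition of another forces the first lifetime to end before the second begins, so both values can share a register, provided the new arc does not create a cycle.

```python
def has_path(ddg, src, dst):
    """Return True if dst is reachable from src in the dependence graph."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(ddg.get(node, ()))
    return False


def add_reuse_arc(ddg, last_use, next_def):
    """Add the arc last_use -> next_def unless it would make the DDG cyclic.

    Returns True if the arc was added, False if it was rejected.
    """
    if has_path(ddg, next_def, last_use):
        return False  # the reverse ordering is already required
    ddg.setdefault(last_use, set()).add(next_def)
    ddg.setdefault(next_def, set())
    return True


# Hypothetical basic block: two independent values a and b.
ddg = {"def_a": {"use_a"}, "use_a": set(), "def_b": {"use_b"}, "use_b": set()}
first = add_reuse_arc(ddg, "use_a", "def_b")   # b may now reuse a's register
second = add_reuse_arc(ddg, "use_b", "def_a")  # conflicts with the first decision
```

Any scheduler that respects the extended graph then produces an order in which the chosen lifetimes cannot overlap.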

11 Register Allocation and spilling
REGISTER REUSE SCHEDULING
[Figure labels: Virtual registers, Physical registers, Memory]

12 Register Allocation with reuse candidates
REGISTER REUSE SCHEDULING
[Figure labels: basic block, interference graph (definitely overlap, definitely NO overlap, possible overlap), data dependence graph]

13 Register Allocation with reuse candidates
REGISTER REUSE SCHEDULING

14 DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
MOUNIRA BACHIR, SID-AHMED-ALI TOUATI, ALBERT COHEN
Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)
Motivation: Code size is related to memory requirements and I-cache performance
Method: Strategically insert move operations, without increasing II, to split meeting graph components into smaller ones
Results: Good if there are enough functional units to perform the additional move operations; acceptable execution time

15 Periodic Register Allocation
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
Rotating Register File

16 Periodic Register Allocation
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
Rotating Register File
Move operations: d-1 MOVs per iteration (d: iteration span of variables)

17 Periodic Register Allocation
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
Rotating Register File
Move operations
Loop unrolling: 3 * code size

18 Periodic Register Allocation
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
Rotating Register File
Move operations
Loop unrolling
Modulo Variable Expansion: using 9 registers instead of 8 (MAXLIVE = 8)
[Figure: a[i] b[i] c[i], a[i+1] b[i+1] c[i+1], a[i+2] b[i+2] c[i+2]]
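The register counts on this slide can be reproduced with a small sketch of modulo variable expansion in its simplest form, where the loop is unrolled by the largest number of overlapping copies of any variable and every variable receives that many registers. The lifetimes below (3, 3 and 2 cycles at II = 1) are assumed values chosen to match the slide's figures; they do not come from the talk, and the function names are hypothetical.

```python
def mve_unroll_factor(lifetimes, ii):
    """Unroll factor: the largest ceil(lifetime / II) over all variables."""
    return max(-(-lt // ii) for lt in lifetimes.values())


def mve_register_count(lifetimes, ii):
    """Simplest MVE variant: each variable gets one register per unrolled copy."""
    return len(lifetimes) * mve_unroll_factor(lifetimes, ii)


# Assumed lifetimes in cycles for the three variables shown in the figure.
lifetimes = {"a": 3, "b": 3, "c": 2}
ii = 1

# With II = 1 a new iteration starts every cycle, so each variable has
# 'lifetime' live copies at any time and MAXLIVE is the sum of the lifetimes.
maxlive = sum(lifetimes.values())
unroll = mve_unroll_factor(lifetimes, ii)
registers = mve_register_count(lifetimes, ii)
```

Under these assumptions the kernel is unrolled 3 times and uses 9 registers, one more than MAXLIVE = 8, because variable c is rounded up to 3 copies.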

19 Periodic Register Allocation
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
Rotating Register File
Move operations
Loop unrolling
Modulo Variable Expansion
Meeting Graph
[Figure labels: lifetime in cycles; lifetime interval of c ends when interval of b begins]
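Why the shape of the meeting graph circuits matters for code size can be sketched with a short calculation. It assumes the usual formulation from the meeting-graph literature, which the transcript itself does not state: a circuit whose lifetimes add up to mu * II uses mu registers, and the kernel must be unrolled by the least common multiple of the mu values. The circuit weights below are hypothetical.

```python
from functools import reduce
from math import gcd


def unroll_degree(circuit_weights):
    """Least common multiple of the circuit weights (expressed in units of II)."""
    return reduce(lambda x, y: x * y // gcd(x, y), circuit_weights, 1)


def register_count(circuit_weights):
    """Each circuit of weight mu uses mu registers."""
    return sum(circuit_weights)


single = [8]        # one circuit covering all 8 registers
uneven = [5, 3]     # same register count, larger unrolling
balanced = [4, 4]   # same register count, smaller unrolling

degrees = [unroll_degree(w) for w in (single, uneven, balanced)]
```

All three decompositions use 8 registers, but the unroll degree differs, which is the quantity the talk aims to minimise by splitting circuits with inserted move operations.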

20 Meeting Graph
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
[Figure: a[i] b[i] c[i] through a[i+7] b[i+7] c[i+7]]

21 Circuit Decomposition
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

22 2011 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION Main Conference

23 MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER
ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)
Micro-architectural: not always documented
Proprietary compilers at advantage!
[Figure: SPEC2000 int loop, without and with NOP: + 1 NOP instruction, - 7% execution time]

24 MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER
ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)
Micro-architectural: not always documented
Example: instruction decoding in Core 2 in chunks of 16 bytes
[Figure: SPEC2000 int loop, without and with NOP, relative to the 16-byte alignment boundary]
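The interaction between code placement and 16-byte decode chunks can be sketched with two small helpers. They are hypothetical and do not reflect MAO's actual passes or API. They only show how one might check whether a code sequence straddles a chunk boundary and how many one-byte NOPs would move it to the next boundary.

```python
def crosses_boundary(start, size, boundary=16):
    """True if the bytes [start, start + size) span more than one aligned chunk."""
    return start // boundary != (start + size - 1) // boundary


def nop_padding(address, boundary=16):
    """Number of one-byte NOPs needed to move 'address' up to the next boundary."""
    return (-address) % boundary


loop_start = 0x400E  # assumed address of a 4-byte loop body
pad = nop_padding(loop_start)
aligned_start = loop_start + pad
```

In this example the loop body initially straddles two chunks; after 2 bytes of padding it starts at 0x4010 and fits in one.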

25 MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER
ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)
Contributions and conclusions
1. Extensible assembly-to-assembly optimizer
2. Does not fit in the GCC flow, because not enough information is preserved after the RTL level
3. Discovers micro-architectural details semi-automatically through generation of micro-benchmarks

26 DYNAMIC REGISTER PROMOTION OF STACK VARIABLES
JIANJUN LI, CHENGGANG WU, WEI-CHUNG HSU
Use DBT to let x86 binaries use the extra registers on x86-64
Recompiling is not always an option (legacy binaries)
Compute-intensive applications gain speed when using 64-bit
Challenge: implicit stack accesses
Solved using page protection and stack switching (with shadow stack)

27 LANGUAGE AND COMPILER SUPPORT FOR AUTO-TUNING VARIABLE-ACCURACY ALGORITHMS
JASON ANSEL, YEE LOK WONG, CY CHAN, MAREK OLSZEWSKI, ALAN EDELMAN, SAMAN AMARASINGHE (MIT)
PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler
1. New programming language, toolchain and run-time environment
2. Technique for mapping variable-accuracy code to enable efficient auto-tuning

28 PRACTICAL MEMORY CHECKING WITH DR. MEMORY
DEREK BRUENING (GOOGLE), QIN ZHAO (MIT)
x86
Existing memory checking tools (e.g. Valgrind): slow, many false positives

29 A TRACE-BASED JAVA JIT COMPILER RETROFITTED FROM A METHOD-BASED COMPILER
HIROSHI INOUE, HIROSHIGE HAYASHIZAKI, PENG WU, TOSHIO NAKATANI (IBM)
Extend the compilation scope from methods to traces
Traces span multiple method invocations
More powerful than method inlining

30 A TRACE-BASED JAVA JIT COMPILER RETROFITTED FROM A METHOD-BASED COMPILER
HIROSHI INOUE, HIROSHIGE HAYASHIZAKI, PENG WU, TOSHIO NAKATANI (IBM)
Claim: current trace-JITs are immature
Keep the advanced optimization infrastructure by retrofitting

31 PHASE-BASED TUNING FOR BETTER UTILIZATION OF PERFORMANCE-ASYMMETRIC MULTICORE PROCESSORS
TYLER SONDAG AND HRIDESH RAJAN
Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores
Motivation: Trend towards performance asymmetry among cores of a single chip
Method: Statically partition the application into code sections that are likely to have similar runtime behavior. Exhibited runtime characteristics of representative sections are used to map the whole cluster
Results: 36% average process speedup with negligible overheads

32 Phase-based tuning
PHASE-BASED TUNING FOR BETTER UTILIZATION OF PERFORMANCE-ASYMMETRIC MULTICORE PROCESSORS

33 VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE
DORIT NUZMAN, SERGEI DYSHEL, ERVEN ROHOU, IRA ROZEN, ALBERT COHEN, AYAL ZAKS
Objective: Design a split vectorization framework and study how it compares to a monolithic one
Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse
Method: Mix and match existing open compilation tools, namely GCC and Mono
Results: Comparable to specialized monolithic offline compilers

34 Vectorizing for different platforms
VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE

35 Split vectorization scheme
VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE

36 Interoperable compilation flows
VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE


