WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.

Eliminating non-productive memory operations in narrow-bitwidth architectures
Indu Bhagat, Enric Gibert, Jesús Sánchez, Antonio González (UPC)
64-bit datapath: 64-bit addressing and high precision computing, 64-bit adder
16-bit integer datapath: 64-bit addressing and high precision computing, 16-bit adder
40% of computations need only a 16-bit datapath
Caveat: 64-bit computation becomes 8 * 16-bit computations (DBT)

Eliminating non-productive memory operations in narrow-bitwidth architectures: What does non-productive mean?
0x0000 0000 0000 0001 + 0x0000 0000 0000 0025 = 0x0000 0000 0000 0026
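
The addition on this slide can be replayed on a 16-bit datapath to see where the wasted work comes from. The sketch below is only one reading of the slide: it assumes a 64-bit value is handled as four 16-bit chunks and that a chunk counts as "productive" when either operand chunk or the incoming carry is non-zero. The names `split` and `narrow_add` are hypothetical and do not come from the talk.

```python
CHUNK_BITS = 16
CHUNKS = 4                      # assumption: 64 bits handled as 4 x 16-bit chunks
MASK = (1 << CHUNK_BITS) - 1


def split(value):
    """Split a 64-bit value into 16-bit chunks, least significant first."""
    return [(value >> (CHUNK_BITS * i)) & MASK for i in range(CHUNKS)]


def narrow_add(a, b):
    """Add two 64-bit values chunk by chunk.

    Returns (sum modulo 2**64, number of productive chunks).
    """
    carry = 0
    result = 0
    productive = 0
    for i, (x, y) in enumerate(zip(split(a), split(b))):
        # Assumed definition: the chunk does useful work only if some input
        # (operand chunk or incoming carry) is non-zero.
        if x or y or carry:
            productive += 1
        total = x + y + carry
        result |= (total & MASK) << (CHUNK_BITS * i)
        carry = total >> CHUNK_BITS
    return result, productive


total, productive = narrow_add(0x1, 0x25)
print(hex(total), productive, "of", CHUNKS, "chunks productive")
```

Under these assumptions only the lowest chunk of the slide's example carries information; loading or storing the three upper chunks would move nothing but zeros.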

Eliminating non-productive memory operations in narrow-bitwidth architectures: Contributions and conclusions
1. Narrow ISA offers more opportunities to remove non-productive memory operations
2. 50% of dynamic narrow operations are non-productive
3. Memory Productiveness Pruning: profile-guided, dynamic optimization

Energy efficient code generation for processors with exposed datapath
Dongrui She, Yifan He, Bart Mesman, Henk Corporaal (TUE)
Exposed datapath: software controls every movement in the data path
Example: transport-triggered architecture (Henk Corporaal)
Register file access reduction

Register Reuse Scheduling
Gergö Barany
Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation
Motivation: Spill code generated by the compiler has a crucial effect on program performance
Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)
Results: 8.9% less spilling, 3.4% smaller static spill costs
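
The method line can be made concrete with a small sketch. It assumes a reuse candidate is a pair of values where the second may take over the register of the first, which holds once every use of the first is scheduled before the definition of the second. That ordering is enforced by adding arcs to the DDG, unless they would create a cycle. The function names and the four-instruction example are hypothetical, not taken from the talk.

```python
from collections import defaultdict


def has_path(graph, src, dst):
    """Depth-first search: is dst reachable from src in the DDG?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def enforce_reuse(graph, uses_of_old, def_of_new):
    """Add arcs use -> def_of_new so the new value can reuse the old register.

    Returns False (and leaves the graph unchanged) if an arc would close a cycle.
    """
    # An instruction that both reads the old value and defines the new one
    # needs no extra arc.
    others = [u for u in uses_of_old if u != def_of_new]
    if any(has_path(graph, def_of_new, use) for use in others):
        return False
    for use in others:
        graph[use].add(def_of_new)
    return True


# Hypothetical basic block:
#   i1: v1 = load      i3: v3 = v1 + 1
#   i2: v2 = load      i4: v4 = v2 * 2
ddg = defaultdict(set, {"i1": {"i3"}, "i2": {"i4"}})

first = enforce_reuse(ddg, uses_of_old=["i3"], def_of_new="i4")   # v4 reuses v1's register
second = enforce_reuse(ddg, uses_of_old=["i4"], def_of_new="i3")  # would need i4 before i3
print(first, second, sorted(ddg["i3"]))
```

Once the first candidate is accepted, the second is rejected because it contradicts the ordering already fixed; any scheduler that respects the DDG then produces a schedule in which the accepted reuse is legal.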

Register Reuse Scheduling: Register allocation and spilling
Figure labels: virtual registers, physical registers, memory

Register Reuse Scheduling: Register allocation with reuse candidates
Figure labels: basic block; interference graph; definitely overlap; definitely no overlap; possible overlap; data dependence graph

Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling
Mounira Bachir, Sid-Ahmed-Ali Touati, Albert Cohen
Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)
Motivation: Code size is related to memory requirements and I-cache performance
Method: Strategically insert move operations without increasing II to split meeting graph components into smaller ones
Results: Good if there are enough functional units to perform the additional move operations, and acceptable execution time

Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling: Periodic register allocation
Rotating register file
Move operations: d-1 MOVs/iteration (d: iteration span of variables)
Loop unrolling: 3 * code size
Modulo variable expansion: using 9 registers instead of 8 (MAXLIVE = 8); figure shows a[i], b[i], c[i] for iterations i to i+2
Meeting graph: lifetime in cycles; the lifetime interval of c ends when the interval of b begins

Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling: Meeting graph
Figure: a[i], b[i], c[i] shown for eight iterations, i to i+7

Decomposing Meeting Graph Circuits to Minimise Kernel Loop Unrolling: Circuit decomposition
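
The trade-off between the unrolling schemes on these slides can be sketched numerically. The lifetimes below (a = 3, b = 3, c = 2 cycles at II = 1) are hypothetical values chosen only because they reproduce the slide's figures of 3x unrolling, 9 registers and 8 iterations; the talk does not state them. The sketch assumes the usual textbook rules: modulo variable expansion unrolls by the largest per-variable requirement and gives each variable the smallest divisor of that factor that covers it, while a meeting-graph allocation unrolls by the least common multiple of its circuit weights.

```python
from functools import reduce
from math import gcd


def mve(lifetimes, ii):
    """Modulo variable expansion: return (unroll factor, registers used)."""
    # Registers each variable needs so that overlapping instances do not clash.
    needed = {v: -(-lt // ii) for v, lt in lifetimes.items()}  # ceiling division
    unroll = max(needed.values())
    registers = 0
    for q in needed.values():
        # Smallest divisor of the unroll factor that is at least q.
        registers += min(d for d in range(q, unroll + 1) if unroll % d == 0)
    return unroll, registers


def lcm(a, b):
    return a * b // gcd(a, b)


def meeting_graph_unroll(circuit_weights):
    """Unroll factor of a meeting-graph allocation: lcm of the circuit weights."""
    return reduce(lcm, circuit_weights, 1)


lifetimes = {"a": 3, "b": 3, "c": 2}       # hypothetical lifetimes in cycles
print(mve(lifetimes, ii=1))                 # 3x unrolling, one register above 8
print(meeting_graph_unroll([8]))            # one circuit of weight 8
print(meeting_graph_unroll([4, 4]))         # hypothetical split into two circuits
```

With these numbers, modulo variable expansion spends one extra register to keep the unroll factor at 3, whereas a single meeting-graph circuit stays at 8 registers but unrolls 8 times. Splitting the circuit, here into a hypothetical pair of weight-4 circuits, is what brings the unroll factor down.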

2011 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION Main Conference

MAO – an Extensible Micro-Architectural Optimizer
Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani (Google)
Micro-architectural: not always documented
Proprietary compilers at advantage!
SPEC2000 int loop: +1 NOP instruction, -7% execution time
Example: instruction decoding in Core 2 in chunks of 16 bytes
Figure: SPEC2000 int loop with and without the NOP, relative to the 16-byte alignment boundary

MAO – an Extensible Micro-Architectural Optimizer: Contributions and conclusions
1. Extensible assembly-to-assembly optimizer
2. Does not fit in the GCC flow, because not enough information is preserved after the RTL level
3. Discovers micro-architectural details semi-automatically through generation of micro-benchmarks
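
The NOP example from the earlier MAO slide comes down to simple address arithmetic. The sketch below assumes only what the slide states, namely that instructions are decoded in 16-byte chunks. The loop address and size are hypothetical, and the helper names are not part of MAO.

```python
DECODE_CHUNK = 16  # bytes decoded at a time, per the slide


def chunks_spanned(start, size):
    """Number of 16-byte decode chunks touched by code at [start, start + size)."""
    first = start // DECODE_CHUNK
    last = (start + size - 1) // DECODE_CHUNK
    return last - first + 1


def nops_to_align(start):
    """One-byte NOPs needed to move start up to the next 16-byte boundary."""
    return (-start) % DECODE_CHUNK


loop_start, loop_size = 0x40F, 16           # hypothetical 16-byte loop body
padding = nops_to_align(loop_start)
print(chunks_spanned(loop_start, loop_size), "chunks before padding")
print(padding, "NOP byte(s) inserted")
print(chunks_spanned(loop_start + padding, loop_size), "chunk(s) after padding")
```

In this made-up case a single NOP byte pulls the loop from two decode chunks into one. This illustrates how a small alignment shift can change how many chunks a loop occupies; it does not claim to model the specific loop measured in the talk.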

Dynamic register promotion of stack variables
Jianjun Li, Chenggang Wu, Wei-Chung Hsu
Use DBT to let x86 binaries use the extra registers on x86-64
Recompiling is not always an option (legacy binaries)
Compute-intensive applications gain speed when using 64-bit
Challenge: implicit stack accesses
Solved using page protection and stack switching (with shadow stack)

Language and compiler support for auto-tuning variable-accuracy algorithms
Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, Saman Amarasinghe (MIT)
PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler
1. New programming language, toolchain and run-time environment
2. Technique for mapping variable-accuracy code to enable efficient auto-tuning
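
The time/accuracy trade-off can be illustrated outside PetaBricks with a toy tuner. The sketch below is plain Python, not PetaBricks syntax. It assumes a variable-accuracy algorithm exposes one knob (here the number of Newton iterations of a square-root routine) and that the tuner picks the cheapest setting that meets an accuracy target on some training inputs. All names are hypothetical.

```python
def newton_sqrt(n, iterations):
    """Approximate sqrt(n) for n >= 1 with a fixed number of Newton steps."""
    x = float(n)
    for _ in range(iterations):
        x = 0.5 * (x + n / x)
    return x


def worst_error(training_inputs, iterations):
    """Accuracy metric (assumed): largest residual |x*x - n| over the inputs."""
    return max(abs(newton_sqrt(n, iterations) ** 2 - n) for n in training_inputs)


def tune(training_inputs, target_error, max_iterations=50):
    """Return the smallest iteration count that meets the accuracy target."""
    for k in range(max_iterations + 1):
        if worst_error(training_inputs, k) <= target_error:
            return k
    return None  # target not reachable within the budget


inputs = [2.0, 10.0, 100.0]
coarse = tune(inputs, target_error=1e-1)
fine = tune(inputs, target_error=1e-6)
print("iterations for coarse target:", coarse)
print("iterations for fine target:", fine)
```

A looser accuracy target lets the tuner stop at a smaller iteration count, which is the kind of choice the slide says PetaBricks exposes to the compiler and auto-tuner.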

Practical memory checking with Dr. Memory
Derek Bruening (Google), Qin Zhao (MIT)
x86
Existing memory checking tools (e.g. Valgrind): slow, many false positives

A trace-based Java JIT compiler retrofitted from a method-based compiler
Hiroshi Inoue, Hiroshige Hayashizaki, Peng Wu, Toshio Nakatani (IBM)
Extend the compilation scope from methods to traces
Traces span multiple method invocations
More powerful than method inlining
Claim: current trace-JITs are immature
Keep the advanced optimization infrastructure by retrofitting

Phase-Based Tuning for Better Utilization of Performance-Asymmetric Multicore Processors
Tyler Sondag and Hridesh Rajan
Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores
Motivation: Trend towards performance asymmetry among cores of a single chip
Method: Statically partition the application into code sections that are likely to have similar runtime behavior. The runtime characteristics exhibited by representative sections are used to map the whole cluster
Results: 36% average process speedup with negligible overheads

Vapor SIMD: Auto-Vectorize Once, Run Everywhere
Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rozen, Albert Cohen, Ayal Zaks
Objective: Design a split vectorization framework and study how it compares to a monolithic one
Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse
Method: Mix and match existing open compilation tools, namely GCC and MONO
Results: Comparable to specialized monolithic offline compilers

Vapor SIMD: Vectorizing for different platforms

Vapor SIMD: Split vectorization scheme

Vapor SIMD: Interoperable compilation flows
