Xtensa C and C++ Compiler Ding-Kai Chen

Presentation transcript:

Xtensa C and C++ Compiler Ding-Kai Chen Tensilica, Inc dkchen@tensilica.com

Presentation Outline
- XCC history
- XCC target: Xtensa configurable processor
- XCC details with examples
  - User-defined C types
  - Operator overloading
  - VLIW scheduling
  - Auto-SIMD vectorization
  - Operation fusion
  - SWP changes

XCC History
- Got the first version of SGI Pro64 in May 2000
- First customer release, August 2001
- Release with IPA, August 2002
- Release with SWP, Feedback, VLIW, September 2004
- Release with GCC 4.2 Front End, October 2009
- Supports C and C++ applications
  - Other languages are not as important for embedded applications

Xtensa Core Architecture
- Xtensa Processor
  - 32-bit RISC processor targeting embedded dataplane applications
  - 16 32-bit general registers (AR)
  - 24-bit base instructions
  - Configurable at design time (not at run time)

Xtensa Configuration Options
- Many pre-defined options to choose from
  - Endianness
  - Windowed vs. non-windowed register file
  - Narrow (16-bit) instructions
  - Multipliers
  - Coprocessors (HiFi, Vectra, BBE, FP)
  - Specialized (e.g., MAX) instructions, etc.

Targeting XCC to Base Xtensa and Tensilica Configurations
- As part of retargeting to Xtensa, we used/added
  - Code-generator generator tool Olive for WHIRL to CGIR translation
    - Handles a lot of configuration-specific code
  - Support for Xtensa zero-overhead loop instructions
  - CG code-size optimization that commonizes instructions from control-flow predecessors
  - Feedback-directed speed vs. code-size tradeoff
  - Support for flexible VLIW formats
    - Formats of different bit widths and different numbers of issue slots

Tensilica Instruction Extension (TIE)
- TIE is a language to describe new custom:
  - Register files up to 512 bits wide
  - Instructions up to 128 bits
  - VLIW formats up to 15 slots
  - C types mapped to custom register files
  - Vectorization rules
  - Fusion patterns
  - Operator overloading

XCC Challenges
- Custom extensions in TIE are written at the customer site, so they cannot be configured at XCC build time
- Design goals:
  - Separation of config-independent code and config-dependent libraries
  - Re-targeting in minutes after TIE is designed or modified by the processor architect at the customer site
  - Programming new HW extensions as native C types/operations

Xtensa - Full Development Automation
[Flow diagram: inputs -> Xtensa Processor Generator* -> outputs, in minutes; iterate]
- Inputs
  1. Processor Configuration: select from menu
  2. Processor Extensions: explicit instruction description (TIE)
- Outputs
  - Complete Hardware Design: source pre-verified RTL, EDA scripts, test suite
    - Use standard ASIC/COT design techniques and libraries for any IC fabrication process
  - Customized Software Tools: C/C++ compiler, debuggers, simulators, RTOSes
- Business:
  - A few highly skilled engineers required for custom
  - Higher productivity, lower cost
  - Focus on HW differentiation with tools support
- Technical:
  - Easy to customize with check-boxes or TIE
  - Apply skilled engineering to differentiation
  - Use the tools to make more custom HW
* US Patent: 6,477,697

TIE register file and operation

    // new register file for int32x4
    // vectorization
    regfile v 128 16

    // a new C type based on <v> regfile
    // and has 128-bit size and
    // 128-bit alignment
    ctype int32x4 128 128 v

    operation add_v { out v vout, in v va, in v vb } {} {
        assign vout = { va[127:96] + vb[127:96],
                        va[95:64]  + vb[95:64],
                        va[63:32]  + vb[63:32],
                        va[31:0]   + vb[31:0] };
    }

in C:

    void vsum() {
        int i;
        int32x4* va = (int32x4*)a;
        int32x4* vb = (int32x4*)b;
        int32x4* vc = (int32x4*)c;
        for (i=0; i<VSIZE; i++) {
            // C intrinsic call
            vc[i] = add_v(va[i], vb[i]);
        }
    }

- add_v is an intrinsic call in C
- In WHIRL, it is an intrinsic_op -> optimizer friendly
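To make the lane-wise semantics of add_v concrete, here is a minimal host-side sketch in plain C. It is an emulation for illustration only, not Tensilica code: the names int32x4_emu, add_v_emu and vsum_emu are hypothetical, and the mapping of array lanes to bit ranges of the 128-bit register is left unspecified.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for the TIE ctype int32x4:
   four 32-bit lanes, 128 bits in total. */
typedef struct { int32_t lane[4]; } int32x4_emu;

/* Mirrors the add_v semantics: four independent 32-bit additions.
   The additions are done in unsigned arithmetic so that overflow wraps
   like a hardware adder; converting back to int32_t is implementation
   defined before C23 but wraps on mainstream compilers. */
static int32x4_emu add_v_emu(int32x4_emu va, int32x4_emu vb)
{
    int32x4_emu vout;
    for (int k = 0; k < 4; k++) {
        vout.lane[k] = (int32_t)((uint32_t)va.lane[k] + (uint32_t)vb.lane[k]);
    }
    return vout;
}

/* Counterpart of vsum() on the slide: n counts vectors, not scalars. */
static void vsum_emu(int32x4_emu *vc, const int32x4_emu *va,
                     const int32x4_emu *vb, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        vc[i] = add_v_emu(va[i], vb[i]);
    }
}
```

Calling vsum_emu on two arrays of int32x4_emu gives the same element values as a scalar loop over the underlying int32_t data, which is the property the vector intrinsic relies on.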

TIE C type support
- Each TIE C type maps to a new WHIRL mtype
- Each TIE regfile maps to an ISA_REGCLASS
- GCC FE declares new C types and new intrinsics (added new TIE_TYPE tree code)
- WGEN translates TIE C type references to WHIRL loads/stores
- Olive tool adds dynamic rules to handle new types and WHIRL opcodes
- Added TN_mtype() for register spills/reloads
- Made BE optimizations (CSE, ebo, etc.) work

TIE example – generated code

    #<loop> Loop body line 28, nesting depth: 1, iterations: 8
    #<loop> unrolled 4 times
    load_v  v0,a2,0     # [0*II+0] id:20 b+0x0
    load_v  v1,a3,0     # [0*II+1] id:19 a+0x0
    load_v  v2,a2,16    # [0*II+2] id:20 b+0x0
    load_v  v3,a3,16    # [0*II+3] id:19 a+0x0
    load_v  v4,a2,32    # [0*II+4] id:20 b+0x0
    load_v  v5,a3,32    # [0*II+5] id:19 a+0x0
    load_v  v6,a2,48    # [0*II+6] id:20 b+0x0
    load_v  v7,a3,48    # [0*II+7] id:19 a+0x0
    addi    a2,a2,64    # [0*II+8]
    addi    a3,a3,64    # [0*II+9]
    addi    a4,a4,64    # [0*II+10]
    add_v   v0,v1,v0    # [0*II+11]
    add_v   v1,v3,v2    # [0*II+12]
    add_v   v2,v5,v4    # [0*II+13]
    add_v   v3,v7,v6    # [0*II+14]
    store_v v0,a4,-64   # [0*II+15] id:21 c+0x0
    store_v v1,a4,-48   # [0*II+16] id:21 c+0x0
    store_v v2,a4,-32   # [0*II+17] id:21 c+0x0
    store_v v3,a4,-16   # [0*II+18] id:21 c+0x0

Total: 19/4 = 4.75 cycles per iteration

TIE updating ld/st

    // pre-increment load/store
    operation load_vu { out v vout, inout AR base, in simm8 offset }
                      { out VAddr, in MemDataIn128 } {
        assign VAddr = base + offset;
        assign vout = MemDataIn128;
        assign base = base + offset;
    }

    operation store_vu { in v vin, inout AR base, in simm8 offset }
                       { out VAddr, out MemDataOut128 } {
        assign VAddr = base + offset;
        assign MemDataOut128 = vin;
        assign base = base + offset;
    }

    proto int32x4_loadiu { out int32x4 vout, inout int32x4* base, in immediate offset } {} {
        load_vu vout, base, offset;
    }

    proto int32x4_storeiu { in int32x4 vin, inout int32x4* base, in immediate offset } {} {
        store_vu vin, base, offset;
    }

TIE updating ld/st
- XCC identifies updating ld/st operations
- Pre-biases ld/st bases to work with pre-increment
- Combines ld/st with addi in CG

    #<loop> Loop body line 28, nesting depth: 1, iterations: 32
    load_vu  v0,a2,16   # [0*II+0] id:20 b+0x0
    load_vu  v1,a3,16   # [0*II+1] id:19 a+0x0
    store_vu v2,a4,16   # [1*II+2] id:21 c+0x0
    add_v    v2,v1,v0   # [0*II+3]

Total: 4 cycles per iteration
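The pre-biasing step can be sketched in portable C. The sketch below is an emulation under stated assumptions, not compiler output: memory is modeled as a byte array, addresses are byte offsets into it, and the names load_vu_emu, store_vu_emu and vsum_updating are hypothetical. Each base starts one vector (16 bytes) before its array, so that the first pre-increment access lands on element 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { int32_t lane[4]; } int32x4_emu;   /* hypothetical 128-bit vector */

/* load_vu: VAddr = base + offset; vout = mem[VAddr]; base = base + offset. */
static int32x4_emu load_vu_emu(const unsigned char *mem, long *base, long offset)
{
    int32x4_emu v;
    *base += offset;
    memcpy(&v, mem + *base, sizeof v);
    return v;
}

/* store_vu: VAddr = base + offset; mem[VAddr] = vin; base = base + offset. */
static void store_vu_emu(unsigned char *mem, int32x4_emu vin, long *base, long offset)
{
    *base += offset;
    memcpy(mem + *base, &vin, sizeof vin);
}

/* c[i] = a[i] + b[i] over n vectors; the *_addr arguments are byte offsets
   of the arrays inside mem. */
static void vsum_updating(unsigned char *mem, long a_addr, long b_addr,
                          long c_addr, int n)
{
    const long step = (long)sizeof(int32x4_emu);    /* 16 bytes */
    assert(n >= 0);
    /* Pre-bias: start one step early so base + step hits element 0. */
    long pa = a_addr - step, pb = b_addr - step, pc = c_addr - step;
    for (int i = 0; i < n; i++) {
        int32x4_emu va = load_vu_emu(mem, &pa, step);
        int32x4_emu vb = load_vu_emu(mem, &pb, step);
        int32x4_emu vc;
        for (int k = 0; k < 4; k++) {
            vc.lane[k] = (int32_t)((uint32_t)va.lane[k] + (uint32_t)vb.lane[k]);
        }
        store_vu_emu(mem, vc, &pc, step);
    }
}
```

The loop body contains no separate address updates, which corresponds to the addi instructions disappearing from the generated code once each one is folded into an updating load or store.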

TIE operator overloading
- Check for TIE type operands and operator overloading in build_binary_op in c-typeck.c of GCC
- Build proper call to mapped TIE intrinsic

    // map "+" operator to add_v for
    // type int32x4
    operator "+" add_v

in C:

    void vsum_op() {
        int i;
        int32x4* va = (int32x4*)a;
        int32x4* vb = (int32x4*)b;
        int32x4* vc = (int32x4*)c;
        for (i=0; i<VSIZE; i++) {
            // more natural using C "+" syntax
            vc[i] = va[i] + vb[i];
        }
    }

TIE VLIW scheduling

    // add a 2-slot 64-bit VLIW format
    format flix0 64 {slot0,slot1}
    slot_opcodes slot0 { load_v, store_v, load_vu, store_vu, add_v }
    slot_opcodes slot1 { load_v, store_v, load_vu, store_vu, add_v }

.s output:

    #<loop> unrolled 2 times
    { # format flix0
    load_vu  v3,a2,32   # [0*II+0] id:20 b+0x0
    add_v    v5,v4,v3   # [1*II+0]
    }
    load_v   v0,a2,-16  # [0*II+1] id:20 b+0x0
    add_v    v2,v1,v0   # [1*II+1]
    load_v   v1,a3,16   # [0*II+2] id:19 a+0x0
    load_vu  v4,a3,32   # [0*II+2] id:19 a+0x0
    store_v  v2,a4,16   # [1*II+3] id:21 c+0x0
    store_vu v5,a4,32   # [1*II+3] id:21 c+0x0

Total: 4/2 = 2 cycles per iteration

TIE VLIW scheduling
- XCC initialization includes analysis of TIE VLIW formats
- Creates resources that model bundling constraints
- Consider a simpler case: 1 slot is allowed for each opcode
  - Each VLIW slot in a format is viewed as a resource
  - Different formats are treated separately
  - Each opcode consumes the resource of the slot it is allowed in
  - For a group of operations, if the total resource usage is within the limit -> they can be scheduled in the same cycle
- Gets complicated when multiple slots are allowed for opcodes
- Resource reservation modeling allows decoupling of scheduling and slot assignment in CG
- Extended resource reservation word type SI_RRW to arbitrary-length bit-vectors
- TI_RES_RES_Resources_Available() also checks for compatible formats
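The bundling question (can this group of operations share one cycle in a given format?) can be illustrated with a small C sketch. This is not the XCC resource-reservation model described above; it is a direct backtracking slot assignment, shown only to make the constraint concrete. The function name can_bundle and the encoding of allowed slots as a bit mask (bit s set means slot s is allowed) are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if every operation can be given its own slot.
   allowed[i] is a bit mask of the slots operation i may occupy in one
   format; used is the mask of slots already taken (pass 0 initially).
   Exhaustive search is acceptable because formats have few slots. */
static int can_bundle(const uint32_t *allowed, int n_ops, uint32_t used)
{
    assert(n_ops >= 0);
    if (n_ops == 0) {
        return 1;                       /* everything has been placed */
    }
    uint32_t candidates = allowed[0] & ~used;
    while (candidates != 0) {
        uint32_t slot = candidates & (0u - candidates);   /* lowest set bit */
        if (can_bundle(allowed + 1, n_ops - 1, used | slot)) {
            return 1;
        }
        candidates &= ~slot;            /* try the next allowed slot */
    }
    return 0;                           /* no free slot for this operation */
}
```

With one allowed slot per opcode this reduces to checking that no slot is requested twice. With several allowed slots per opcode the search may need to back out of an early choice, which is the complication the slide mentions and the reason a scheduler benefits from keeping slot assignment separate from the schedulability check.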

TIE auto-SIMD vectorization

    property vector_ctype {int32x4, int32, 4}
    property vector_proto {add_v, xt_add, 4}

in C:

    for (i=0; i<SIZE; i++) {
        c[i] = a[i] + b[i];
    }

with -O3 -LNO:simd -clist, in .w2c:

    int32x4 V_00;
    int32x4 V_;
    int32x4 V_0;
    int32x4 V_4;
    _INT32 i;
    for(i = 0; i <= 127; i = i + 4)
    {
        V_00 = *(int32x4 *)(&a[i]);
        V_ = *(int32x4 *)(&b[i]);
        V_0 = add_v(V_00, V_);
        V_4 = V_0;
        * (int32x4 *)(&c[i]) = V_4;
    }

TIE auto-SIMD vectorization
- Developed independently of (and before) the Open64 vectorizer
- Integrated into Phase 2 of LNO
- Scans all loops in a nest
  - Checks for the presence of vectorized versions of each op in the loop
  - Checks for stride-1 or invariant memory references
- Support for loads and stores with addresses not aligned to the vector type
  - Pre-load once before the vector loop
  - Subsequent loads in the vector loop combine with the prior loads
- Support for spatial reuse within a vector using a select instruction
  - E.g., a[i] + a[i+1] in the scalar loop
  - Only a single load is needed now for each iteration
  - Select instructions shuffle data from loads of consecutive iterations
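The spatial-reuse case a[i] + a[i+1] can be sketched on the host as follows. This is an illustrative emulation under assumptions, not what XCC emits: select_shift1 is a hypothetical stand-in for a select instruction, vectors are modeled as a struct of four lanes, n must be a multiple of 4, and the input array is assumed to have at least n + 4 readable elements.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { int32_t lane[4]; } int32x4_emu;   /* hypothetical 128-bit vector */

/* Stand-in for a select: lanes 1..3 of lo followed by lane 0 of hi,
   i.e. the vector that starts one element later in memory. */
static int32x4_emu select_shift1(int32x4_emu lo, int32x4_emu hi)
{
    int32x4_emu r = {{ lo.lane[1], lo.lane[2], lo.lane[3], hi.lane[0] }};
    return r;
}

/* c[i] = a[i] + a[i+1] for i in [0, n). */
static void adjacent_sum(int32_t *c, const int32_t *a, int n)
{
    int32x4_emu cur, next, shifted, sum;
    assert(n >= 0 && n % 4 == 0);
    memcpy(&cur, &a[0], sizeof cur);               /* pre-load once before the loop */
    for (int i = 0; i < n; i += 4) {
        memcpy(&next, &a[i + 4], sizeof next);     /* the only load per iteration */
        shifted = select_shift1(cur, next);        /* holds a[i+1] .. a[i+4] */
        for (int k = 0; k < 4; k++) {
            sum.lane[k] = (int32_t)((uint32_t)cur.lane[k] + (uint32_t)shifted.lane[k]);
        }
        memcpy(&c[i], &sum, sizeof sum);
        cur = next;                                /* reuse this load next time */
    }
}
```

Without the select, each iteration would need two vector loads (one at a[i] and one at the misaligned a[i+1]); carrying the previous load across iterations leaves a single load per iteration.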

TIE operation fusion

    imap add_shift_v { out v vout, in v va, in v vb, in immediate amount }
    {
        {} {
            // the output pattern
            add_shift_v vout, va, vb, amount;
        }
    }
    {
        { v v_temp } {
            // the input pattern
            add_v v_temp, va, vb;
            shift_v vout, v_temp, amount;
        }
    }

- Combines multiple operations into one
  - E.g., combines an add followed by a shift into one add_shift operation
- Performed in CG
  - Builds dataflow graphs from input patterns
  - Repeatedly searches for matches in BBs
- Peephole optimization with custom patterns

TIE operation fusion
- Example C code:

    for (i=0; i<VSIZE; i++) {
        vc[i] = (va[i] + vb[i]) << 2;
    }

- Original schedule is 5 cycles / 2 iter = 2.5 cycles per iteration
- New schedule with operation fusion is 4 cycles / 2 iter = 2 cycles per iteration
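A much simplified version of the add-then-shift fusion can be sketched as a peephole pass over one basic block. This is not the CG implementation, which matches dataflow graphs built from the imap patterns; the op_rec layout, the opcode enum and fuse_add_shift are all hypothetical. The sketch assumes that every virtual register is defined once in the block and that the add result is not live after the block.

```c
#include <assert.h>

typedef enum { OP_NOP, OP_ADD_V, OP_SHIFT_V, OP_ADD_SHIFT_V } opcode_kind;

/* One operation: dst = op(src1, src2, imm). A source of -1 means unused. */
typedef struct { opcode_kind op; int dst; int src1; int src2; int imm; } op_rec;

/* Counts how many times reg is read by live (non-NOP) operations. */
static int count_uses(const op_rec *ops, int n, int reg)
{
    int uses = 0;
    for (int i = 0; i < n; i++) {
        if (ops[i].op == OP_NOP) continue;
        if (ops[i].src1 == reg) uses++;
        if (ops[i].src2 == reg) uses++;
    }
    return uses;
}

/* Rewrites add_v t,a,b ... shift_v d,t,imm into add_shift_v d,a,b,imm
   when t has exactly one use. Returns the number of fusions performed. */
static int fuse_add_shift(op_rec *ops, int n)
{
    int fused = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (ops[i].op != OP_ADD_V) continue;
        int temp = ops[i].dst;
        if (count_uses(ops, n, temp) != 1) continue;   /* temp needed elsewhere */
        for (int j = i + 1; j < n; j++) {
            if (ops[j].op == OP_SHIFT_V && ops[j].src1 == temp) {
                ops[j].op = OP_ADD_SHIFT_V;            /* the output pattern */
                ops[j].src1 = ops[i].src1;
                ops[j].src2 = ops[i].src2;
                ops[i].op = OP_NOP;                    /* the add is now dead */
                fused++;
                break;
            }
        }
    }
    return fused;
}
```

In the unrolled example loop, each iteration's add_v and shift_v pair collapses into one operation, which is where the reduction from 5 to 4 cycles per two iterations comes from.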

XCC SWP scheduler
- Xtensa has no rotating registers: added 2 register allocators, simple and coloring
  - Use simple first to get a tighter bound, then try coloring
- Performance is critical: added back-tracking for the following
  - Unrolling (hard to guess best unrolling)
  - Different priority heuristics for choosing candidates
  - Different initial op orderings
  - Register allocation failures
- Runs slightly longer but complements the original IA-64 based SWP algorithm well

Conclusion
- Open64 is versatile in providing optimized performance for embedded applications.
- XCC experience shows that many of the optimizations can be adapted to retarget for ISA extensions quickly.
- Sample performance data: the EEMBC Consumer benchmark gained a 6x speedup with automatic vectorization + VLIW scheduling + operation fusion.
- The XCC solution is not final. It is still evolving with new HW features offered by Tensilica.
  - Want to explore new ways in TIE to describe HW that supports optimizations.

Tensilica is looking for new talent to join the compiler team. http://www.tensilica.com dkchen@tensilica.com