Published by Lionel Lester. Modified over 8 years ago.
1
ID 130L: Optimizing Your SH2A Application
Kevin P King, Senior Staff Applications Engineer
14 October 2010, Version 1.2
Renesas Electronics America Inc. © 2010 Renesas Electronics America Inc. All rights reserved.
2
Kevin P King
Education: Electrical Engineering, University of Lowell (Edward B Van Dusen Award for Academic Achievement)
- Thirty years of embedded design experience (x86, HC05, HC11, 8051, Philips XA, Atmel AVR, Hitachi, Mitsubishi, etc.)
- Five years of emulator design for MetaLink: COP8, 68HC05, 68HC11, 8051 (multi-vendor), National CR16, Hitachi H8/500, etc.
- Multiple quality awards for embedded software and hardware development
- Specialty is embedded system design: MCU firmware and hardware
Senior Staff Application Engineer
- Primary tech support for SH2A
- Focusing on the medical segment and the SH family
3
Renesas Technology and Solution Portfolio
- Microcontrollers & Microprocessors: #1 market share worldwide*
- Analog and Power Devices: #1 market share in low-voltage MOSFET**
- ASIC, ASSP & Memory: advanced and proven technologies
- Solutions for Innovation
* MCU: 31% revenue basis from Gartner "Semiconductor Applications Worldwide Annual Market Share: Database", 25 March 2010
** Power MOSFET: 17.1% on unit basis from Marketing Eye 2009
5
5 © 2010 Renesas Electronics America Inc. All rights reserved. 5 Microcontroller and Microprocessor Line-up Superscalar, MMU, Multimedia Up to 1200 DMIPS, 45, 65 & 90nm process Video and audio processing on Linux Server, Industrial & Automotive Up to 500 DMIPS, 150 & 90nm process 600uA/MHz, 1.5 uA standby Medical, Automotive & Industrial Legacy Cores Next-generation migration to RX High Performance CPU, FPU, DSC Embedded Security Up to 10 DMIPS, 130nm process 350 uA/MHz, 1uA standby Capacitive touch Up to 25 DMIPS, 150nm process 190 uA/MHz, 0.3uA standby Application-specific integration Up to 25 DMIPS, 180, 90nm process 1mA/MHz, 100uA standby Crypto engine, Hardware security Up to 165 DMIPS, 90nm process 500uA/MHz, 2.5 uA standby Ethernet, CAN, USB, Motor Control, TFT Display High Performance CPU, Low Power Ultra Low Power General Purpose
6
Microcontroller and Microprocessor Line-up (same chart as the previous slide, with the SuperH label added)
7
7 © 2010 Renesas Electronics America Inc. All rights reserved. Innovation Engine Control Unit What used to take multi-MCU or MCU + DSP can now be done by a single MCU!
8
Position: Renesas provides the tools that allow you to use the superscalar architecture to realize system performance that in the past required dual-processor designs containing both an MCU and a DSP.
9
Agenda
- Short architecture review
- Default optimization choices (what do they do)
- Delayed branching and delay slot usage
- FPU code controls
- In-lining code
- Misc
  - TBR usage
  - Section control
- Inline assembly code (optional)
10
© 2010 Renesas Electronics America Inc. All rights reserved. 10 SH2/SH2A Architecture* *SH Core training is available on RenesasInteractive
11
Super Scalar versus Dual Core
- Scalar: one thread, one instruction at a time. Single instruction stream, single pipeline (fetch, decode, execute).
- Super Scalar: one thread, multiple instructions at a time. For SH2A: 2 fetch, 2 decode, 2 execute.
- Dual Core: 4 instructions at a time, 2 independent threads.
12
SH-2A Features: Superscalar Pipeline / Floating Point Unit
[Figure: 5-stage superscalar pipeline, stages 1 to 5; labels: SH-2A-FPU, CPU core only, CPU, FPU]
13
13 © 2010 Renesas Electronics America Inc. All rights reserved. Register Set – SH2 SH2A
14
SH-2A Register Banks
- Regbank settings: Disabled, All Ints Banked, Banked by Priority
- Two new interrupts: Bank Overflow, Bank Underflow
- New HEW window: CPU Register Banks
16
SH-2A Fast Interrupt Response
- Typical MCUs: INT trigger, CPU latency, save context (by compiler), user code, restore context
- SH-2A MCU: INT trigger, 9 cycles (CPU latency + save context), user code, restore context
- Latency: 9 cycles
- 15 register banks (LIFO) + one primary register bank; hardware saves the context in the register bank LIFO
17
QUESTION: Does register banking simplify and speed up my ISR context switch when using the FPU? (Be careful with your answer.)
Yes, register banking always helps you context switch; however, FPU registers are not banked and thus must be saved on the stack if they are used in the ISR.
18
© 2010 Renesas Electronics America Inc. All rights reserved. 18 FPU review* * Full SH2A-FPU training available at RenesasInteractive
19
FPU Registers
- Load/store integers through the FPUL register
- 16 single-precision registers: FPR0-FPR15
- 8 double-precision registers: DR0-DR14 (use even numbers), created by concatenating 2 FP registers
- Configured in software by the MCU: FPSCR.SZ controls transfer size, FPSCR.PR controls precision
[Figure: FPUL and FPSCR registers, bits 31 to 0]
20
Pop Quiz: The SH2A-FPU core can handle (choose the BEST answer):
a) Single precision
b) Double precision
c) Both single and double
d) Both single and double, but requires run-time configuration changes if using mixed precisions in your code
e) None of the above
Answer: d. The FPU can handle both, but it must switch between modes if your code contains "mixed" arithmetic. We will examine this in the lab so you can get optimal code when doing floating point.
21
QUESTION: The SH2A-FPU core is a load/store architecture. In order to get information into and out of the floating-point registers, you must go through FPUL (the floating-point communication register). TRUE or FALSE?
FALSE. You only need FPUL to communicate between the integer registers and the floating-point registers, so you are only "penalized" if you do a lot of integer/float transfers. Floating-point data may be moved directly between memory and FPU registers.
22
SH2A Bus Structure (Harvard Architecture)
[Figure: block diagram with the following labels]
- SH-2A CPU (superscalar), FPU
- F bus (instruction), M bus (data)
- On-chip RAM, on-chip Flash
- Cache controller; cache memory: instruction/data cache 8KB/8KB, 4-way set associative (LRU)
- I bus (internal bus), DMAC/DTC, Bus State Controller, external bus (SDRAM, SRAM, etc. I/F)
- Bridge, P bus (peripheral bus): Timers, ADC, SCI, PORT
- Bus width/timing labels: 32bit/1cyc, 16bit/3cyc
23
Example: SH2A SRAM Connection Details
- Multiple connections to the I, M and F buses
- Independent read/write ports
- Priority: I, M, then F (in case of multiple accesses to the same page*)
* For example, DMAC + CPU
24
24 © 2010 Renesas Electronics America Inc. All rights reserved. Multi-page RAM Access Conflict when accessing same page No conflict when different pages RAM
25
25 © 2010 Renesas Electronics America Inc. All rights reserved. Questions before we start the lab?
26
Start the Lab
Keep your die turned to the section of the lab you are on. (Instructions are provided in the lab handout.) Please refer to the Lab Handout and let's get started!
27
27 © 2010 Renesas Electronics America Inc. All rights reserved. Checking Progress We are using the die to keep track of where everyone is in the lab. Make sure to update it as you change sections. When done with the lab, your die will have the 6 pointing up as shown here.
28
Questions Section 1:
1.1 No, Debug is slowest; this surprises most people.
1.2 The Debug setting does NOT use delay slots. This allows for "sequential code execution" for easy debugging.
1.3 Speed uses the delay slot after the branch to cut the loop iteration count in half.
RULE #1: Let the compiler do its job! SH2A gets maximum performance when the compiler is allowed to reorder code to avoid pipeline stalls and use the delay slot, which might otherwise be a wasted fetch and decode.
29
Questions Section 1: No optimization. Delay slot used, but the loop count is still 4.
30
Questions Section 1: Speed optimization. Delay slot used, and FMAC duplicated to cut the loop count in half.
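The Speed result (the multiply-accumulate duplicated so the loop runs half as many times) corresponds to unrolling by two at the source level. The following is a minimal sketch in portable C, not the lab code; the function names and sample data are hypothetical, and in practice the compiler performs this transformation for you.

```c
/* Hypothetical illustration of unrolling a multiply-accumulate loop by two.
   Names and data are assumptions, not taken from the lab project. */

/* Straightforward loop: 4 iterations, one multiply-accumulate each. */
float dot4_rolled(const float *a, const float *b)
{
    float acc = 0.0f;
    for (int i = 0; i < 4; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Unrolled by two: 2 iterations, two multiply-accumulates each. */
float dot4_unrolled(const float *a, const float *b)
{
    float acc = 0.0f;
    for (int i = 0; i < 4; i += 2) {
        acc += a[i] * b[i];
        acc += a[i + 1] * b[i + 1];
    }
    return acc;
}

/* Small fixed data set so both versions can be compared. */
static const float sample_a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
static const float sample_b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };

float demo_rolled(void)   { return dot4_rolled(sample_a, sample_b); }
float demo_unrolled(void) { return dot4_unrolled(sample_a, sample_b); }
```

Both versions accumulate in the same order, so they return the same value; only the number of loop branches changes.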
31
Questions Section 2:
2.1 No, it has some extra code:
    120   result = factorial((unsigned int)3);
    000014A4 D7A6   MOV.L  @(H'0298:8,PC),R7
    000014A6 2F76   MOV.L  R7,@-R15
    000014A8 086A   STS    FPSCR,R8
    000014AA E403   MOV    #H'03,R4
    000014AC 28E9   AND    R14,R8
    000014AE B17D   BSR    @_factorial:12
    000014B0 486A   LDS    R8,FPSCR
    000014B2 7F04   ADD    #H'04,R15
The FPU mode must be changed because Mixed precision was used.
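The STS / AND / LDS instructions in this listing form a read-modify-write of FPSCR around the call. A rough C model of that kind of update is sketched here. The bit position used for FPSCR.PR (bit 19) is an assumption that should be checked against the SH-2A hardware manual, and the function names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: FPSCR.PR (precision select) is bit 19.
   Confirm the position in the hardware manual before relying on it. */
#define FPSCR_PR_MASK (UINT32_C(1) << 19)

/* Clear PR: select single precision, leaving every other bit unchanged. */
uint32_t fpscr_to_single(uint32_t fpscr)
{
    return fpscr & ~FPSCR_PR_MASK;
}

/* Set PR: select double precision, leaving every other bit unchanged. */
uint32_t fpscr_to_double(uint32_t fpscr)
{
    return fpscr | FPSCR_PR_MASK;
}
```

The point of the model is only that each switch costs a register read, a mask operation, and a register write, which is the extra code the answer to 2.1 refers to.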
32
Questions Section 2:
2.2 Looks more like what you would expect.
2.3 You should see about 40 ns of savings in "safe mode" when doing function calls.
RULE #2: Decide your math requirements up front! If possible, choose Single or Double precision. If not, use safe mode and take the minimal hit when you do need to do Double precision.
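One common way to end up with mixed precision in C without intending to is an unsuffixed floating constant, which has type double and promotes the whole expression. The sketch below is plain standard C, not the lab code, and its function names are hypothetical. Whether double is really wider than float on the target depends on compiler settings, which is an assumption here.

```c
#include <stddef.h>

/* The unsuffixed constant 1.5 is a double, so the multiply is done in
   double and the result is converted back to float: mixed precision. */
float scale_mixed(float x)
{
    return (float)(x * 1.5);
}

/* The f suffix keeps the whole expression in single precision. */
float scale_single(float x)
{
    return x * 1.5f;
}

/* sizeof reports the type the compiler assigns to each product. */
size_t mixed_product_size(void)
{
    float x = 2.0f;
    return sizeof(x * 1.5);
}

size_t single_product_size(void)
{
    float x = 2.0f;
    return sizeof(x * 1.5f);
}
```

Both functions return the same value for simple inputs; the difference is the arithmetic type used, which is what forces the FPU mode change.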
33
Questions Section 2:
2.4 50%: 2 bytes, half the size, single-instruction call.
2.5 No, because they were already single-instruction calls.
Using TBR:
    95   HardwareSetup(); // Use Hardware Setup
    00000814 8300   JSR/N  @@(H'0000:8,TBR)
Not using TBR:
    95   HardwareSetup(); // Use Hardware Setup
    00000816 D70C   MOV.L  @(H'0030:8,PC),R7
    00000818 474B   JSR/N  @R7
Seems small, but think about how many calls are in your code!
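The TBR form fits in one instruction because the call goes through a table of function addresses: assuming the usual reading of JSR/N @@(disp8,TBR), the target address is fetched from the table entry at TBR + disp × 4. A host-side C model of that indirection is sketched below. The table, the counter, and the helper names are hypothetical; on the real part the toolchain builds the table.

```c
/* Host-side model of a TBR-style call table. All names except
   HardwareSetup are hypothetical. */
typedef void (*tbr_entry_t)(void);

static int setup_calls;   /* counts calls, for demonstration only */

void HardwareSetup(void)
{
    setup_calls++;
}

/* The table that TBR would point at; entry 0 matches displacement H'0000. */
static const tbr_entry_t tbr_table[] = { HardwareSetup };

/* Models the table lookup plus call: index the table, then call the entry. */
void call_via_tbr(const tbr_entry_t *tbr, unsigned disp)
{
    tbr[disp]();
}

/* Resets the counter, makes one table call, and reports how many
   times HardwareSetup ran. */
int demo_tbr_call(void)
{
    setup_calls = 0;
    call_via_tbr(tbr_table, 0u);
    return setup_calls;
}
```

The trade-off is one extra memory read per call in exchange for dropping the literal load instruction at every call site.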
34
Questions Section 3:
- Performance should get progressively better.
- The number of registers saved decreases with each "optimization".
- When the code is inlined rather than called as functions, the compiler can see which registers it needs to use and thus save.
RULE #3: Reduce your interrupt overhead. Use regbanking where possible. Use inline code in your ISR to save the pushing/popping of the FPU register set.
35
35 © 2010 Renesas Electronics America Inc. All rights reserved. Questions Section 4: Nothing in source indicates in-lined code
36
Questions Section 4:
4.2 You go to the function, but it is really at the same PC range where you came from in main.
4.3 Just show_simple. We basically told it that start_timer could not be inlined.
4.4, 4.5 See HINT. We still had show_simple and show_addressing inlined. Be careful: when you select detailed optimization, it does not change with the "global settings".
4.6 Yes.
RULE #4: In-lining can be used to keep your code "pretty" while making it run faster by reducing function calls. Control the implicit and explicit inlining to take advantage of tiny speed improvements. Your code still looks logically like you intended.
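As a portable point of comparison for explicit inlining, standard C offers the inline keyword. The lab itself controls inlining through the compiler's settings, so this is only a sketch: show_simple follows the function shown in the appendix, while compute_area is a hypothetical caller.

```c
/* Explicit inlining request in standard C (C99 and later). */
static inline float show_simple(int x, int y)
{
    return (float)(x * y);   /* integer multiply, then convert to float */
}

/* Hypothetical caller: reads like a function call in the source, but the
   compiler may expand show_simple in place and skip the call overhead. */
float compute_area(int width, int height)
{
    return show_simple(width, height);
}
```

The source keeps its logical structure either way; only the generated code changes when the call is expanded in place.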
37
Questions Section 5:
5.1 You should have seen about 100 ns of savings.
RULE #5: When possible, take advantage of "free time". By understanding where your high-frequency accesses to buffers may be, simple control of their location gives you performance enhancements without changing code functionality at all.
38
Questions Section 6:
6.1 Sort of tongue in cheek; lots of errors, of course.
6.2 On main_variable.
RULE #6: When using assembler code, be aware of the variables you may want to "watch". Some type information is lost by generating a .src file and then an obj. This is probably even worse for complex structures.
39
Lab Summary
- Review
- Default optimization choices
- Delayed branching and delay slot usage
- FPU code controls
- Misc
  - TBR usage
  - Section control
- In-lining code
- In-line assembly code
* This will be repeated many times in the lab
40
40 © 2010 Renesas Electronics America Inc. All rights reserved. Innovation Engine Control Unit What used to take multi-MCU or MCU + DSP can now be done by a single MCU!
41
© 2010 Renesas Electronics America Inc. All rights reserved. 41 Thank You!
42
© 2010 Renesas Electronics America Inc. All rights reserved. 42 Appendix: Additional Information
43
FPU Load/Store operation
    float show_simple(int x, int y) { return(x*y); }

    MOV   R5,R0
    MULR  R0,R4      ; multiply passed parameters
    LDS   R4,FPUL    ; move result to FPU communications register
    RTS
    FLOAT FPUL,FR0   ; convert to float
44
FPU Load/Store operation - Float
[Figure: the instruction sequence MOV R5,R0; MULR R0,R4; LDS R4,FPUL; RTS; FLOAT FPUL,FR0 alongside the FPUL and FPSCR registers, bits 31 to 0; values shown: 0x10, 0x100, 2.56e03]
45
FPU Load/Store operation - Double
[Figure: FPUL and FPSCR registers, bits 31 to 0; values shown: 0x10, 0x100, 2.56e03]
46
Renesas Electronics America Inc.