Designing and Optimizing Software for Intel® Architecture Multi-core Processors Peter van der Veen QNX Software Systems.

1 Designing and Optimizing Software for Intel® Architecture Multi-core Processors Peter van der Veen QNX Software Systems

2 Overview
• Software and system vendors continue to add features and capabilities that demand more and more CPU performance
• Microprocessor vendors can no longer scale performance simply by increasing clock speed
  ► Thermal considerations
  ► Design complexity
• Trend to include multiple processor cores on a single die
• Multi-core designs address performance issues
  ► Favorable power / performance ratio for embedded systems
  ► Decreased board area
• Companies that can leverage the full capabilities of hardware can achieve a competitive advantage
[Diagram: several CPUs connected through bridges]

3 Multi-core Architectures
• Increased integration on die
  ► Multiple CPU cores and caches
  ► High speed, on-chip system interconnect
    - Greatly reduces latency associated with a traditional board-level interconnect
• Memory controller(s) on system bus
  ► Allows separation of memory for asymmetric operation
• On-chip peripherals on system bus
  ► Maximizes peripheral throughput
  ► Reduces latency
[Diagram: single die containing CPUs with caches, system interconnect, I/O, and memory controller]

4 Intel Evolution of Parallelism
Legend:
• AS: Architectural State (registers, flags, timestamp counter, etc.)
• APIC: Advanced Programmable Interrupt Controller
• PER: Processor Execution Resources (caches, execution units, instruction decode, bus interface, etc.)
[Diagram, five configurations:
  Classic Uniprocessor: one die with one AS, one APIC, and one PER
  Hyper-Threading Technology* (HT Technology): one die with two AS/APIC pairs sharing one PER
  Multi-core: one die with two AS/APIC/PER sets sharing an L2 cache and bus interface
  Classic SMP: a chipset connecting two processors, each with its own AS, APIC, and PER
  Symmetrical Multi-Processing (SMP) with Multi-core: a chipset connecting multi-core dies, each with its own bus interface]
All of these forms of parallelism are in use today.
* Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Processor supporting HT Technology and an HT Technology enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See for more information including details on which processors support HT

5 QNX and Multi-core
• QNX has done the heavy lifting to enable migration to multi-core
  ► Let developers focus on product differentiation
• Reliable, proven support for multi-core applications
  ► 1997: Industry's first to bring SMP to embedded
  ► 1984: High performance, transparent distributed messaging
  ► Full support for asymmetric and symmetric multiprocessing
  ► Linux, VxWorks interoperability
• Migrate existing software base and enable new multi-core optimized applications
• Multi-core capable tool suite
• World class professional services and expert training
• Active role in developing standards through Multi-core Exchange consortium
  ► Enable portability of applications across various platforms
  ► Derive common set of APIs that multi-core development tools can utilize to support interoperability

6 Microkernel Architecture
Microkernel is the only trusted component.
Applications and drivers:
• Are processes that plug into a message bus
• Reside in memory-protected address space
• Cannot corrupt other software components
• Can be started, stopped, and upgraded on the fly
[Diagram: Microkernel and Process Manager on a message bus, with File System, Protocol Stack, Audio Driver, Graphics Driver, and Application attached as separate processes]

7 Multiprocessing Models
Asymmetric:
• Two cores, two OSs
• Same (homogeneous) or different (heterogeneous) OS
Symmetric:
• Two cores, one OS
[Diagram: asymmetric shows OS 1 and OS 2, each on its own CPU; symmetric shows a single OS instance spanning both CPUs]

8 Asymmetric Processing
• Asymmetric Model Pros:
  ► Only possible mode when different OSs are running
  ► CPU core can be dedicated to specific applications
  ► One possible mode for applications that cannot operate with parallel processing
• Asymmetric Model Cons:
  ► Resource sharing / arbitration needs to be designed into system by developers
    - Neither OS "owns" the whole system
    - Memory, I/O, interrupts are shared
    - Evolution: complexity increases as cores are added
    - Static configuration, difficult to add dynamic resourcing
    - Time to market?
    - Any HW contention must be dealt with by designer
  ► Synchronization between cores done through application level messages
    - Sub-optimal performance
    - Complexity of the problem is not linear
    - Addition of cores may require re-architecting application to increase performance
Managing shared resources complicates design.
[Diagram: two CPUs with caches on a system interconnect with I/O and a memory controller; memory is split into OS 1 Memory, OS 2 Memory, and Shared Memory; applications run on OS 1 and OS 2]

9 Neutrino Homogeneous AMP: Transparent Distributed Processing
• Extends message passing bus over a transport layer
• Applications / services can be built in a fully distributed manner without special code
  ► Message queues
  ► File systems
  ► Hardware ports
• Seamless sharing of I/O resources between cores (e.g. use a serial port "owned" by another core)
Local access: fd = open("/dev/ffs1",…); write(fd, …);
Access to the resource on core0 from another core: fd = open("/net/core0/dev/ffs1",…); write(fd, …);
[Diagram: Core 0 (Microkernel, Application, Flash File System, Networking Stack, Message Queues, Internet link) and Core 1 (Microkernel, Application, Database, Flash File System) joined by a message-passing bus over a Message Bridge (Ethernet, RapidIO, Shared Memory)]
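Note: the point of the two open() calls on this slide is that only the path changes when the resource lives on another core. A minimal sketch of that idea in POSIX C follows. The helper name write_resource and the O_WRONLY flag are assumptions made for illustration; the slide elides the actual open() arguments.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to the resource named by path.
 * Returns the number of bytes written, or -1 on failure.
 * Nothing in this function depends on where the resource lives; under
 * transparent distributed processing, a path such as /net/core0/dev/ffs1
 * would be handled by the same calls as /dev/ffs1.
 */
ssize_t write_resource(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY);   /* flags are an assumption */
    if (fd == -1)
        return -1;

    ssize_t written = write(fd, buf, len);
    close(fd);
    return written;
}
```

On a system laid out as in the slide, the local call would pass "/dev/ffs1" and the remote call "/net/core0/dev/ffs1", with no other change to the caller.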

10 Symmetric Processing
• Symmetric Model Pros:
  ► Highly scalable. Supports multiple processing cores seamlessly without code modification
  ► One OS "sees all" and handles all resource sharing / arbitration issues
  ► Dynamic load balancing handles processing bursts with OS thread scheduling
  ► Dynamic memory allocation: all cores can draw on full pool of available memory without penalty
  ► High performance inter-core messaging and synchronization
    - Core-to-core synchronization using OS primitives
  ► System wide statistics / information gathering capability for performance optimizations, debugging, etc.
• Symmetric Model Cons:
  ► Load balancing is dynamic and application may require dedicated CPU
  ► Applications with poor synchronization among threads may not work properly
    - Difficult to change software
    - 3rd-party software
[Diagram: two CPUs with caches on a system interconnect with I/O and a memory controller; one memory pool, one OS, applications on top]

11 Multi-core Scaling Software
• QNX conforms to POSIX (Portable Operating System Interface) Application Programming Interface
  ► Allows straightforward porting of code from one OS to another that is also conformant
• Application broken down into memory protected units called processes
• Processes further divided into internal, schedulable units called threads
  ► Threads share all of the same resources (memory space included)
• PROCESSES run on individual cores concurrently in asymmetric mode (all threads for a process are tied to one core)
• THREADS run on individual cores concurrently in symmetric operation
[Diagram: an application made of processes, each process containing threads]
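Note: because threads in a process share one memory space, they can work on the same data, and on an SMP system they may run on different cores at the same time. The sketch below uses standard POSIX threads to show several threads updating one shared counter under a mutex. The names run_workers, worker, struct shared, and MAX_WORKERS are illustrative and do not come from the slides.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16   /* illustrative upper bound */

/* State shared by every thread in the process. */
struct shared {
    pthread_mutex_t lock;
    long counter;
    int iterations;
};

/* Each thread increments the same counter; the mutex serializes updates. */
static void *worker(void *arg)
{
    struct shared *s = arg;
    for (int i = 0; i < s->iterations; i++) {
        pthread_mutex_lock(&s->lock);
        s->counter++;
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/*
 * Start nthreads threads, wait for all of them, and return the final
 * counter value, or -1 if the arguments are invalid or a thread could
 * not be created.
 */
long run_workers(int nthreads, int iterations)
{
    if (nthreads < 1 || nthreads > MAX_WORKERS || iterations < 0)
        return -1;

    struct shared s = { .counter = 0, .iterations = iterations };
    pthread_mutex_init(&s.lock, NULL);

    pthread_t tids[MAX_WORKERS];
    int started = 0;
    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, worker, &s) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&s.lock);
    return started == nthreads ? s.counter : -1;
}
```

The mutex is what keeps this correct when the threads really do run in parallel, which is the situation slide 10 warns about for applications with poor synchronization among threads.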

12 Active Threads and Ready Queues: SMP
[Diagram: per-priority ready queues (priority 255, 254, 253 ... 0) hold ready threads from all processes; the running threads occupy CPU 0 and CPU 1; other threads are in blocked states]

13 AMP or SMP?
• Sometimes this can be a clear cut decision
  ► Two operating systems = AMP
  ► Application requires all available CPUs to maximize performance = SMP
• What if the versatility of SMP is desired but the control of AMP is needed?

14 QNX Bound Multiprocessing: The Best of Both Worlds
• Benefits of both AMP and SMP
• Support legacy code base and multi-core optimized applications simultaneously
  ► Supports bound and symmetric operation, selectable by process / thread
• Designer has full control over applications
  ► Applications and/or threads can be "bound" to a specific core
  ► Restrictive CPU usage as decided by designer
• Load balancing
  ► OS dynamic or designer controlled
  ► Tools to optimize load balancing
  ► Resource sharing handled by OS
• Single OS has full visibility and control
  ► Resource sharing handled by OS, simplifies design process
  ► System wide statistics / information gathering capability for performance optimizations & debugging
• High Performance
  ► Kernel support for message passing and thread synchronization
[Diagram: two CPUs with caches on a system interconnect with I/O and a memory controller; one memory pool and one OS running applications A1 to A5]
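Note: the slide does not name the call used to bind a thread to a core. On QNX Neutrino, the documented mechanism for restricting the calling thread to a set of CPUs is ThreadCtl() with the _NTO_TCTL_RUNMASK command, where bit n of the mask allows CPU n. The sketch below assumes that is the mechanism meant here. The helper names runmask_for_cpu and bind_self_to_cpu are hypothetical, and the QNX call is compiled only when building for Neutrino.

```c
#include <stdint.h>
#ifdef __QNXNTO__
#include <sys/neutrino.h>   /* ThreadCtl(), _NTO_TCTL_RUNMASK */
#endif

/*
 * Build a runmask that allows exactly one CPU.
 * Bit n set means "may run on CPU n". Returns 0 if cpu does not fit
 * in a 32-bit mask.
 */
uint32_t runmask_for_cpu(unsigned cpu)
{
    return cpu < 32 ? (uint32_t)1 << cpu : 0;
}

/*
 * Bind the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure or when not built for QNX Neutrino
 * (this sketch does not attempt an equivalent on other systems).
 */
int bind_self_to_cpu(unsigned cpu)
{
    uint32_t mask = runmask_for_cpu(cpu);
    if (mask == 0)
        return -1;
#ifdef __QNXNTO__
    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)(uintptr_t)mask) == -1)
        return -1;
    return 0;
#else
    return -1;
#endif
}
```

A legacy application that was never designed for parallel execution could call bind_self_to_cpu(0) early in its start-up, while multi-core aware applications on the same system are left unbound and scheduled across all cores.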

15 Active Threads and Ready Queues: BMP
• User controls which CPU will run a process's threads. All threads in a process are tied to one CPU.
• Available CPU runs highest-priority CPU-designated thread
[Diagram: one process designated to CPU 0 and one designated to CPU 1; their threads wait in per-priority ready queues (priority 255, 254, 253 ... 0); the scheduler dispatches from the queues to CPU 0 (available) and CPU 1]
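Note: the rule on this slide, "available CPU runs highest-priority CPU-designated thread", can be modelled in a few lines. The following is a toy model for illustration only, not the Neutrino scheduler: it scans a flat array instead of per-priority queues, and the names sim_thread and pick_next are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy description of a thread as the scheduler sees it. */
struct sim_thread {
    int priority;       /* 0..255, higher value runs first */
    int ready;          /* nonzero if the thread is on a ready queue */
    uint32_t runmask;   /* bit n set: thread may run on CPU n */
};

/*
 * Pick the thread an available CPU should run next: the highest-priority
 * ready thread whose runmask includes that CPU.
 * Returns the index into t, or -1 if no thread is eligible.
 * Ties go to the lowest index.
 */
int pick_next(const struct sim_thread *t, size_t n, unsigned cpu)
{
    if (cpu >= 32)
        return -1;

    uint32_t cpu_bit = (uint32_t)1 << cpu;
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!t[i].ready || !(t[i].runmask & cpu_bit))
            continue;                       /* blocked, or bound elsewhere */
        if (best < 0 || t[i].priority > t[best].priority)
            best = (int)i;
    }
    return best;
}
```

In this model a thread with runmask 0x3 behaves as in plain SMP on a dual-core system, while a thread with runmask 0x1 or 0x2 is bound, so a higher-priority thread bound to CPU 1 never displaces work on CPU 0.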

16 Multiprocessing Summary
[Table comparing the Symmetric, Bound, and Asymmetric models by design consideration; the check marks in the cells are not preserved]
Design considerations compared:
• Seamless resource sharing
• Scalable beyond dual core
• Legacy application operation (one cell marked "?")
• Mixed OS environment
• Dedicated processor by function
• Inter-core messaging: Symmetric: Fast (OS primitives); Bound: Fast (OS primitives); Asymmetric: Slower (Application)
• Thread synchronization between cores
• Load balancing
• System wide debug and optimization

17 The Transition to Multi-core The Role of Tools

18
• The right toolset eases the transition to multi-core processors
• Assess current software when moving to multi-core
  ► Should processes be separated between cores?
    - Determine how closely coupled the current processes are
  ► Where can concurrent processing help?
    - Show the current processing bottlenecks
• Debugging in a multi-core environment
  ► Characterize and debug interaction between threads on multiple CPUs
• Tuning and Optimization in a multi-core environment
  ► Move processes and threads between cores
  ► Examine processing bottlenecks
  ► Examine inter-process communications

19 Instrumented Kernel
The instrumented kernel logs events, which are filtered and stored into buffers, which are then captured and analyzed.
• Events logged by the instrumented microkernel:
  ► State changes
  ► Interrupts
  ► Process/thread creation
  ► System calls
• Filters:
  ► On/Off filters
  ► Static event filters
  ► User defined filters
• Event buffers (E1 to E6) are captured to a file or over the network and analyzed in the System Profiler

20 Thread / Process Coupling: QNX Momentics System Profiler Determine amount of messaging between processes.

21 Load Balancing: QNX Momentics System Profiler Measure CPU activity for all cores to determine optimal load balancing

22 Intel® C++ Compiler 8.1 for QNX Neutrino® RTOS
• Compiler based on classic Intel® C++ Compilers for desktop/server markets**
  ► Leverages mature Intel compiler technology
  ► Leads industry in supporting Intel Architecture's performance features and *T technologies
• Cross-compiler:
  ► From Windows to QNX Neutrino RTOS 6.3.0
• Superior performance (see benchmarks)
• Integrates into QNX Momentics* Development Suite
• GCC C/C++ Object compatibility and interoperability
Download free 30-day evaluation

23 EEMBC* 1.1 (Embedded Microprocessor Benchmark Consortium*): Intel® Pentium® 4 Processor
Configuration Info:
• Compilers: Intel® C++ Compiler 8.1 for QNX Neutrino* RTOS, GCC 3.3.1
• Hardware: Intel® Pentium® 4 Processor, 3.0 GHz, 512 KB L2 Cache, 512 MB Memory
• OS: QNX Neutrino* RTOS 6.3
EEMBC 1.1 scores were not certified by ECL. Out-of-the-box performance was measured. Relative performance was computed by averaging relative performance on Automotive, Consumer, Networking, Office Automation, and Telecomm tests.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference or call (U.S.) 1-800-628-8686 or 1-916-356-3104

24 The Transition to Multi-core Software Architecture and Optimization

25 Optimizing Multi-core Applications
• Reduce contention
  ► Minimize or remove core-core interactions to ensure most parallelism
  ► Ensure no serialization between competing tasks due to resource contention
• Scale to number of available processors
• Use system analysis tools to tune performance
• Asymmetric operation
  ► Properly partition to produce desired CPU loading for each core
• Symmetric operation
  ► Asymmetric application operation
  ► Thread affinity
  ► Bound Multiprocessing for dedicated CPU allocation
• Select proper thread / process priorities to optimize real-time performance / CPU allocation

26 Example: Layer 3 Forwarding Optimization
• Original implementation
  ► Lock contention and cache misses in forwarding table
  ► Serializes Rx / Tx operations
• Optimized implementation
  ► No lock contention for FW table
    - One table per CPU
  ► Minimizes cache contention and snoop traffic
[Diagram: original has driver threads on CPU0 and CPU1 sharing a single forwarding table; optimized gives the driver thread on each CPU its own forward table]
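Note: the "one table per CPU" change can be sketched as follows. This is an illustration of the technique under stated assumptions, not the code from the presentation: the table is a simple direct-mapped array, routes are assumed to be installed before the driver threads start forwarding (so lookups need no lock), and the names, sizes, and 64-byte cache line are all made up for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CPUS    2     /* assumed: dual core, as in the slide's diagram */
#define TABLE_SLOTS 256   /* assumed table size */
#define CACHE_LINE  64    /* assumed cache line size in bytes */

struct route {
    uint32_t dst;         /* destination address */
    int out_port;         /* port to forward to */
    int used;             /* nonzero if this slot holds a route */
};

/* One private table per CPU, aligned so that tables never share a line. */
struct fwd_table {
    _Alignas(CACHE_LINE) struct route slots[TABLE_SLOTS];
};

static struct fwd_table tables[NUM_CPUS];

static size_t slot_for(uint32_t dst)
{
    return dst % TABLE_SLOTS;   /* direct-mapped: a collision overwrites */
}

/* Control path: install a route into every per-CPU copy. */
void fwd_add(uint32_t dst, int out_port)
{
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        struct route *r = &tables[cpu].slots[slot_for(dst)];
        r->dst = dst;
        r->out_port = out_port;
        r->used = 1;
    }
}

/*
 * Fast path: the driver thread on `cpu` consults only its own table,
 * so it takes no lock and touches no cache lines written by other CPUs.
 * Returns the output port, or -1 if there is no route.
 */
int fwd_lookup(int cpu, uint32_t dst)
{
    if (cpu < 0 || cpu >= NUM_CPUS)
        return -1;
    const struct route *r = &tables[cpu].slots[slot_for(dst)];
    return (r->used && r->dst == dst) ? r->out_port : -1;
}
```

The cost of this design is memory (one copy per CPU) and a slower update path; a real implementation would also need a safe way to change routes while traffic is flowing, which this sketch leaves out.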

27 Instrumented Kernel Profile
• 10% increase in small packet performance
[Screenshots: System Profiler traces of the unoptimized and optimized implementations, with lock contention highlighted]

28 Summary: QNX Momentics® Multi-Core Edition
• The QNX Momentics Multi-Core Edition provides the industry's only comprehensive software foundation that addresses the imminent transition to multi-core silicon
• The QNX Momentics Multi-Core Edition
  ► Rapidly move current uni-processor based applications to any multi-processing architecture, decreasing overall time to market
  ► Quickly build reliable, high performance products that leverage latest generation multi-core processors
  ► Future proof your designs to scale beyond dual-core to multi-core silicon and beyond to highly distributed systems
  ► Focus on product differentiation and product delivery rather than plumbing
  ► Supports all multi-processing models: AMP, SMP or BMP
