Designing and Optimizing Software for Intel® Architecture Multi-core Processors Peter van der Veen QNX Software Systems.

Presentation transcript:

Designing and Optimizing Software for Intel® Architecture Multi-core Processors Peter van der Veen QNX Software Systems

Overview
• Software and system vendors continue to add features and capabilities that demand more and more CPU performance
• Microprocessor vendors can no longer scale performance simply by increasing clock speed
  ► Thermal considerations
  ► Design complexity
• Trend to include multiple processor cores on a single die
• Multi-core designs address performance issues
  ► Favorable power / performance ratio for embedded systems
  ► Decreased board area
• Companies that can leverage the full capabilities of hardware can achieve a competitive advantage
[Diagram: CPUs linked by bridges]

Multi-core Architectures
• Increased integration on die
  ► Multiple CPU cores and caches
  ► High speed, on-chip system interconnect
      • Greatly reduces latency associated with a traditional board-level interconnect
• Memory controller(s) on system bus
  ► Allows separation of memory for asymmetric operation
• On-chip peripherals on system bus
  ► Maximizes peripheral throughput
  ► Reduces latency
[Diagram: single die containing CPUs, caches, system interconnect, memory controller, and I/O]

Intel Evolution of Parallelism
Legend:
• AS: Architectural State (registers, flags, timestamp counter, etc.)
• APIC: Advanced Programmable Interrupt Controller
• PER: Processor Execution Resources (caches, execution units, instruction decode, bus interface, etc.)
Diagram panels:
• Classic Uniprocessor: one die with one AS, one APIC, and one PER
• Hyper-Threading Technology* (HT Technology): one die with two AS + APIC pairs and one PER
• Multi-core: one die with two AS + APIC pairs, two PERs, and L2 Cache & Bus Interface
• Classic SMP: chipset with two processors, each with its own AS, APIC, and PER
• Symmetrical Multi-Processing (SMP) with Multi-core: chipset with two dies, each with two AS + APIC pairs, two PERs, and a bus interface
All of these forms of parallelism are in use today.
* Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Processor supporting HT Technology and an HT Technology enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See for more information including details on which processors support HT Technology.

QNX and Multi-core
• QNX has done the heavy lifting to enable migration to multi-core
  ► Let developers focus on product differentiation
• Reliable, proven support for multi-core applications
  ► 1997: Industry's first to bring SMP to embedded
  ► 1984: High performance, transparent distributed messaging
  ► Full support for asymmetric and symmetric multiprocessing
  ► Linux, VxWorks interoperability
• Migrate existing software base and enable new multi-core optimized applications
• Multi-core capable tool suite
• World class professional services and expert training
• Active role in developing standards through Multi-core Exchange consortium
  ► Enable portability of applications across various platforms
  ► Derive common set of APIs that multi-core development tools can utilize to support interoperability

Microkernel Architecture
• Microkernel is the only trusted component
• Applications and drivers
  ► Are processes that plug into a message bus
  ► Reside in memory-protected address space
  ► Cannot corrupt other software components
  ► Can be started, stopped, and upgraded on the fly
[Diagram: file system, process manager, protocol stack, audio driver, graphics driver, and applications attached to the microkernel through a message bus]

Multiprocessing Models
Asymmetric
• Two cores, two OSs
• Same (homogeneous) or different (heterogeneous) OS
Symmetric
• Two cores, one OS
[Diagram: asymmetric, with OS 1 and OS 2 each on its own CPU; symmetric, with a single OS instance across both CPUs]

Asymmetric Processing
• Asymmetric Model Pros:
  ► Only possible mode when different OSs are running
  ► CPU core can be dedicated to specific applications
  ► One possible mode for applications that cannot operate with parallel processing
• Asymmetric Model Cons:
  ► Resource sharing / arbitration needs to be designed into system by developers
      • Neither OS "owns" the whole system
      • Memory, I/O, interrupts are shared
      • Evolution: complexity increases as cores are added
      • Static configuration, difficult to add dynamic resourcing
      • Time to market?
      • Any HW contention must be dealt with by designer
  ► Synchronization between cores done through application level messages
      • Sub-optimal performance
      • Complexity of the problem is not linear
      • Addition of cores may require re-architecting application to increase performance
Managing shared resources complicates design.
[Diagram: two CPUs with caches on a system interconnect, sharing I/O and a memory controller; memory is split into OS 1 memory, OS 2 memory, and shared memory; applications run on OS 1 and OS 2]

Homogeneous AMP: Transparent Distributed Processing
• Extends message passing bus over a transport layer
• Applications / services can be built in a fully distributed manner without special code
  ► Message queues
  ► File systems
  ► Hardware ports
• Seamless sharing of I/O resources between cores (e.g. use a serial port "owned" by another core)
Local access:
    fd = open("/dev/ffs1",…); write(fd, …);
Access to a resource on core 0 from another core:
    fd = open("/net/core0/dev/ffs1",…); write(fd, …);
[Diagram: Neutrino microkernels on Core 0 (message queues, networking stack, flash file system, application; connected to the Internet) and Core 1 (flash file system, database, application), each with its own message-passing bus, joined by a message bridge (Ethernet, RapidIO, shared memory)]
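The two open() calls on this slide differ only in the path. A minimal sketch of that idea in POSIX C follows; the helper name write_resource is hypothetical and not part of any QNX API, and the claim that a /net/... path reaches another core rests on the slide, not on this code. The helper itself only uses standard open(), write(), and close().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to the resource named by path.
 * Returns the number of bytes written, or -1 on error (errno is set).
 * Only the path selects the target, so the caller's code is the same
 * for a local name and for a name that resolves to another node.
 * (Hypothetical helper for illustration.)
 */
ssize_t write_resource(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    const char *p = buf;
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;            /* interrupted by a signal: retry */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        if (n == 0)
            break;                   /* no progress: stop, report a short write */
        done += (size_t)n;
    }

    if (close(fd) == -1)
        return -1;
    return (ssize_t)done;
}
```

Under the slide's naming, a caller would pass "/dev/ffs1" for the local flash file system and "/net/core0/dev/ffs1" for the one owned by core 0, with no other change to the calling code.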

Symmetric Processing
• Symmetric Model Pros:
  ► Highly scalable. Supports multiple processing cores seamlessly without code modification
  ► One OS "sees all" and handles all resource sharing / arbitration issues
  ► Dynamic load balancing handles processing bursts with OS thread scheduling
  ► Dynamic memory allocation: all cores can draw on full pool of available memory without penalty
  ► High performance inter-core messaging synchronization
      • Core-to-core synchronization using OS primitives
  ► System wide statistics / information gathering capability for performance optimizations, debugging, etc.
• Symmetric Model Cons:
  ► Load balancing is dynamic and application may require dedicated CPU
  ► Applications with poor synchronization among threads may not work properly
      • Difficult to change software
      • 3rd-party software
[Diagram: two CPUs with caches on a system interconnect, sharing I/O, a memory controller, and memory under one OS running the applications]

Multi-core Scaling Software
• QNX conforms to POSIX (Portable Operating System Interface) Application Programming Interface
  ► Allows straightforward porting of code from one OS to another that is also conformant
• Application broken down into memory protected units called processes
• Processes further divided into internal, schedulable units called threads
  ► Threads share all of the same resources (memory space included)
• PROCESSES run on individual cores concurrently in asymmetric mode (all threads for a process are tied to one core)
• THREADS run on individual cores concurrently in symmetric operation
[Diagram: an application made up of processes, each containing threads]
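As an illustration of threads sharing one process's memory, here is a minimal POSIX threads sketch. The names shared_counter and run_workers are hypothetical. Every thread updates the same counter in the process's address space, so a mutex serializes the updates; on an SMP system the threads may run on different cores at the same time.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* State shared by all threads of the process (hypothetical example type). */
struct shared_counter {
    pthread_mutex_t lock;
    long value;
    long increments_per_thread;
};

/* Thread body: bump the shared counter, taking the lock for each update. */
static void *worker(void *arg)
{
    struct shared_counter *c = arg;
    for (long i = 0; i < c->increments_per_thread; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/*
 * Start nthreads workers, wait for them, and return the final count.
 * Returns -1 if a mutex or thread could not be created.
 */
long run_workers(int nthreads, long increments_per_thread)
{
    assert(nthreads > 0 && increments_per_thread >= 0);

    struct shared_counter c = { .value = 0,
                                .increments_per_thread = increments_per_thread };
    if (pthread_mutex_init(&c.lock, NULL) != 0)
        return -1;

    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    if (tids == NULL) {
        pthread_mutex_destroy(&c.lock);
        return -1;
    }

    int started = 0;
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, &c) == 0)
        started++;

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    pthread_mutex_destroy(&c.lock);
    return started == nthreads ? c.value : -1;
}
```

Calling run_workers(4, 10000) should yield 40000 regardless of how the threads are spread over cores, because the lock keeps each increment atomic with respect to the others.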

Active Threads and Ready Queues: SMP
[Diagram labels: processes and their threads; blocked states; ready queues ordered by priority (255); running threads on CPU 0 and CPU 1]

AMP or SMP?
• Sometimes this can be a clear cut decision
  ► Two operating systems = AMP
  ► Application requires all available CPUs to maximize performance = SMP
• What if the versatility of SMP is desired but the control of AMP is needed?

QNX Bound Multiprocessing: The Best of Both Worlds
• Benefits of both AMP and SMP
• Support legacy code base and multi-core optimized applications simultaneously
  ► Supports bound and symmetric operation, selectable by process / thread
• Designer has full control over applications
  ► Applications and/or threads can be "bound" to a specific core
  ► Restrictive CPU usage as decided by designer
• Load balancing
  ► OS dynamic or designer controlled
  ► Tools to optimize load balancing
  ► Resource sharing handled by OS
• Single OS has full visibility and control
  ► Resource sharing handled by OS, simplifies design process
  ► System wide statistics / information gathering capability for performance optimizations & debugging
• High Performance
  ► Kernel support for message passing and thread synchronization
[Diagram: two CPUs with caches on a system interconnect, sharing I/O, a memory controller, and memory under one OS running applications A1 to A5]

Active Threads and Ready Queues: BMP
• User controls which CPU will run a process's threads.
• All threads in a process are tied to one CPU.
• Scheduler: an available CPU runs the highest-priority CPU-designated thread.
[Diagram labels: one process designated to CPU 0 and one to CPU 1; ready queues ordered by priority (255); a running thread; CPU 0 (available); CPU 1]
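The selection rule on this slide (an available CPU runs the highest-priority thread designated for it) can be modeled in a few lines of C. This is a toy model under stated assumptions, not the QNX scheduler: the sim_thread type, the runmask field, and pick_next are hypothetical names, and a linear scan stands in for the per-priority ready queues.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a schedulable thread (hypothetical, for illustration only). */
struct sim_thread {
    int      priority;  /* larger value = more urgent */
    unsigned runmask;   /* bit n set => thread may run on CPU n */
    int      ready;     /* nonzero if ready to run, zero if blocked */
};

/* Mask for a thread that may run on any CPU (plain SMP behavior). */
#define RUNMASK_ANY (~0u)

/*
 * Return the index of the highest-priority ready thread that is allowed
 * to run on the given CPU, or -1 if no such thread exists.
 * Ties go to the thread that appears first in the array.
 */
int pick_next(const struct sim_thread *threads, size_t n, unsigned cpu)
{
    assert(threads != NULL || n == 0);
    assert(cpu < 8 * sizeof(unsigned));

    int best = -1;
    for (size_t i = 0; i < n; i++) {
        const struct sim_thread *t = &threads[i];
        if (!t->ready || !(t->runmask & (1u << cpu)))
            continue;                       /* blocked, or bound elsewhere */
        if (best == -1 || t->priority > threads[best].priority)
            best = (int)i;
    }
    return best;
}
```

In this model, giving every thread of a process the same single-bit runmask corresponds to the bound case on the slide, while RUNMASK_ANY corresponds to symmetric operation, where any available CPU may pick the thread.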

Multiprocessing Summary
Design considerations compared across the Symmetric, Bound, and Asymmetric models:
• Seamless resource sharing
• Scalable beyond dual core
• Legacy application operation ?
• Mixed OS environment
• Dedicated processor by function
• Inter-core messaging: Symmetric, fast (OS primitives); Bound, fast (OS primitives); Asymmetric, slower (application)
• Thread synchronization between cores
• Load balancing
• System wide debug and optimization

The Transition to Multi-core: The Role of Tools

• The right toolset eases the transition to multi-core processors
• Assess current software when moving to multi-core
  ► Should processes be separated between cores?
      • Determine how closely coupled the current processes are
  ► Where can concurrent processing help?
      • Show the current processing bottlenecks
• Debugging in a multi-core environment
  ► Characterize and debug interaction between threads on multiple CPUs
• Tuning and optimization in a multi-core environment
  ► Move processes and threads between cores
  ► Examine processing bottlenecks
  ► Examine inter-process communications

Instrumented Kernel
The instrumented kernel logs events, which are filtered and stored in buffers that are then captured and analyzed.
• Events logged by the microkernel: state changes, interrupts, process/thread creation, system calls
• Filters: on/off filters, static event filters, user defined filters
• Event buffers (E1 to E6) are captured to a file or over the network and analyzed in the System Profiler
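The log, filter, and buffer pipeline described here can be sketched as a bitmask filter in front of a fixed-size event buffer. Everything in this sketch is hypothetical (the event classes, trace_buffer, trace_init, trace_log); it models only the on/off filtering step and does not reflect the actual QNX instrumented kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Event classes named after the slide; the bit values are arbitrary. */
enum event_class {
    EV_STATE_CHANGE  = 1u << 0,
    EV_INTERRUPT     = 1u << 1,
    EV_THREAD_CREATE = 1u << 2,
    EV_SYSCALL       = 1u << 3
};

struct trace_event {
    uint32_t cls;   /* one of enum event_class */
    uint32_t data;  /* event-specific payload */
};

#define TRACE_CAPACITY 64

struct trace_buffer {
    uint32_t enabled;   /* on/off filter: classes that get stored */
    size_t   count;     /* events currently stored */
    size_t   dropped;   /* enabled events lost because the buffer was full */
    struct trace_event events[TRACE_CAPACITY];
};

/* Reset the buffer and choose which event classes pass the filter. */
void trace_init(struct trace_buffer *tb, uint32_t enabled)
{
    assert(tb != NULL);
    tb->enabled = enabled;
    tb->count = 0;
    tb->dropped = 0;
}

/*
 * Offer one event to the buffer.
 * Returns 1 if it was stored, 0 if it was filtered out or the buffer is full.
 */
int trace_log(struct trace_buffer *tb, uint32_t cls, uint32_t data)
{
    assert(tb != NULL);
    if (!(tb->enabled & cls))
        return 0;                    /* class switched off by the filter */
    if (tb->count == TRACE_CAPACITY) {
        tb->dropped++;               /* a real tracer would flush or wrap */
        return 0;
    }
    tb->events[tb->count].cls = cls;
    tb->events[tb->count].data = data;
    tb->count++;
    return 1;
}
```

Filtering before storing keeps disabled event classes from consuming buffer space, which is the point of placing the filters ahead of the event buffers in the slide's pipeline.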

Thread / Process Coupling: QNX Momentics System Profiler
Determine amount of messaging between processes.

Load Balancing: QNX Momentics System Profiler
Measure CPU activity for all cores to determine optimal load balancing.

Intel® C++ Compiler 8.1 for QNX Neutrino® RTOS
• Compiler based on classic Intel® C++ Compilers for desktop/server markets**
  ► Leverages mature Intel compiler technology
  ► Leads industry in supporting Intel Architecture's performance features and *T technologies
• Cross-compiler:
  ► From Windows to QNX Neutrino RTOS
• Superior performance (see benchmarks)
• Integrates into QNX Momentics* Development Suite
• GCC C/C++ object compatibility and interoperability
Download free 30-day evaluation

EEMBC* 1.1, Intel® Pentium® 4 Processor (Embedded Microprocessor Benchmark Consortium*)
Configuration info:
• Intel® C++ Compiler 8.1 for QNX Neutrino* RTOS, GCC
• Intel® Pentium® 4 Processor, 3.0 GHz, 512 KB L2 Cache, 512MB Memory
• QNX Neutrino* RTOS 6.3
EEMBC 1.1 scores were not certified by ECL. Out-of-the-box performance was measured. Relative performance was computed by averaging relative performance on Automotive, Consumer, Networking, Office Automation, and Telecomm tests.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference or call (U.S.) or

The Transition to Multi-core: Software Architecture and Optimization

Optimizing Multi-core Applications
• Reduce contention
  ► Minimize or remove core-core interactions to ensure most parallelism
  ► Ensure no serialization between competing tasks due to resource contention
• Scale to number of available processors
• Use system analysis tools to tune performance
• Asymmetric operation
  ► Properly partition to produce desired CPU loading for each core
• Symmetric operation
  ► Asymmetric application operation
  ► Thread affinity
  ► Bound Multiprocessing for dedicated CPU allocation
• Select proper thread / process priorities to optimize real-time performance / CPU allocation
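One way to "scale to number of available processors" is to size a worker pool from the processor count at run time instead of hard-coding it. The sketch below uses sysconf(_SC_NPROCESSORS_ONLN), a widely available extension that is not part of POSIX; whether a particular QNX release provides it is an assumption here, and worker_count is a hypothetical helper name.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Suggest how many worker threads to start: one per online processor,
 * capped at max_workers, and never fewer than one.
 * _SC_NPROCESSORS_ONLN is a common extension, not a POSIX requirement.
 */
long worker_count(long max_workers)
{
    assert(max_workers >= 1);

    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1)
        n = 1;                       /* query failed: fall back to one worker */
    return n < max_workers ? n : max_workers;
}
```

The cap lets the caller keep a core free for other work or bound memory use, while the same binary still picks up additional cores on larger parts.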

Example: Layer 3 Forwarding Optimization
Original implementation (single forwarding table)
• Lock contention and cache misses in forwarding table
• Serializes Rx / Tx operations
Optimized implementation (one forwarding table per CPU)
• No lock contention for FW table
  ► One table per CPU
• Minimizes cache contention and snoop traffic
[Diagram: before, driver threads on CPU0 and CPU1 share a single forwarding table; after, each driver thread uses the forwarding table of its own CPU]

Instrumented Kernel Profile
• 10% increase in small packet performance
[Screenshots: unoptimized and optimized kernel profiles; callout: lock contention]
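The per-CPU forwarding tables from the Layer 3 example can be sketched as follows. This is a simplified illustration, not the presenter's implementation: the table is indexed by the top octet of the destination address instead of doing longest-prefix matching, and NUM_CPUS, the 64-byte alignment, and all function names are assumptions. Each CPU reads only its own copy, so the lookup path takes no lock.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CPUS   2     /* assumed dual-core target */
#define TABLE_SIZE 256   /* one entry per value of the top address octet */

/*
 * One private copy of the forwarding table per CPU. The 64-byte alignment
 * (an assumed cache-line size) keeps the copies on separate cache lines.
 */
struct fwd_table {
    _Alignas(64) uint16_t out_port[TABLE_SIZE];   /* 0 = no route */
};

static struct fwd_table per_cpu_table[NUM_CPUS];

/*
 * Install a route in every CPU's copy. In this sketch updates are expected
 * to happen while no lookups are running (for example at startup); updating
 * live tables would need additional synchronization.
 */
void fwd_install(uint8_t prefix, uint16_t port)
{
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        per_cpu_table[cpu].out_port[prefix] = port;
}

/* Lock-free lookup: each driver thread reads only its own CPU's table. */
uint16_t fwd_lookup(int cpu, uint32_t dst_addr)
{
    assert(cpu >= 0 && cpu < NUM_CPUS);
    return per_cpu_table[cpu].out_port[dst_addr >> 24];
}
```

The trade-off is memory and update cost (every route is stored once per CPU) in exchange for a lookup path with no shared lock and no cache lines bouncing between cores.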

Summary: QNX Momentics® Multi-Core Edition
• The QNX Momentics Multi-Core Edition provides the industry's only comprehensive software foundation that addresses the imminent transition to multi-core silicon
• The QNX Momentics Multi-Core Edition
  ► Rapidly move current uni-processor based applications to any multi-processing architecture, decreasing overall time to market
  ► Quickly build reliable, high performance products that leverage latest generation multi-core processors
  ► Future proof your designs to scale beyond dual-core to multi-core silicon and beyond to highly distributed systems
  ► Focus on product differentiation and product delivery rather than plumbing
  ► Supports all multi-processing models: AMP, SMP or BMP