System Level Benchmarking Analysis of the Cortex™-A9 MPCore™ — John Goodacre, Director, Program Management, ARM Processor Division, October 2009; Anirban Lahiri.


System Level Benchmarking Analysis of the Cortex™-A9 MPCore™
John Goodacre, Director, Program Management, ARM Processor Division, October 2009
Anirban Lahiri, Technology Researcher
Adya Shrotriya, Nicolas Zea, Interns
This project at ARM is funded in part by ICT-eMuCo, a European project supported under the Seventh Framework Programme (FP7) for research and technological development.

Agenda
• Look at leading ARM platforms: PBX-A9, V2-A9
• Observe and analyse phenomena related to execution behaviour, especially memory bandwidth and latency
• Understand the benefits offered by ARM MPCore technology
• Strategies for optimization

Cortex-A8: Mainstream Technology
• Fully implements the ARM architecture v7-A instruction set
• Dual instruction issue
• NEON pipeline for executing the Advanced SIMD and VFP instruction sets
• Memory Management Unit (MMU)
• Separate instruction and data Translation Lookaside Buffers (TLBs)

Cortex-A9: Technology Leadership
• Superscalar out-of-order instruction execution
• Register renaming for speculative execution and loop unrolling
• Small loop mode for energy efficiency
• PTM interface for tracing
• Counters for performance monitoring
• Support for multicore configurations

ARM Versatile PBX Cortex-A9 Platform
• Dual-core ARM Cortex-A9, … MHz
• 32KB I & D L1 caches
• 128KB L2 cache
• 1GB RAM, split into two 512MB blocks (DDR1 & DDR2)
• Memory runs at CPU frequency
• File system on CompactFlash
• VGA / DVI out

ARM Versatile2-A9 Platform (V2)
• ARM-NEC Cortex-A9 test chip, ~400MHz
• 4x Cortex-A9
• 4x NEON/FPU
• 32KB individual I & D L1 caches
• 512KB L2 cache
• 1GB RAM (32-bit DDR2)
ARM recently announced a 2GHz dual-core implementation

Software Framework
• Linux kernel
• Debian 5.0 “Lenny” Linux file system, compiled for ARMv4T
• LMBench 3.0-a9, which includes the STREAM benchmarks
• Single instance and multiple instances

Memory Bandwidth – PBX-A9
[Charts: single instance vs. two instances]
• Consider a 2-core platform
• Knees indicate cache sizes (small [128KB] L2 → RAM for PBX-A9)
• Increased effective memory bandwidth for multicore (2 cores): cache bandwidth doubles, DDR2 memory bandwidth doubles
• Agnostic to alignment
Note: prefetching disabled for normalization

Memory Bandwidth – V2-A9
[Charts: single instance vs. four instances]
• Consider a 4-core platform running 4 concurrent benchmark instances (instead of 2), at roughly 4 times the frequency of the PBX-A9
• Bandwidth shows good scalability across the 4 cores
• Increased effective memory bandwidth under higher parallel load
• L1 cache bandwidth becomes 4 times higher
• DDR2 memory bandwidth shows only a doubling
Note: prefetching disabled for normalization

Example: A Misconfigured System!
• Write bandwidth is greatly reduced if the caches are configured as write-through
• Remember to configure caches as write-back, with allocate-on-write

Bandwidth–Latency Relation
• Latency determines the response time for applications on a multicore
• Applications requiring short bursts of memory accesses (e.g. an internet browser on core 0) can run concurrently with bandwidth-heavy applications (e.g. video/image processing on core 1) without any observable degradation, provided latency remains constant

Memory Latency – PBX-A9
• Similar latencies for single (S) and two (M) instances of LMBench running concurrently
• Memory latency almost unaffected by the presence of multiple (2) cores
• A stride of 16 acts like automatic prefetching for 32-byte cache lines
• LMBench tries to use backward striding to ‘defeat’ prefetching
• Cortex-A9 supports prefetching for both forward and backward striding; disabled in these tests for result normalization
• Backward striding is less common in real-life applications
[Chart annotation: small but visible L2]
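The backward-striding trick can be sketched in C (a simplified version of what LMBench's lat_mem_rd does; the sizes and strides here are illustrative):

```c
#include <stdlib.h>

/* Build a backward-stride pointer ring of n elements and chase it for
 * `iters` dependent loads. Each load's address comes from the previous
 * load, so the loop time is pure load-to-use latency, not bandwidth;
 * striding backwards is meant to defeat simple forward prefetchers.
 * Returns the final index (deterministic) so the chase cannot be
 * optimized away; wrap the chase loop with a timer to get ns/load. */
long pointer_chase(size_t n, size_t stride, long iters)
{
    if (stride == 0 || stride >= n) return -1;
    void **ring = malloc(n * sizeof *ring);
    if (!ring) return -1;

    for (size_t i = 0; i < n; i++)        /* ring[i] -> ring[i - stride] */
        ring[i] = &ring[(i + n - stride) % n];

    void **p = &ring[0];
    for (long k = 0; k < iters; k++)
        p = (void **)*p;                  /* the dependent load chain */

    long idx = (long)(p - ring);
    free(ring);
    return idx;
}
```

With `n` swept across cache sizes, the measured ns/load curve shows the L1, L2 and DRAM latency plateaus the slide refers to.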

Memory Latency – V2
[Charts: single instance vs. four instances]
• 4 instances of LMBench running: 4 times the application load
• Memory latency rises only by about 20%
• An application on one CPU is mostly unaffected by execution on the other CPUs, within the limits of bandwidth to the DDR memory

Summary and Recommendations
• Running multiple memory-intensive applications on a single CPU can be detrimental due to cache conflicts
• Spreading memory- and CPU-intensive applications across the multiple cores provides better performance

STREAM Benchmarks – PBX-A9
• Bandwidth almost doubles for multiple (2) instances compared to the execution of a single instance
• The corresponding penalty on latency is marginal
• Good for streaming, data-intensive applications

L2 Latency Configuration
• The PL310 allows configuring the latencies of the L2 cache data & tag RAMs
• Optimization: find the minimal latency values at which the system still works
• The difference in performance can be 2x or more
• Remember that the DDR memory controllers (PL34x) have similar settings
[Chart axis: additional latency (cycles)]
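As a sketch, the tag/data RAM latency registers can be programmed like this. The offsets and field layout follow the L2C-310 TRM as I understand it, and the base address is board-specific and hypothetical here; verify both against your platform's documentation:

```c
#include <stdint.h>

/* PL310 (L2C-310) RAM latency control registers, as offsets from the
 * controller's base address (the base itself is board-specific). */
enum {
    PL310_TAG_RAM_CTRL  = 0x108,
    PL310_DATA_RAM_CTRL = 0x10C,
};

/* Encode setup/read/write latencies (1..8 cycles) into the register
 * layout: setup in bits [2:0], read in [6:4], write in [10:8],
 * each 3-bit field holding (cycles - 1). */
uint32_t pl310_ram_latency(unsigned setup, unsigned read, unsigned write)
{
    return ((setup - 1u) & 7u)
         | (((read  - 1u) & 7u) << 4)
         | (((write - 1u) & 7u) << 8);
}

/* Hypothetical use from early boot code, before the L2 is enabled:
 *   volatile uint32_t *l2 = (uint32_t *)L2_BASE;   // board-specific
 *   l2[PL310_TAG_RAM_CTRL  / 4] = pl310_ram_latency(2, 2, 2);
 *   l2[PL310_DATA_RAM_CTRL / 4] = pl310_ram_latency(2, 3, 2);
 */
```

The optimization loop the slide suggests is then: lower one field at a time, re-run a stress test, and keep the smallest values that remain stable.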

Memory Load Parallelism
• Indicates the number of possible outstanding reads
• Memory system design determines the processor's ability to hide memory latency
• Support for multiple outstanding reads/writes is essential for multicores; fully supported by the PL310 / PL34x
• The L1 supports 4 linefill requests on average, while the implemented DDR2 memory system supports 2
• Systems should support as much memory parallelization as possible
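One way to see how many outstanding misses a memory system actually sustains is to chase several independent pointer chains in lockstep (a sketch; the chain counts and sizes are illustrative, and a real measurement would randomize the rings so the prefetcher cannot hide the misses):

```c
#include <stdlib.h>

#define MAX_CHAINS 8

/* Chase `k` independent backward rings in lockstep. Loads within one
 * chain are serialized, but the k chains do not depend on each other,
 * so the core can keep up to k cache misses in flight. Timing this
 * loop for k = 1, 2, 4, ... shows time-per-load dropping until k
 * exceeds the memory system's outstanding-request limit.
 * Returns the sum of the final indices (a deterministic checksum). */
long chase_parallel(size_t n, unsigned k, long iters)
{
    if (k == 0 || k > MAX_CHAINS || n < 2) return -1;

    void **ring[MAX_CHAINS], **p[MAX_CHAINS];
    for (unsigned j = 0; j < k; j++) {
        ring[j] = malloc(n * sizeof *ring[j]);
        if (!ring[j]) { while (j--) free(ring[j]); return -1; }
        for (size_t i = 0; i < n; i++)    /* ring[i] -> ring[i - 1] */
            ring[j][i] = &ring[j][(i + n - 1) % n];
        p[j] = &ring[j][0];
    }

    for (long s = 0; s < iters; s++)      /* k independent loads per step */
        for (unsigned j = 0; j < k; j++)
            p[j] = (void **)*p[j];

    long sum = 0;
    for (unsigned j = 0; j < k; j++) {
        sum += (long)(p[j] - ring[j]);
        free(ring[j]);
    }
    return sum;
}
```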

Context Switch Time – PBX-A9
• While all the processes fit in the cache, the context switch time remains relatively low
• Beyond this it approaches a saturation level determined by available main-memory bandwidth
• Keeping the number of active processes on a processor low vastly improves the response time
• Response time for an application ≈ number of active processes × context switch time

Context Switching Time – PBX-A9
[Charts: single instance vs. two instances]
• Peak context switch time increases by only a small fraction (< 20%)
• This indicates that context switches on separate processors are almost mutually orthogonal, enabling the MPCore to support more active tasks than a single-core time-sliced processor before the system becomes unresponsive

ARM MPCore: Cache-to-Cache Transfers
• Cache lines can migrate between L1 caches belonging to different cores without involving the L2
• Clean lines: DDI (Direct Data Intervention)
• Dirty lines: ML (Migratory Lines)

Cache-to-Cache Latency
• Significant benefits are achievable if the working set of an application partitioned between the cores can be contained within the sum of their caches
• Helpful for streaming data between cores; may be used in conjunction with inter-core interrupts
• Though dirty lines have higher latency, they still show a ~50% performance benefit

Conclusion
• Illustration of typical system behaviours for multicores
• Explored the potential benefits of multicores
• Insights to avoid common system bottlenecks and pitfalls
• Optimization strategies for system designers and OS architects

Thank you !!!