Performance Monitoring Update Daniele Francesco Kruse August 2010.

Summary
- Refinement of the Nehalem analysis methodology following David Levinthal’s recommendations
- Added a CSV export feature (./pfm_analysis results/ --csv) for spreadsheet programs (e.g. MS Excel)
- Symbol-level detail accessible directly from the modular analysis page (links for each module)
- Problems and future work
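As an illustration of what the CSV export makes possible outside a spreadsheet, the following is a minimal Python sketch that loads an exported file and ranks modules by one column. The slides do not show the layout of the exported CSV, so the column names (module, total_cycles, cpi), the sample rows, and the helper names load_rows and top_by are all hypothetical.

```python
import csv
import io


def load_rows(csv_file):
    """Read a CSV file object that has a header row into a list of dicts."""
    return list(csv.DictReader(csv_file))


def top_by(rows, column, n=3):
    """Return the n rows with the largest numeric value in the given column."""
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)[:n]


# Hypothetical sample data: the real export layout is not given in the slides.
sample = io.StringIO(
    "module,total_cycles,cpi\n"
    "ModuleA,1200,1.10\n"
    "ModuleB,4500,1.85\n"
    "ModuleC,300,0.95\n"
)

rows = load_rows(sample)
hottest = top_by(rows, "total_cycles", n=2)
names = [row["module"] for row in hottest]
print(names)
```

To use it on a real export, replace the in-memory sample with open("some_file.csv", newline="") and adjust the column names to match the actual header.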

New analysis methodology for Nehalem
BASIC STATS: Total Cycles, Instructions Retired, CPI
IMPROVEMENT OPPORTUNITY: iMargin, iFactor
BASIC STALL STATS: Stalled Cycles, % of Total Cycles, Total Counted Stalled Cycles
INSTRUCTION USEFUL INFO: Instruction Starvation, # of Instructions per Call
FLOATING POINT EXCEPTIONS: % of Total Cycles spent handling FP exceptions
LOAD OPS STALLS: L2 Hit, L3 Unshared Hit, L2 Other Core Hit, L2 Other Core Hit Modified, L3 Miss -> Local DRAM Hit, L3 Miss -> Remote DRAM Hit, L3 Miss -> Remote Cache Hit
DTLB MISSES: L1 DTLB Miss Impact, L1 DTLB Miss % of Load Stalls
DIVISION & SQUARE ROOT STALLS: Cycles spent during DIV & SQRT Ops
L2 IFETCH MISSES: Total L2 IFETCH misses, IFETCHes served by Local DRAM, IFETCHes served by L3 (Modified), IFETCHes served by L3 (Clean Snoop), IFETCHes served by Remote L2, IFETCHes served by Remote DRAM, IFETCHes served by L3 (No Snoop)
BRANCHES, CALLS & RETS: Total Branch Instructions Executed, % of Mispredicted Branches, Direct Near Calls, Indirect Near Calls, Indirect Near Non-Calls, All Near Calls, All Non Calls, All Returns, Conditionals
ITLB MISSES: L1 ITLB Miss Impact, ITLB Miss Rate
INSTRUCTION STATS: Branches, Loads, Stores, Other, Packed UOPS
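A few of the metrics listed above are simple ratios of raw counts. The Python sketch below shows CPI (total cycles divided by instructions retired), stalled cycles as a percentage of total cycles, and the percentage of mispredicted branches. The counter names and values are hypothetical, and this is not the tool's own set of formulas: the more specific metrics (iMargin, iFactor, the load-stall breakdown) are not defined in the slides and are deliberately left out.

```python
def cpi(total_cycles, instructions_retired):
    """Cycles per instruction: total cycles divided by instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return total_cycles / instructions_retired


def percent_of(part, whole):
    """Express part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Hypothetical counter values, chosen only to make the arithmetic easy to follow.
counters = {
    "total_cycles": 2_000_000,
    "instructions_retired": 1_600_000,
    "stalled_cycles": 500_000,
    "branches_executed": 200_000,
    "branches_mispredicted": 5_000,
}

summary = {
    "cpi": cpi(counters["total_cycles"], counters["instructions_retired"]),
    "stalled_pct": percent_of(counters["stalled_cycles"], counters["total_cycles"]),
    "mispredicted_pct": percent_of(
        counters["branches_mispredicted"], counters["branches_executed"]
    ),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these sample values the CPI is 1.25, a quarter of all cycles are stalled, and 2.5% of branches are mispredicted.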

First draft results for Nehalem
Results for a first analysis on CMSSW are available at the following addresses:
The analysis has been carried out on a quad-core, single-socket Nehalem system (Core i7) with the following configurations:
cmsDriver.py recominbias -s RAW2DIGI,RECO -n -1 --filein file:500evt_MinBias_cfi_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_RAW2DIGI_L1Reco.root --eventcontent RECOSIM --conditions auto:mc --no_exec
cmsDriver.py recottbar -s RAW2DIGI,RECO -n -1 --filein file:100evt_TTbar_cfi_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_RAW2DIGI_L1Reco.root --eventcontent RECOSIM --conditions auto:mc --no_exec
cmsDriver.py recoqcd -s RAW2DIGI,RECO -n -1 --filein file:100evt_QCD_Pt_3000_3500_cfi_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_RAW2DIGI_L1Reco.root --eventcontent RECOSIM --conditions auto:mc --no_exec

Problems and future work
- Perfmon2 is not yet compatible with Westmere-based processors
- Events with custom Umasks do not always work correctly with libpfm
- Waiting for David Levinthal’s final validation of the formulas used in the Nehalem analysis
- Deployment for CMSSW as soon as possible
- Deployment for Gaudi & Geant4 (end of August / beginning of September)

Thank you. Questions?