OPTERON HARDWARE PERFORMANCE COUNTERS Richard Smith Technical Specialist Data Centre Practice Sun Microsystems Australia.

1 OPTERON HARDWARE PERFORMANCE COUNTERS Richard Smith Technical Specialist Data Centre Practice Sun Microsystems Australia

2 Why does my code take xxxxx seconds to execute?
Opteron pipeline: 3 instructions/cycle at 2.6 GHz (cycles per second)
10^15 instr ==> 35 hrs ???
Many factors involved:
● Instruction Level Parallelism (ILP)
● Memory latency
● System bandwidth
● Thread Level Parallelism (TLP)
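
The 35-hour figure on this slide can be reproduced with a back-of-envelope calculation. The sketch below assumes the instruction count on the slide reads as 10^15 and that the pipeline retires its peak of 3 instructions every cycle, which real code rarely achieves; the function name is illustrative only.

```python
# Best-case run time if the pipeline never stalls.
INSTRUCTIONS = 10**15   # instruction count, as read from the slide
PEAK_IPC = 3            # Opteron peak: 3 instructions per cycle
CLOCK_HZ = 2.6e9        # 2.6 GHz


def ideal_runtime_seconds(instructions, ipc, clock_hz):
    """Seconds needed when `ipc` instructions retire on every cycle."""
    return instructions / (ipc * clock_hz)


seconds = ideal_runtime_seconds(INSTRUCTIONS, PEAK_IPC, CLOCK_HZ)
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} h")
```

Anything that lowers the achieved instructions per cycle (the factors listed on the slide) pushes the real run time above this bound.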

3 Opteron Microarchitecture Features
● Deep OOO integer and FP execution
● Fetch/Decode 3 instructions/cycle (max 16 bytes)
● 3-way integer + 3-way address + 3-way FP exec
● 64KB L1 D$ and 1MB L2 D$ on-chip
● x86_64 extensions to x86 (“x64”)
● Integrated DDR1 memory controller
● 3 x 16b HyperTransport interfaces

4 Opteron Performance Counters
PerfEvtSel0/PerfCtr0, PerfEvtSel1/PerfCtr1, PerfEvtSel2/PerfCtr2, PerfEvtSel3/PerfCtr3
48-bit counters: bits 63-48 are reserved
PerfEvtSel fields: CNT_MASK (bits 31-24), INV, EN (bit 22), INT, PC, EDGE, OS, USR (bit 16), UNIT_MASK (bits 15-8), EVENT_MASK (bits 7-0)
Processor Functional Unit: FP, LS, DC, BU, IC, FR, NB
0x004108cb ==> EN, USR, scalar SSE+SSE2 instr, Retired FPU instr
cputrack -c pic2=FR_retired_fpu_instr,umask2=0x08...
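
The example value 0x004108cb can be unpacked field by field. This is a minimal sketch: the slide shows only bit positions 31, 24, 22, 16, 15, 8, 7 and 0, so the positions used here for INV (23), INT (20), PC (19), EDGE (18) and OS (17) are an assumption taken from AMD's published register layout, and the function name is hypothetical.

```python
def decode_perf_evt_sel(value):
    """Split a 32-bit PerfEvtSel value into its named fields."""
    def bit(n):
        return bool((value >> n) & 1)

    return {
        "event_mask": value & 0xFF,         # bits 7-0
        "unit_mask": (value >> 8) & 0xFF,   # bits 15-8
        "usr": bit(16),                     # count in user mode
        "os": bit(17),                      # count in kernel mode (assumed position)
        "edge": bit(18),                    # assumed position
        "pc": bit(19),                      # assumed position
        "int": bit(20),                     # assumed position
        "en": bit(22),                      # counter enable
        "inv": bit(23),                     # assumed position
        "cnt_mask": (value >> 24) & 0xFF,   # bits 31-24
    }


fields = decode_perf_evt_sel(0x004108CB)
print(f"event=0x{fields['event_mask']:02X} umask=0x{fields['unit_mask']:02X} "
      f"EN={fields['en']} USR={fields['usr']} OS={fields['os']}")
```

Event 0xCB with unit mask 0x08 corresponds to the slide's "Retired FPU instr" restricted to scalar SSE+SSE2, with only the EN and USR flags set.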

5 Solaris and HW Counters
#26094 BIOS and Kernel Developer's Guide
> Lists more events than the Solaris kernel knows about
> New counters for processor revisions D and E
Nevada source available at ...
cc -D_KERNEL -xarch=amd64 -xmodel=kernel -c opteron_pcbe.c
ld -r -o pcbe.AuthenticAMD.15 opteron_pcbe.o
Virtualised counter support built-in
> Linux requires perfctr patch

6 Using the Counters
● cputrack(1)
● cpustat(1M)
● libcpc(3LIB)
● collect (Studio 11 collector/analyzer)
● perfctr (Linux)
● PAPI
NB: Some counters are not duplicated on dual-core Opteron (rev E and older)

7 Dual-core Opteron
[Block diagram: CPU0 and CPU1, each with a 1MB L2 Cache; System Request Interface; Crossbar Switch; Memory Controller; HT0, HT1, HT2]

8 HyperTransport
● HT is a scalable point-to-point link (Device A to Device B): 2, 4, 8, 16, or 32 bits @ 200 to 1000MHz
● Coherent HT is used to connect processors
● Non-coherent HT is used for I/O connectivity (PCI semantics map neatly)
● 1xx, 2xx and 8xx cpus differ in number of coherent HT interfaces
● Basic unit of transmission is a Dword (4 bytes)
● 2B per clock edge @ 1000MHz ==> 4GB/s in each direction
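
The peak-bandwidth arithmetic on this slide can be written out as a small sketch. It assumes a 16-bit link (2 bytes per clock edge) and data transferred on both clock edges; the function name is illustrative.

```python
def ht_bandwidth_bytes_per_sec(link_width_bits, clock_hz):
    """Peak one-way bandwidth of an HT link that transfers on both clock edges."""
    bytes_per_edge = link_width_bits / 8
    edges_per_clock = 2
    return bytes_per_edge * edges_per_clock * clock_hz


gb_at_1000mhz = ht_bandwidth_bytes_per_sec(16, 1000e6) / 1e9
gb_at_800mhz = ht_bandwidth_bytes_per_sec(16, 800e6) / 1e9
print(f"16-bit @ 1000 MHz: {gb_at_1000mhz:.1f} GB/s each direction")
print(f"16-bit @  800 MHz: {gb_at_800mhz:.1f} GB/s each direction")
```

The 800 MHz case gives the 3.2 GB/s each way quoted for the V20Z.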

9 HyperTransport on V20Z
[Diagram: cpu 0 and cpu 1, each with HT links 0, 1 and 2, plus I/O; labels 2, 4, 8]
4B Dwords: Command, Data (max 64B payload), Buffer Release (NOP?), NOP
800 MHz DDR full duplex HT link: 800 MHz x 4B/clock = 3.2 GB/s each way = 800M Dword/s
Measured via cpustat(1M)

10 NUMA Architecture (AMD: SUMA)
● Each hop adds 30 – 40ns latency
● Minimising #hops improves performance and reduces system bandwidth consumed

11 Memory Bandwidth Test (per sec)
[Diagram: HT traffic between cpu 1 and cpu 0; "execute" marks the side running the test]
Local Memory: 4891 MB/s
> 76M probes
> 153M Cmd + 0M Data + 76M BufRel
> 76M Cmd + 0M Data + 76M BufRel
Remote Memory: 2402 MB/s
> 37M probes
> 113M Cmd + 1M Data + 150M BufRel
> 150M Cmd + 600M Data + 37M BufRel
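
The remote-memory figures can be cross-checked against the HT link numbers from the earlier slides. This sketch assumes the counts are millions of 4-byte Dwords per second, that the three remote counts with 600M Data belong to one direction of a single link, and that the link carries at most 800M Dword/s as stated for the V20Z; the constant and function names are illustrative.

```python
DWORD_BYTES = 4                # basic HT unit of transmission
LINK_MDWORDS_PER_SEC = 800     # V20Z HT link capacity, one direction


def payload_mb_per_sec(data_mdwords):
    """MB/s of payload carried by `data_mdwords` million data Dwords per second."""
    return data_mdwords * DWORD_BYTES


def link_utilisation(cmd, data, bufrel):
    """Fraction of one link direction used by all counted Dwords (in millions/s)."""
    return (cmd + data + bufrel) / LINK_MDWORDS_PER_SEC


remote_payload = payload_mb_per_sec(600)
remote_util = link_utilisation(cmd=150, data=600, bufrel=37)
print(f"remote payload: {remote_payload} MB/s (slide reports 2402 MB/s)")
print(f"link utilisation: {remote_util:.1%}")
```

Under these assumptions the data Dwords alone account for the measured remote bandwidth, and the link direction carrying them is almost fully occupied.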

12 HT Usage via cpustat
cpustat -c \
  pic0=NB_ht_bus2_bandwidth,umask0=0x01,\
  pic1=NB_ht_bus2_bandwidth,umask1=0x02,\
  pic2=NB_ht_bus2_bandwidth,umask2=0x04,\
  pic3=NB_ht_bus2_bandwidth,umask3=0x08,sys \
  ... -c pic0=NB_probe_result,umask0=0x0f,sys \
  -p 1 1 &
NB: Only one set of HT counters on dual-core cpus

13 Local vs Remote Memory Access
Revision E?
Event 0xE9 CPU/IO Requests to Memory/IO
> umask 0xA8 Local => Local
> umask 0x98 Local => Remote
Doesn't distinguish between reads and writes

14 Opteron Pipeline
[Block diagram. Front end: L1 Instruction Cache 64KB, Branch Prediction, Fetch, Scan/Align, Fastpath, Microcode Engine, µOPs, Instruction Control Unit (72 entries). Integer: Int Decode & Rename, Res, 3 x (AGU + ALU), MULT. Floating point: FP Decode & Rename, 36-entry FP scheduler, FADD, FMUL, FMISC. Memory: 44-entry Load/Store Queue, L1 Data Cache 64KB, L2 Cache, Bus Unit, System Request Queue, Crossbar, Memory Controller, HyperTransport]

15 Pipeline Throughput
76h BU_cpu_clk_unhalted
C0h FR_retired_x86_instr_w_excp_intr
C1h FR_retired_uops
CBh FR_retired_fpu_instr
> x87, MMX, packed and scalar SSE[2]
00h FP_dispatched_fpu_ops
> add, multiply, store,...
01h FP_cycles_no_fpu_ops_retired
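
These counters are typically combined into ratios rather than read on their own. The sketch below derives instructions per cycle and uops per instruction from three of the events above; the counter values are made up for illustration and the function names are hypothetical.

```python
def instructions_per_cycle(retired_instr, unhalted_cycles):
    """IPC from FR_retired_x86_instr_w_excp_intr and BU_cpu_clk_unhalted."""
    return retired_instr / unhalted_cycles


def uops_per_instruction(retired_uops, retired_instr):
    """Average uops per x86 instruction, from FR_retired_uops."""
    return retired_uops / retired_instr


# Hypothetical counts for one second on a 2.6 GHz core.
sample = {
    "BU_cpu_clk_unhalted": 2_600_000_000,
    "FR_retired_x86_instr_w_excp_intr": 1_950_000_000,
    "FR_retired_uops": 2_925_000_000,
}

ipc = instructions_per_cycle(sample["FR_retired_x86_instr_w_excp_intr"],
                             sample["BU_cpu_clk_unhalted"])
upi = uops_per_instruction(sample["FR_retired_uops"],
                           sample["FR_retired_x86_instr_w_excp_intr"])
print(f"IPC = {ipc:.2f} (peak is 3), uops/instr = {upi:.2f}")
```

An IPC well below the peak of 3 is the cue to look at the stall, cache and memory counters.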

16 Understanding Pipeline Stalls
D1h FR_dispatch_stalls
D2h FR_dispatch_stall_branch_abort_to_retire
D5h FR_dispatch_stall_reorder_buffer_full
> maximum of 72 inflight instructions (24 x 3 lanes)
D6h FR_dispatch_stall_resv_stations_full
> ALU and AGU ops 24 entries (8 x 3 schedulers)
D7h FR_dispatch_stall_fpu_full
> 36 FP instructions across 3 schedulers
23h LS_buffer_2_full
> 12 LS1 entries and 32 LS2 entries
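
One way to read the stall counters is as a fraction of unhalted cycles. The sketch below assumes each dispatch-stall event counts cycles, which should be checked against the BIOS and Kernel Developer's Guide; the sample values are invented and the function name is illustrative.

```python
def stall_fractions(unhalted_cycles, stall_cycles):
    """Map each stall event name to its share of unhalted cycles."""
    return {event: cycles / unhalted_cycles
            for event, cycles in stall_cycles.items()}


# Hypothetical counts gathered over the same interval.
sample_cycles = 1_000_000
sample_stalls = {
    "FR_dispatch_stalls": 400_000,
    "FR_dispatch_stall_reorder_buffer_full": 100_000,
    "FR_dispatch_stall_resv_stations_full": 50_000,
    "FR_dispatch_stall_fpu_full": 250_000,
}

fractions = stall_fractions(sample_cycles, sample_stalls)
# Print the largest contributors first.
for event, fraction in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{event:42s} {fraction:6.1%}")
```

Since only four counters can be programmed at once, a full breakdown usually needs several runs with different event sets.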

17 Prefetch Activity
67h BU_data_prefetch
> Prefetch attempts and cancelled prefetches
> Includes HW prefetcher activity
4Bh DC_dispatched_prefetch_instr
> Prefetches are strong: not dropped on DTLB miss
> Load (T0/T1/T2)
> Store (PrefetchW)
> NTA (for low-reuse data, avoids polluting L2)

18 Cache Counters Flow
[Flow diagram (very approximately!) linking L1, L2, the memory controller, memory and HT. Counter labels: dc accesses, dc misses, dc victim, refill from l2, refill from system, l2 fill, l2 miss, prefetch, (hw prefetch), probe, page hit/miss/conflict, nb_sized_command, nb_ht_busx_bandwidth]

19 DDR Memory Access
[Timing diagram: page hit, page miss, page conflict; Trp (precharge delay), Trcd (RAS to CAS delay), Tcl (CAS latency)]
● Opteron generally uses 16B wide memory transfers
● 16B x 400MT/s ==> 6400MB/s
● 200MHz x 2 edges
● bank select ==> row select ==> column select
● latency dependent on hit in memory controller “open page” cache
● How do physical addresses (PA) map to (bank, row, column)?
NB_mem_ctrlr_page_access
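
The peak memory bandwidth quoted on this slide follows directly from the transfer width and the double data rate clock. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def ddr_peak_mb_per_sec(transfer_bytes, clock_mhz, edges_per_clock=2):
    """Peak DDR bandwidth in MB/s: bytes per transfer times transfers per second."""
    mega_transfers_per_sec = clock_mhz * edges_per_clock   # 200 MHz x 2 = 400 MT/s
    return transfer_bytes * mega_transfers_per_sec


peak = ddr_peak_mb_per_sec(transfer_bytes=16, clock_mhz=200)
print(f"{peak} MB/s")
```

This is a ceiling only; page misses and page conflicts add the Trp and Trcd delays and reduce the bandwidth actually achieved.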

20 Controlling Memory Locality
lgroups on Solaris 10
> future: hierarchy of memory locality
processor_bind() and madvise()
> keep a thread and its data together
ppgsz -oheap=2m for large pages
> fewer DTLB misses: 32 x 2m vs 512 x 4k
meminfo() and/or pmap -sx
> currently kernel cage issues may lead to memory fragmentation so large pages not always available
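
The "32 x 2m vs 512 x 4k" comparison is about how much memory the DTLB can map without missing. The sketch below reads those figures as entry count times page size, which is an interpretation of the slide rather than something it states outright; the function name is illustrative.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_size_bytes


large_page_reach = tlb_reach_bytes(32, 2 * MB)
small_page_reach = tlb_reach_bytes(512, 4 * KB)
print(f"2m pages: {large_page_reach // MB} MB reach")
print(f"4k pages: {small_page_reach // MB} MB reach")
print(f"ratio: {large_page_reach // small_page_reach}x")
```

On this reading, large pages cover 32 times more of the heap with fewer entries, which is why they reduce DTLB misses for large working sets.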

