Exploring P4 Trace Cache Features Ed Carpenter Marsha Robinson Jana Wooten.

1 Exploring P4 Trace Cache Features. Ed Carpenter, Marsha Robinson, Jana Wooten

2 Problem Statement: Explore characteristics of the P4 trace cache using microbenchmarks and performance counters related to branching and the trace cache.

3 Approach: Determine characteristics of the Pentium 4 processor that will help us evaluate the P4’s trace cache. Using a performance monitoring tool (Intel’s VTune Performance Analyzer), measure the data we need and analyze it to find limitations of the trace cache.

4 Some P4 Characteristics: Like most high-performance processors, the P4 has special on-chip hardware for performance monitoring. This hardware typically includes event detectors and counters; qualification of event detection and counting by privilege mode and event characteristics; and support for event-based sampling.

5 P4 Characteristics cont.: Common problems faced by modern processors: a small number of counters, an inability to distinguish between speculative and non-speculative events, and imprecise event-based sampling. With 42 million transistors (compared to 28 million in the P3), the P4 has overcome these problems: it has 48 event detectors and 18 event counters, provides instruction tagging to enable counting of non-speculative performance events, and supports both imprecise event-based sampling (IEBS) and precise event-based sampling (PEBS).

6 Trace Cache: A special instruction cache for capturing long dynamic instruction sequences. Each line stores a snapshot, or trace, of the dynamic instruction stream. The P4 executes from the trace cache when there is an L1 cache hit (which is over 90% of the time).

7 Characteristics of Trace Cache: Stores instructions after they have already been decoded into μops (“micro-ops”), which are RISC-style instructions. Cache line size: 6 μops. Trace cache size: 12K μops. Branch prediction hardware is used, so the trace cache knows about any branch and can fetch the instructions that follow it. Conditional branches can cause problems: a wrong prediction is not known until the branch condition is checked in ALU0.
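The two sizes on this slide imply a line count. The following is a minimal sketch in C, assuming that “12K” means 12 × 1024 μops (the slide does not say) and that the capacity divides evenly into lines; the function name is illustrative.

```c
#include <stddef.h>

/* Number of trace cache lines implied by a total capacity and a line size,
 * both measured in uops. Returns 0 if the line size is 0.
 * Assumption: the capacity is an exact multiple of the line size. */
size_t trace_cache_lines(size_t capacity_uops, size_t uops_per_line)
{
    if (uops_per_line == 0) {
        return 0;
    }
    return capacity_uops / uops_per_line;
}
```

Under that assumption, trace_cache_lines(12 * 1024, 6) gives 2048 lines.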

8 Entering the Execution Pipeline - Pentium 4's Trace Cache (source: Tom’s Hardware Guide)

9 Advantages of Trace Cache: More efficient use of limited cache space. Trace cache lines contain both branch instructions and the code after the branch instruction. No extra latency for branches. Does not use a TLB check.

10 “Execute Mode” (when the needed code is in the L1 cache): The P4’s Critical Execution Path

11 Execute Mode vs. Trace Segment Build Mode. Execute mode: the trace cache feeds stored traces to the execution logic to be executed; the trace cache normally runs in this mode. Trace segment build mode: used when there is an L1 cache miss; the front end fetches x86 code from the L2 cache, translates it into μops, builds a “trace segment” from them, and loads that segment into the trace cache to be executed.
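The two modes can be sketched as a tiny state decision. This is a toy model in C, not a description of the hardware: the type and field names are hypothetical, and the counters only loosely echo the “TC Deliver Mode”, “TC Build Mode” and “TC Misses” columns reported later.

```c
#include <stdbool.h>

/* The two front-end modes named on the slide. */
typedef enum { TC_EXECUTE_MODE, TC_BUILD_MODE } tc_mode;

/* Hypothetical bookkeeping; not real hardware counter semantics. */
typedef struct {
    unsigned long execute_lookups;
    unsigned long build_lookups;
    unsigned long misses;
} tc_stats;

/* Pick the mode for one lookup: a hit is served from stored traces
 * (execute mode); a miss forces the front end to fetch x86 code,
 * decode it and build a new trace segment (build mode). */
tc_mode tc_lookup(bool hit, tc_stats *stats)
{
    if (hit) {
        stats->execute_lookups++;
        return TC_EXECUTE_MODE;
    }
    stats->misses++;
    stats->build_lookups++;
    return TC_BUILD_MODE;
}
```

Feeding this function a stream of hit/miss outcomes gives a rough feel for how rarely build mode should be entered when the hit rate is high.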

12 Branch Prediction: For x86 code with a branch in it, the trace cache builds a trace from the instructions up to and including the branch instruction, then picks the branch it thinks the program will take, and continues to build the trace along that speculative branch.
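The trace-building rule on this slide can be sketched in C. The instruction model below is a toy (indices instead of x86 instructions, a per-branch flag standing in for the predictor), and every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction: either a plain instruction or a conditional branch
 * with a taken-target index. Purely illustrative, not an x86 model. */
typedef struct {
    bool   is_branch;
    size_t target;         /* index of the taken path (branches only) */
    bool   predict_taken;  /* stand-in for the branch predictor's guess */
} toy_insn;

/* Build a trace of instruction indices starting at `start`.
 * Each branch is itself placed in the trace, and the trace then continues
 * along the predicted path, until `max_len` entries have been written or
 * the end of the program is reached. Returns the number of entries. */
size_t build_trace(const toy_insn *prog, size_t prog_len, size_t start,
                   size_t *trace, size_t max_len)
{
    size_t pc = start;
    size_t n = 0;
    while (pc < prog_len && n < max_len) {
        trace[n++] = pc;                  /* the branch is part of the trace */
        if (prog[pc].is_branch && prog[pc].predict_taken) {
            pc = prog[pc].target;         /* follow the speculative path */
        } else {
            pc = pc + 1;                  /* fall through */
        }
    }
    return n;
}
```

The `max_len` bound also keeps the sketch from looping forever on a backward branch that is always predicted taken.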

13 Microcode ROM: Used by the P4 to process longer instructions, which allows the regular hardware decoder to concentrate on decoding the smaller, faster instructions. It stores a sequence of μops for each long instruction encountered. A tag is inserted into the trace segment that points to the section of the microcode ROM where the μop sequence is held. When a tag is encountered, the trace cache gives control to the microcode ROM until the proper sequence of μops has been produced. The execution engine does not care whether instructions come from the trace cache or the microcode ROM.
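The tag mechanism can be illustrated with a small C sketch. It assumes a trace slot is either an ordinary μop or a tag naming a range in the ROM; the structure layout and names are invented for illustration and are not taken from Intel documentation.

```c
#include <stddef.h>

/* A trace slot holds either an ordinary uop or a tag pointing into the
 * microcode ROM. All types and fields here are illustrative. */
typedef enum { SLOT_UOP, SLOT_MICROCODE_TAG } slot_kind;

typedef struct {
    slot_kind kind;
    int       uop;        /* valid when kind == SLOT_UOP */
    size_t    rom_start;  /* valid when kind == SLOT_MICROCODE_TAG */
    size_t    rom_count;  /* number of uops the ROM supplies for this tag */
} trace_slot;

/* Deliver the uop stream for a trace: ordinary slots pass through, and a
 * tagged slot hands control to the ROM until its sequence has been emitted.
 * Returns the number of uops written (at most `max_out`). */
size_t deliver_uops(const trace_slot *trace, size_t trace_len,
                    const int *rom, int *out, size_t max_out)
{
    size_t n = 0;
    for (size_t i = 0; i < trace_len; i++) {
        if (trace[i].kind == SLOT_UOP) {
            if (n < max_out) {
                out[n++] = trace[i].uop;
            }
        } else {
            for (size_t j = 0; j < trace[i].rom_count && n < max_out; j++) {
                out[n++] = rom[trace[i].rom_start + j];
            }
        }
    }
    return n;
}
```

The consumer of `out` sees one flat μop stream, which mirrors the point that the execution engine does not distinguish between the two sources.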

14 VTune Experiment

    for (i = 0; i < 1000000; i++) {
        _asm {
            mov eax, 10
            mov eax, 20
        }
    }

15 VTune Experiment

    for (i = 0; i < 1000000; i++) {
        _asm {
            mov eax, 10
            …
            mov eax, 4990
        }
    }
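Writing out hundreds of mov instructions by hand is tedious, so the benchmark source can be generated. The sketch below assumes the operands step by 10 (10, 20, ..., 10 × count), which is one reading of the elided sequence on the slide; the function name is hypothetical and the emitted text follows the MSVC-style inline assembly shown above.

```c
#include <stdio.h>

/* Write the microbenchmark source for a loop body of `count` mov
 * instructions to `out`.
 * Assumption: operands step by 10 (10, 20, ..., 10 * count).
 * Returns the number of mov instructions emitted. */
int emit_benchmark(FILE *out, int count)
{
    int emitted = 0;
    fprintf(out, "for (i = 0; i < 1000000; i++) {\n");
    fprintf(out, "    _asm {\n");
    for (int k = 1; k <= count; k++) {
        fprintf(out, "        mov eax, %d\n", 10 * k);
        emitted++;
    }
    fprintf(out, "    }\n");
    fprintf(out, "}\n");
    return emitted;
}
```

With this stepping, emit_benchmark(stdout, 499) ends the loop body at mov eax, 4990, and changing `count` moves the end point to probe where trace cache misses start to appear.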

16 VTune Results

Last instruction | Trace Cache Misses | Trace Cache Delivery Mode
mov eax, 4980    |                  0 |               174,605,634
mov eax, 4990    |              2,356 |               173,879,264
mov eax, 5000    |              3,945 |               174,448,595

17 VTune Results cont.

Distance | Run # | Spec microcode uops | Spec TC-built uops | Spec TC-delivered uops | TC Build Mode | TC Deliver Mode | TC Misses | uops Decoded | uops Retired
     501 |     1 |                   - |            224,924 |            509,636,316 |             - |     176,973,480 |     4,705 |  507,140,497 |  508,671,072
     501 |     2 |                   - |            222,852 |            510,233,880 |       441,313 |     175,451,130 |     5,505 |  512,390,816 |  509,080,482
     501 |     5 |              41,843 |            273,615 |            509,599,380 |       442,288 |     177,215,964 |    10,725 |  511,918,204 |  511,939,750
     490 |     1 |                   - |             86,260 |            491,929,610 |             - |     172,872,716 |     1,880 |  499,609,758 |  498,960,432
     490 |     2 |              48,614 |            373,550 |            498,086,310 |             - |     171,210,494 |     5,361 |  497,178,660 |  497,336,020
     490 |     4 |                   - |            217,040 |            500,424,768 |       382,397 |     173,107,503 |     6,444 |  496,964,597 |  496,790,932
     450 |     1 |                   - |            245,190 |            455,461,376 |             - |     157,471,452 |     1,877 |  458,074,872 |  460,907,257
     450 |     2 |              55,108 |             82,445 |            457,650,081 |             - |     154,759,410 |     5,768 |  460,827,366 |  459,866,660
     450 |     3 |                   - |            193,896 |            457,591,820 |       449,223 |     158,811,048 |    12,448 |  460,118,105 |  459,147,504
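One way to read these numbers is to divide the speculative TC-delivered μops by the “TC Deliver Mode” count. The slides do not define that counter; the sketch below assumes it counts cycles spent in deliver mode, in which case the ratio is μops delivered per such cycle. The function name is illustrative.

```c
/* Ratio of speculative TC-delivered uops to the "TC Deliver Mode" count.
 * Assumption: the deliver-mode counter counts cycles in deliver mode.
 * Returns 0.0 when the denominator is 0. */
double uops_per_deliver_count(double tc_delivered_uops, double tc_deliver_mode)
{
    if (tc_deliver_mode == 0.0) {
        return 0.0;
    }
    return tc_delivered_uops / tc_deliver_mode;
}
```

For distance 501, run 1, uops_per_deliver_count(509636316.0, 176973480.0) comes out at roughly 2.88.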

18 VTune Results for P4m

Distance | Run # | Spec uops retired | Spec TC-built uops | Spec TC-delivered uops | TC Build Mode | TC Deliver Mode | TC Misses | uops Retired
     150 |     1 |       157,706,784 |                  0 |            156,600,752 |             0 |      53,391,182 |     4,248 |  158,219,460
     150 |     2 |       158,352,360 |                  0 |            159,005,262 |       383,183 |      55,624,016 |     2,957 |  157,856,940
     149 |     1 |       157,698,240 |                  0 |            157,680,678 |             0 |      55,166,319 |     7,248 |  158,195,300
     149 |     2 |       157,311,357 |                  0 |            157,421,964 |       389,101 |      55,592,768 |     5,192 |  157,215,060
     130 |     1 |       163,841,040 |                  0 |            137,760,210 |             0 |      48,314,569 |         0 |  137,856,452
     130 |     2 |       139,101,786 |                  0 |            137,808,330 |       342,955 |      48,707,795 |     9,054 |  138,242,080
     130 |     3 |       140,317,920 |                  0 |            138,527,055 |       360,100 |      50,786,612 |         0 |  139,032,684

19 Sources:
M. Milenkovic, A. Milenkovic, J. Kulick, “Demystifying Intel Branch Predictors,” Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with the 29th ISCA), Anchorage, Alaska, May 2002.
E. Rotenberg, S. Bennett, J. E. Smith, “A Trace Cache Microarchitecture and Evaluation,” IEEE Transactions on Computers, Vol. 48, No. 2, February 1999.
,3973,1488,00.asp 5.htm 5.htm
