
Quantifying Instruction Criticality


1 Quantifying Instruction Criticality
Eric Tune, Dean Tullsen, Brad Calder UC San Diego, CSE Dept.

2 Critical Path Prediction
Classify instructions as: Critical: delaying execution of the instruction delays the whole program. Non-critical: delaying the instruction does not delay the program. In our research, my co-authors and I have been interested in critical path prediction. That is, we wanted to classify instructions as being critical or non-critical. Critical instructions are those where delaying the execution of the instruction delays the whole program. Usually, reducing the latency of critical instructions, or breaking their associated dependencies, will result in the program running faster. A non-critical instruction is one where delaying its execution by some amount does not delay the program. Likewise, executing a non-critical instruction sooner will not result in the program running any faster. Thus, any effort expended by the processor to speed non-critical instructions through the pipeline is wasted.

3 Critical Path Prediction
Classify instructions as they enter pipeline. in hardware Adapt to changes in program Handle instructions according to criticality. value prediction wasteful on non-critical instructions [Tune et al. HPCA 2001] multi-speed functional units Slower path for non-critical instructions [Seng et al. MICRO 2001] The idea of critical path prediction is to classify dynamic instructions as they enter the pipeline as being either critical or non-critical. Then, to handle instructions differently according to their criticality. This is all supposed to happen in the microprocessor hardware as the program is running, for each dynamic instruction. It is desirable to do this in hardware, for one thing, in order to adapt quickly to changes in the program. When instructions are classified as critical or non-critical, they can be handled accordingly. For example, one might choose to value predict only critical instructions. Value predicting a non-critical instruction would be useless. By reducing the number of instructions that are candidates for being value predicted, better use may be made of limited resources, such as a finite value prediction table. As another example, once we have identified non-critical instructions, we may choose to intentionally delay those instructions. For example, we may send non-critical instructions to slower functional units or slower pipeline clusters, which in turn saves power.

4 Critical Path Prediction
Identify critical instructions Heuristic [Tune, et al, HPCA 2001] Graph + Token [Fields, et al, ISCA 2001] Predict based on past behavior PC-indexed table of counters. 1 2 Now knowing exactly what instructions are critical or not critical before they even execute is not possible in general. Therefore, we only make a prediction as to what instructions will be critical or non-critical. Critical Path Prediction consists of two steps. First, some time after an instruction has executed, we identify whether that instruction was critical or not-critical. Different methods have been identified for determining the criticality of instructions. We first proposed several heuristics for identification of critical instructions. We looked for clues in the pipeline, as to what instructions might have been critical, such as identifying as critical those instructions that wait at the head of the reoder buffer or of the instruction queue of an out-of-order processor. Fields, et al. introduced a dependence-graph model of microprocessor constraints, and a scheme to trace the critical path though that graph by passing a token between virtual nodes of that graph. Regardless of what method is used to identify critical instructions, the next step is to predict whether future instances of that instruction will be critical. In both these studies, the instructions identified as critical would increment a counter in a PC-indexed table of counters. When the same PC is fetched again, it may be predicted as critical if the counter is above a threshhold.
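To make the two-step scheme concrete, here is a minimal sketch of a PC-indexed criticality predictor with saturating counters; the table size, counter width, and threshold are illustrative choices, not values from the cited papers.

```python
# Sketch of a PC-indexed criticality predictor (illustrative parameters).

class CriticalityPredictor:
    def __init__(self, table_size=1024, max_count=7, threshold=4):
        self.table = [0] * table_size   # one saturating counter per entry
        self.size = table_size
        self.max = max_count
        self.threshold = threshold

    def _index(self, pc):
        return (pc >> 2) % self.size    # drop byte offset, map PC to an entry

    def predict(self, pc):
        # Predict "critical" when the counter is at or above the threshold.
        return self.table[self._index(pc)] >= self.threshold

    def train(self, pc, was_critical):
        # After the instruction's criticality is identified, nudge its counter.
        i = self._index(pc)
        if was_critical:
            self.table[i] = min(self.max, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

At fetch, `predict` is consulted; after the identification step (heuristic or graph/token), `train` updates the counter for that PC.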

5 Goals Build framework to ask questions about critical path.
Compare graph-based critical path model to detailed simulation. Develop finer classification than critical/non-critical Find best instructions to “optimize”. Study how criticality of static instructions varies. Improve prediction accuracy I've just given you some background on critical path prediction. What I have talked about so far is prior work. Entering into this work, we wanted to build a framework that would allow us to ask several questions about critical instructions. First, we wanted to see how well the models of the critical path that had been proposed before actually represented the behavior of a real processor, or at least a detailed simulation of one. We identified one way in which such a model could be extended to include more detail about the processor, and we identified an area where such models can fall short. Secondly, we wanted to develop a finer classification for instructions. Rather than making a binary determination, critical or non-critical, we wanted to be able to say which instructions, among critical instructions, were the most critical. If we could identify a very few instructions with a large potential impact on execution time, then we envisioned that we could apply certain microarchitectural optimizations which could only be applied to a limited number of instructions. One example of such an optimization might be a speculative thread that precomputes a value for a single instruction, or value prediction from a very small value prediction table. Finally, we wanted to know how static instructions vary in their criticality. We wondered if instructions changed their criticality throughout the run of the program, and if they did so in a predictable fashion. This could lead to more accurate predictor tables, which would increase prediction accuracy, which in turn would increase the usefulness of critical path predictions.

6 Our approach Run detailed simulation Build graph representation.
3-node graph model of Fields, et al. extended to model issue width Tool computes effect of changes to graph. Delay or advance when single instruction produces result. Determine effect on program runtime. Repeat for every instruction Don't have to recompute whole graph. We started by running a detailed simulation to generate a trace of what instructions were executed, their latencies, branch mispredictions, etc. We used this to build a graph representing the program. We started with the 3-node graph proposed by Fields et al. The key feature of their model is that it is a graph of dependencies with 3 nodes for each dynamic instruction. It models the effect of a finite instruction window and branch mispredictions. We extended this graph to model the effect of a limited issue width. We built a tool that allows us to compute the runtime of the program before and after changing the graph to represent executing an instruction later, or making its result available sooner. To do this, we change the time when the result of a particular instruction is made available, and then recompute the longest path through the graph, which represents the execution time of the whole program. We also compared these computed effects against complete resimulations of the program with the same single instruction changed. We then repeat this process of adjusting each dynamic instruction in the program trace forward and backward in time, to determine its effect on the runtime of the program. Fortunately, as we explain in the paper, you do not need to recompute the length of the entire graph to know with certainty the effect of moving an instruction on the overall runtime of the program.
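A toy version of this delay-and-recompute experiment can be sketched as follows, using one node per instruction rather than the 3-node model, and an invented four-instruction trace. Runtime is the longest path through the dependence DAG, recomputed here in full for simplicity (the paper's incremental method avoids that).

```python
# Toy longest-path model: runtime before and after moving one instruction.

def runtime(latency, deps):
    """Longest-path length; instructions are numbered in program
    (topological) order, and deps[i] lists i's producers."""
    finish = {}
    for i in sorted(latency):
        ready = max((finish[d] for d in deps.get(i, ())), default=0)
        finish[i] = ready + latency[i]
    return max(finish.values())

lat = {0: 1, 1: 16, 2: 1, 3: 1}      # instruction latencies (cycles)
deps = {1: [0], 2: [0], 3: [1, 2]}   # instruction 3 consumes both 1 and 2

base = runtime(lat, deps)            # critical path goes 0 -> 1 -> 3

off_path = dict(lat); off_path[2] += 5   # delay instruction 2 (not critical)
on_path = dict(lat); on_path[1] += 5     # delay instruction 1 (critical)

print(base, runtime(off_path, deps), runtime(on_path, deps))   # 18 18 23
```

Delaying the off-path instruction leaves the runtime unchanged, while delaying the critical one lengthens the program by the full delay, which is exactly the distinction the graph experiment measures.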

7 Tautness/Slack Slack Tautness
Delay an instruction's result until it affects runtime. Tautness Make an instruction's result available when it is fetched; measure the reduction in runtime. Not all critical instructions are equally good targets for optimization. We used our approach to measure two quantities for each instruction. We measured the number of cycles that an instruction could be delayed without affecting the runtime of the program. This is the slack of the instruction. The idea of slack is not a new one. However, we also propose a measurement that is complementary to slack, which we believe to be a novel one. We called this measurement tautness. Tautness is a first-order approximation of how much benefit can be gained, in terms of reduction in runtime, from somehow optimizing a particular instruction. To measure the tautness of an instruction, we measured the effect on the runtime of the program of making the result of that instruction available as soon as it was fetched. This is analogous to what would happen if the instruction were value predicted. Although, in one sense, all critical instructions are the same, in that delaying them will increase the runtime of the program, in another sense, they can be very different. Depending on what other paths are parallel to the critical path, a particular critical instruction may, at one extreme, have no nearby paths which are anywhere near as long, or it may, at the other extreme, have another nearby path which is just as long. Thus all instructions are not equally good targets for optimization.

8 Tautness Tautness of instruction X = 13 cycles
Let me give some examples. In this figure, I used the blue boxes to represent instructions. The length of the boxes, in the horizontal direction, represents the execution latency of the instructions. The edges in the graph represent data dependencies. For clarity, I do not show the additional nodes or dependencies used in the 3-node model. Looking at the graph, you can see that the longest path from the leftmost instruction to the rightmost instruction goes through the big wide box on top, which represents an instruction with a latency of 16 cycles. I highlighted that path in red. In this example, we would say that instruction X, the wide box, has a tautness of 13 cycles, because the next-longest path, which does not include X, is 3 cycles long, through the three 1-cycle instructions on the bottom.

9 Tautness Tautness of instruction X = 1 cycle Instruction X = 16 cycles
Contrast the previous example with this one. In this case, the second-longest path, through three 5-cycle instructions, is 15 cycles long, versus the longest path, which is still 16 cycles long. Here, we define the tautness of instruction X to be only 1 cycle.
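The two figures can be reproduced with a small longest-path sketch; zeroing X's latency stands in for making its result available at fetch, and the node numbering and zero-latency source/sink nodes are invented for illustration.

```python
# Tautness of X = runtime drop when X's result is available immediately
# (approximated here by setting X's latency to zero).

def finish_times(latency, deps):
    finish = {}
    for i in sorted(latency):
        ready = max((finish[d] for d in deps.get(i, ())), default=0)
        finish[i] = ready + latency[i]
    return finish

def tautness(latency, deps, x):
    base = max(finish_times(latency, deps).values())
    zeroed = dict(latency)
    zeroed[x] = 0
    return base - max(finish_times(zeroed, deps).values())

# X (node 1, 16 cycles) in parallel with a chain of nodes 2-4.
deps = {1: [0], 2: [0], 3: [2], 4: [3], 5: [1, 4]}
lat_a = {0: 0, 1: 16, 2: 1, 3: 1, 4: 1, 5: 0}   # parallel path 3 cycles
lat_b = {0: 0, 1: 16, 2: 5, 3: 5, 4: 5, 5: 0}   # parallel path 15 cycles
print(tautness(lat_a, deps, 1), tautness(lat_b, deps, 1))   # 13 1
```

The same 16-cycle instruction has a tautness of 13 cycles or of 1 cycle depending only on the paths parallel to it.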

10 Tautness Approximates benefit from “optimizing” instruction.
Not the same as the latency of the instruction. The point of this metric, tautness, is to identify which instructions stand out on the critical path above other paths, and are thus the best candidates for optimizations. When I say optimization, I mean using some microarchitectural resource to speed the instruction through the pipeline. The latency of an instruction alone does not determine how much can be gained from “optimizing” it.

11 Tautness This graph shows a cumulative distribution of tautness values for all the instructions in simulated segments of several different benchmarks. The x-axis represents the number of cycles of tautness that an instruction has. The y-axis represents the percentage of dynamic instructions in a benchmark that have at least that much tautness. One thing that stands out to me in this graph is the fact that most of the curves are fairly smooth. This indicates that, even though there are only a few possible latencies for instructions, the tautness of instructions varies over a wide range of values, emphasizing the point that the latency of an instruction does not solely determine its effect on runtime. Second, I see that some benchmarks, like parser, on top, have a large percentage of instructions, maybe 5% of instructions, which contribute more than 50 cycles each to the runtime of the program. On the other hand, some programs, like vpr, have almost no instructions which contribute more than 20 cycles to the runtime of the program. I should point out that the lines intersect the y-axis at less than 100% because these distributions only include instructions that were critical. Non-critical instructions would typically have 0 cycles of tautness.

12 Validation Validation by complete resimulation with one instruction adjusted. While the graph-based model captures many features of a real processor, it does not capture all the constraints. To better understand what is not covered, and to validate our results, we wanted to compare the changes in program runtime computed by our tools and the graph-based model with our detailed simulation. We compared the amount of slack or tautness that we measured for randomly selected dynamic instructions in the program, and then we repeated the detailed processor simulation, making changes to just that one dynamic instruction. So, to validate the results for several hundred randomly chosen instructions, we performed several hundred complete resimulations. These graphs show the correlation between slack, on the right, and tautness, on the left, as measured by the graph-based approach and by complete resimulation. Points on the x=y line represent perfect agreement between the detailed simulator and our graph-based computations. You can see that for this benchmark, the tautness measurements are very well correlated. However, the correlation for the slack values is not as good. We investigated the reasons for disagreement for many of the outlying points. One reason for a number of the outlying points is that the graph-based model does not capture the relationship between loads which access the same cache line. The benchmark shown here is twolf. [Figure: scatter plots of graph-based vs. resimulated tautness (left) and slack (right), in cycles.]

13 Cache Line Problems Load A not on “critical path” Access same cache line
One reason for disagreement between the graph-based model and the detailed simulation is that the graph-based model cannot capture the relationship between loads that access the same cache line. Consider two loads which access the same cache line. I have represented instructions in the same way as before in this example: the latency of an instruction is represented by the width of a box. Additionally, if you imagine time going from left to right, the position of the boxes represents when instructions are executing. This is basically the same as a Gantt chart. The first load, load A, is a miss, but it completes first. The second load, load B, accesses the same cache line, later on, and is thus a hit. When the graph is constructed from the execution trace, the actual load latencies are used. Thus the critical path is computed to go through the instructions along the bottom, including the second load instruction, load B.

14 Cache Line Problems Load A appears to have slack
according to the latency assigned from the execution trace. Now, if we calculate the slack for the first load, which missed, we might find that it has several cycles of slack. It appears, just from the graph, that we can safely execute load A later.

15 Cache Line Problems Actually, Load B becomes miss.
Problem: load latencies fixed in graph, but not in simulation. But, since load B depends on load A to prefetch for it, moving load A later causes load B to become a partial miss, which thus extends the critical path. The problem is that the graph is assigned fixed latencies from an execution trace, but the latency of a load instruction may depend on the relative timing of other load instructions. Fixing this might be difficult, since, when the graph no longer has fixed latencies assigned to it, we cannot apply the linear-time longest-path algorithms, but would have to resort to more complex algorithms. We believe that this problem affects other research that has been done to measure the slack of instructions as well.

16 Variability of criticality
Percent of Static Instructions The third question we asked in our research was whether static instructions are biased in their criticality. That is, if a particular static instruction was critical in the past, will it be critical in the future? The answer is no. The extent of the y-axis represents all static instructions that were ever critical in each of, I think, 16 SPEC 2000 benchmarks. The green bars in this graph show the fraction of those static instructions that are critical 95% of the times they are executed. The blue bars show those static instructions that are critical half of the times they are executed. Looking at all the benchmarks in this graph, we can see that there are few instructions that are critical even half the time. So, attempts to accurately predict which instructions are critical based on the past behavior of the same static instructions are not likely to work especially well. Looking in particular, you can see that for equake and swim, there are no, or almost no, instructions that are critical even half the times they are executed. Why is this? For the regions that we were simulating, both these programs are executing in small inner loops, with occasional load misses. When consecutive load misses don't fit in the reorder buffer together, then many of the instructions between the load misses are not on the critical path; rather, the critical path skips over them as a result of a full reorder buffer. So, only the occasional load misses are critical, but the other instructions are not.

17 Handling Variability Biased counters Correlation
Correctly predicts more critical instructions. Many false critical predictions Want to cover critical, ok some extra non-critical. Correlation Try to improve prediction accuracy overall. So, how can one accurately predict which dynamic instructions will be critical, when the same static instruction is not always critical? One approach, which we proposed in our previous work, is to use a biased counter in the prediction table. That is, if some PC was critical the previous time it was executed, we will predict that it is going to be critical the next 8 times it is executed. The drawback of this approach is that it may identify as critical many instructions that were in fact non-critical. While this approach may not actually improve the overall accuracy of the predictions, it can be a favorable tradeoff, because it may be more important to correctly classify every critical instruction than to correctly classify every non-critical instruction. A second approach is to try to correlate the criticality of instructions with previous events. This is basically an extension of techniques that have been used to increase branch prediction accuracy.
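The biased-counter idea can be sketched in its simplest possible form: a per-PC countdown of forced "critical" predictions, re-armed whenever the instruction is observed to be critical. The structure below is an illustration of the behavior described above, not the paper's hardware design.

```python
# Sketch of a biased criticality predictor: after a critical observation,
# predict "critical" for the next N executions of that PC (N = 8 in the talk).

class BiasedPredictor:
    def __init__(self, bias=8):
        self.bias = bias
        self.remaining = {}   # per-PC count of forced "critical" predictions

    def predict(self, pc):
        return self.remaining.get(pc, 0) > 0

    def train(self, pc, was_critical):
        if was_critical:
            self.remaining[pc] = self.bias              # re-arm the bias
        else:
            r = self.remaining.get(pc, 0)
            self.remaining[pc] = max(0, r - 1)          # consume one prediction
```

This covers nearly every critical instance at the cost of many false "critical" predictions, which is the tradeoff described above.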

18 Handling Variability Correlation Local history Other histories:
Improved prediction accuracy Load stride Branch Miss, Branch, Load Miss We found that using the local history of an instruction, that is, the pattern of previous critical and non-critical instances, resulted in a consistent improvement in critical path prediction accuracy. We also looked at several other types of information. We thought the history of branch directions, or the history of which branches had been mispredicted recently, might affect which instructions were critical in the future, and also that the pattern of recent loads that had missed might affect which instructions were critical. However, we did not find these to be as effective as local history. One reason why one might expect local history to be useful in predicting instruction criticality is that certain load instructions may be critical only when they miss, and furthermore, load instructions may miss in certain repeating patterns. For example, a load instruction may hit 3 times in a cache line, and then miss the 4th time. In our studies, we did not model a processor with a stride-based load address predictor and prefetching. In a processor with such a capability, we would expect that many of these predictable load misses would be removed, and some of the benefit of local-history-based prediction would be lost.
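A local-history predictor in the style of two-level branch predictors might look like the sketch below; the history length, the single shared pattern table, and the counter sizes are all illustrative choices rather than the configuration studied in the talk.

```python
# Sketch of a local-history criticality predictor: each PC keeps a shift
# register of recent critical/non-critical outcomes, which indexes a
# second-level table of saturating counters (shared across PCs here).

class LocalHistoryPredictor:
    def __init__(self, hist_bits=4, threshold=2, max_count=3):
        self.hist_bits = hist_bits
        self.history = {}                       # per-PC outcome history
        self.counters = [0] * (1 << hist_bits)  # pattern table
        self.threshold = threshold
        self.max = max_count

    def predict(self, pc):
        pat = self.history.get(pc, 0)
        return self.counters[pat] >= self.threshold

    def train(self, pc, was_critical):
        pat = self.history.get(pc, 0)
        if was_critical:
            self.counters[pat] = min(self.max, self.counters[pat] + 1)
        else:
            self.counters[pat] = max(0, self.counters[pat] - 1)
        # shift the new outcome into this PC's local history
        mask = (1 << self.hist_bits) - 1
        self.history[pc] = ((pat << 1) | int(was_critical)) & mask
```

An instruction that strictly alternates critical and non-critical, which defeats a plain PC-indexed counter, trains distinct pattern entries and becomes fully predictable.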

19 Handling Variability Example: Benchmark equake PC-indexed predictor
Correctly predicts as critical 0% of critical instructions. Local-history predictor Correctly predicts as critical 79% of critical instructions. PC-indexed predictor, biased counters Correctly predicts as critical 86% of critical instructions Makes 5.8x as many “critical” predictions. As an example of where a two-level predictor is quite effective, consider the benchmark equake. Using just a basic PC-indexed predictor, with an unbiased counter, no critical instructions are correctly predicted as being critical. That is because almost no instruction is critical two times in a row, and thus the counter never exceeds the threshold for being predicted critical by a simple two-bit counter. Using a local-history predictor, which picks up on alternating patterns of critical and non-critical for a particular instruction, 79% of critical instructions are correctly identified as critical. Now, by using a biased counter, which makes 8 critical predictions in a row after an instruction was identified as critical, you can identify 86% of critical instructions, but at the cost of predicting nearly 6 times as many instructions as being critical. Thus, while the local-history and biased predictors are both Pareto optimal, the local-history predictor may have a strong advantage in practice.

20 Further work Better understand limits of critical path models.
Predict high tautness for instructions in hardware. Apply optimizations to high-tautness instructions. As further work, we would like to better understand the limits of models of the critical path, such as graph-based models. We would also like to be able to predict which instructions have a high tautness value, and to prioritize those instructions for access to certain microarchitectural optimizations.

21 Summary Built tool to study critical path.
Defined “Tautness,” a measure of the importance of critical instructions. Compared CP model with simulation. Highly accurate critical path prediction not possible with only PC-indexed predictors. Local history helps CP predictions. In summary, our contributions were: to build a tool to study the critical path of programs; to propose a metric to prioritize among critical instructions; to compare proposed critical path models with detailed resimulation; to point out that highly accurate critical path prediction will not be possible with only PC-indexed predictors, because static instructions vary in their criticality; and to show that local history of criticality improves critical path prediction accuracy.


