SoC CAD 1 Tuning the Continual Flow Pipeline Architecture. 徐子傑 (Hsu, Zi Jei), Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
NCKU SoC & ASIC Lab 2 Hsu, Zi Jei SoC CAD INTRODUCTION(1/5) To improve superscalar processor performance on difficult-to-parallelize applications, architects have been increasing the capacity of reorder buffers, reservation stations (RS), physical register files, and load and store queues with every new out-of-order processor core. For two decades, increasing instruction buffer sizes provided good performance improvement. However, this approach no longer works. A different design strategy is to size the instruction buffers to the minimum capacity necessary to handle the common case of an L1 data cache hit, and to use new scalable out-of-order execution algorithms to handle code that misses the L1 data cache.
INTRODUCTION(2/5) The Continual Flow Pipeline (CFP) architecture was proposed as an energy-efficient large instruction window architecture for reducing the impact of data cache misses on performance without increasing the sizes of the instruction buffers and physical register files. CFP handles data cache misses as follows. When a load misses the data cache, a poison bit is set in the destination register of the load. Load-dependent instructions in the reservation stations (RS) are then woken up as if the load had completed.
INTRODUCTION(3/5) Poison bits propagate through instruction dependences and identify all instructions that depend on the load miss, together with their descendants. The miss load and its dependents, identified by the poison bits in the reorder buffer (ROB), pseudo-commit in program order and move from the ROB into a waiting buffer (WB) outside the pipeline. Since dependent instructions do not tie up pipeline resources, the core can execute far ahead in the program without stalling on the cache miss. When the miss data is fetched, the dependent instructions wake up and replay from the WB into the pipeline to complete their execution. When the WB is emptied and all miss-dependent instructions have completed, independent and dependent instruction results are merged using a flash copy operation in the retirement register file. Execution then resumes normally.
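The poison-bit propagation described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Instr` record, the `find_poisoned` function, and the register names are hypothetical and not from the presentation, and the rule that a clean overwrite clears a register's poison bit is an assumption of this model.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Instr:
    name: str                    # label used only for this illustration
    dest: Optional[str]          # destination register, or None
    srcs: Tuple[str, ...] = ()   # source registers
    misses: bool = False         # True for a load that misses the data cache


def find_poisoned(program: List[Instr]) -> List[str]:
    """Return the names of the miss load(s) and all of their descendants."""
    poisoned_regs: Set[str] = set()
    poisoned: List[str] = []
    for ins in program:  # walk the instructions in program order
        if ins.misses or any(s in poisoned_regs for s in ins.srcs):
            poisoned.append(ins.name)
            if ins.dest is not None:
                poisoned_regs.add(ins.dest)    # poison flows to the result
        elif ins.dest is not None:
            # Assumption: a miss-independent write makes the register clean.
            poisoned_regs.discard(ins.dest)
    return poisoned


program = [
    Instr("ld", "r1", ("r9",), misses=True),  # load miss
    Instr("add", "r2", ("r1", "r3")),         # depends on the load
    Instr("sub", "r4", ("r5", "r6")),         # independent
    Instr("mul", "r1", ("r4", "r4")),         # independent, overwrites r1
    Instr("or", "r7", ("r1", "r2")),          # depends on add through r2
]
print(find_poisoned(program))  # ['ld', 'add', 'or']
```

In this model only `ld`, `add`, and `or` would move to the waiting buffer; `sub` and `mul` can execute and retire without waiting for the miss.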
INTRODUCTION(4/5) In that work, the miss-independent and miss-dependent instructions execute at different times, based on the timing of the load miss event and the data arrival event. Switching between the two executions is costly because it involves a pipeline flush, making this proposal unsuitable for L1 misses that hit the on-chip cache. In a more recent work, simultaneous CFP (S-CFP) executes the independent and dependent instructions simultaneously to avoid the costly pipeline flush, making S-CFP more suitable for first-level data cache misses.
INTRODUCTION(5/5) In this paper, we use a novel virtual register renaming substrate and fine-tune the replay policies to mitigate excessive replays and rollbacks to the checkpoint. In previous CFP architectures, the miss load releases its renamed destination register when it pseudo-commits. This breaks the dependence links between the miss load and its dependents, requiring the dependent thread to be renamed again to re-establish the dependence relation. In this work, we use virtual register names, which persist for the full lifetime of the instructions, to specify the dependences between instructions. This work also introduces an improved CFP policy that keeps miss-dependent instructions in the reservation stations as long as they do not block the pipeline. However, when the instruction buffers fill up with miss-dependent instructions and stall the pipeline, CFP moves the miss-dependent instructions into the waiting buffer.
Continual Flow Pipeline Architecture(1/3) S-CFP Architecture Overview: Figure 1 shows a block diagram of the S-CFP microarchitecture. Unlike previous latency-tolerant out-of-order architectures, the S-CFP core executes cache-miss-dependent and independent instructions concurrently using two different hardware thread contexts. S-CFP also has two retirement register file (RRF) contexts, one for retiring miss-independent instruction results and the other for retiring miss-dependent instruction results. The independent hardware thread is the main execution thread. It is responsible for instruction fetch and decode of all instructions, branch prediction, memory dependence prediction, identifying miss-dependent instructions, and moving them into the waiting buffer (WB).
Continual Flow Pipeline Architecture(2/3) At the end of dependent execution, when all the instructions from the WB have retired (i.e., committed) without any mispredictions or exceptions, the independent and dependent execution results are merged with a flash copy of the dependent and independent register contexts within the retirement register file. The dependent thread execution starts when the load miss data is brought into the cache, waking up the load instruction in the WB, and continues until the WB empties. To maintain proper memory ordering of loads and stores from the independent and dependent thread execution, S-CFP uses load and store queues (LSQ), a Store Redo Log (SRL), and a store-set memory dependence predictor.
Continual Flow Pipeline Architecture(3/3) Figure 1. Simultaneous CFP architecture block diagram. WB: waiting buffer; RAT: register alias table; RRF: retirement register file contexts; RS: reservation station; SRL: store redo log; LSQ: load and store queues
S-CFP and Tuned CFP Execution Examples(1/10) Figure 2. Execution sequence showing S-CFP eagerly moving a dependent instruction into the WB. X: load miss serviced from DRAM; A: load miss that hits in the L2 cache
S-CFP and Tuned CFP Execution Examples(2/10) S-CFP Execution Examples: In Figure 2(a), the WB has a load miss X at the head waiting for its wakeup. Instruction A misses the first-level cache and is marked as a potential candidate to be moved into the WB. When A reaches the head of the ROB, there are still free entries available in the ROB, but S-CFP moves A into the WB eagerly anyway. Figure 2(b) shows that the load miss hits the L2 data cache and A is woken up from the L1 data cache shortly after it enters the WB. However, A remains stuck in the WB behind instruction X, which has missed to DRAM, for a long time afterwards, until the miss data of load X is fetched from DRAM.
S-CFP and Tuned CFP Execution Examples(3/10) Tuned CFP Execution Examples: Figures 2(c) and 2(d) show the ROB and WB states in the Tuned CFP architecture. Instruction A misses the first-level cache and is marked as poisoned. However, unlike in S-CFP, it does not release its RS entry until it becomes a blocking instruction, in case the load hits the L2 cache, which would provide the miss data to the CFP core shortly. A is kept in the RS and ROB by stalling pseudo-retirement as long as there are free entries in the ROB and other instruction buffers, so that the pipeline can continue executing other instructions without blocking. If the miss data arrives before the pipeline blocks, A is woken up from the RS and ROB by clearing its poison bits, as shown in Figure 2(d). In this example, A and its dependents do not need to go through the replay loop at all, saving significant time and energy.
S-CFP and Tuned CFP Execution Examples(4/10) Figure 3. Execution sequence showing a scenario leading to rollback in S-CFP which is avoided in Tuned CFP. X: load miss serviced from DRAM; A: load miss that hits in the L2 cache; F: mispredicted branch that depends on A
S-CFP and Tuned CFP Execution Examples(5/10) S-CFP Execution Examples: Figures 3(a)-3(c) show an execution sequence that illustrates a situation in S-CFP leading to a rollback to the checkpoint. Instruction A misses the L1 cache. In this example, F is a branch that depends on A. A is moved into the WB from the head of the ROB, as shown in Figure 3(a). F follows A into the waiting buffer, even though the wakeup for A arrives while F is still in the ROB/RS, as shown in Figure 3(b). Both A and F are replayed behind instruction X, as shown in Figure 3(c). On replay, branch F is found to be mispredicted, and branch misprediction recovery has to be performed by rolling back execution to the checkpoint, because by then the sequential state in the register file has been corrupted by the out-of-order pseudo-retirement of instructions during cache miss processing.
S-CFP and Tuned CFP Execution Examples(6/10) Tuned CFP Execution Examples: Figures 3(d) and 3(e) show how the rollback situation is avoided in the Tuned CFP architecture. As in the previous example, instruction A stays in the ROB, even if it reaches the head, as long as it is not blocking execution. A gets its wakeup before it moves into the WB, as shown in Figure 3(e). Even though F is a miss-dependent, mispredicted branch, it executes before it pseudo-retires. When F reaches the head of the ROB, the ROB flushes the pipeline to clear all the wrong-path instructions fetched after the branch and signals the fetch unit to restart fetch and execution from the corrected target. The costly S-CFP branch recovery from the checkpoint is avoided.
S-CFP and Tuned CFP Execution Examples(7/10) Figure 3. Execution sequence showing a scenario leading to rollback in S-CFP which is avoided in Tuned CFP. A: load miss; B: dependent on A; x: don't care
S-CFP and Tuned CFP Execution Examples(8/10) S-CFP Execution Examples: Figures 4(a)-4(d) show another execution sequence that illustrates why S-CFP needs to replay a load and all its dependents once the load enters the WB. In this example, A is a load miss and B is dependent on A. The two instructions are separated by miss-independent instructions, shown as dotted lines. In Figure 4(a), A reaches the head of the ROB. It pseudo-retires and moves into the WB, releasing all its pipeline resources, including its ROB ID #3, as shown in Figure 4(b). When A wakes up and replays, it is allocated a new entry at the tail of the ROB, as shown in Figure 4(c). Notice that A gets a new ROB ID #24 when it is reintroduced into the pipeline.
S-CFP and Tuned CFP Execution Examples(9/10) S-CFP Execution Examples: Because of this new ID, even though B is still in the RS and the ROB while A is being replayed, A's data writeback cannot wake up B, because B still has the physical register destination ID #3 as its source operand. B reaches the ROB head, pseudo-retires, and moves into the WB. When B is replayed and reintroduced into the pipeline, it goes through the rename stage, gets a new ROB ID #28, and receives the correct physical source register ID #24 from the dependent RAT, re-establishing its link with A, as shown in Figure 4(d).
S-CFP and Tuned CFP Execution Examples(10/10) Tuned CFP Execution Examples: Figures 4(e)-4(g) illustrate a partial replay in the Tuned CFP architecture for the same scenario discussed earlier in Figure 4(a). In Tuned CFP, virtual register IDs that are not associated with any physical locations are used for register renaming and in the RS wakeup and scheduling logic. The virtual register IDs of instructions A and B are shown under their ROB entries, in addition to the renamed source and destination virtual register IDs. As before, when A reaches the head of the ROB, it pseudo-retires and moves into the WB, as shown in Figure 4(e). However, unlike in S-CFP, A releases only its RS entry and carries its virtual register ID #3 along with it into the WB, as shown in Figure 4(f). Later, when it wakes up and replays, A still carries its original virtual register ID #3, so its link with its dependent instruction B remains intact. This allows B to be woken up and scheduled by the RS without having to replay and be renamed again, as shown in Figure 4(g).
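The difference between the two Figure 4 scenarios comes down to tag matching at wakeup. The sketch below is a hypothetical software model (the `WaitingEntry` class and its `snoop` method are illustrative names, not part of either design): B waits on tag #3, and only a writeback that broadcasts tag #3 can wake it.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WaitingEntry:
    """A reservation-station entry waiting on tagged source operands."""
    name: str
    source_tags: Dict[int, bool] = field(default_factory=dict)  # tag -> ready?

    def snoop(self, broadcast_tag: int) -> None:
        # Capture a result from the writeback bus only if its tag matches.
        if broadcast_tag in self.source_tags:
            self.source_tags[broadcast_tag] = True

    @property
    def ready(self) -> bool:
        return all(self.source_tags.values())


# S-CFP-like case: replayed A was given a new ROB ID (#24), but B still
# waits on the old ID (#3), so the writeback does not wake B.
b_scfp = WaitingEntry("B", {3: False})
b_scfp.snoop(24)

# Tuned-CFP-like case: A keeps virtual register ID #3 across the replay,
# so its writeback matches B's source tag.
b_tuned = WaitingEntry("B", {3: False})
b_tuned.snoop(3)

print(b_scfp.ready, b_tuned.ready)  # False True
```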
Tuned CFP Architecture Overview(1/9) Figure 5 shows a block diagram of the Tuned CFP core. Tuned CFP uses Tomasulo's algorithm and reservation stations to perform data-driven, out-of-order execution. Like other out-of-order superscalar architectures, Tuned CFP uses a reorder buffer to commit instructions and update architectural register and memory state in program order. However, Tuned CFP does not use the reorder buffer for register renaming. Instead, it performs register renaming using virtual register IDs generated by a special counter. These virtual register IDs are not mapped to any fixed storage locations in the core, and therefore can be large in number and allocated to instructions throughout their lifetime, including miss-dependent instructions evicted to the waiting buffer.
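Counter-based virtual renaming can be sketched in a few lines. This is a minimal model under assumptions not stated in the slides: the `VirtualRenamer` class is hypothetical, the counter is unbounded (a hardware counter would be finite and wrap), and a source with no in-flight producer is represented by `None`, meaning its value would be read from the retirement register file.

```python
import itertools
from typing import Dict, Optional, Tuple


class VirtualRenamer:
    """Rename architectural registers to counter-generated virtual IDs."""

    def __init__(self) -> None:
        self._next_vid = itertools.count()   # the "special counter"
        self.rat: Dict[str, int] = {}        # architectural reg -> latest VID

    def rename(
        self, dest: Optional[str], srcs: Tuple[str, ...]
    ) -> Tuple[Optional[int], Tuple[Optional[int], ...]]:
        # Read source mappings first, so an instruction such as
        # r1 = r1 + 1 sees the previous producer of r1.
        src_vids = tuple(self.rat.get(s) for s in srcs)
        dest_vid = None
        if dest is not None:
            dest_vid = next(self._next_vid)
            self.rat[dest] = dest_vid
        return dest_vid, src_vids


renamer = VirtualRenamer()
print(renamer.rename("r1", ()))            # (0, ())
print(renamer.rename("r2", ("r1", "r3")))  # (1, (0, None))
print(renamer.rename("r1", ("r1",)))       # (2, (0,))
```

Because a virtual ID names a producer rather than a storage location, an instruction can keep the same ID while it sits in the waiting buffer.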
Tuned CFP Architecture Overview(2/9) Virtual register renaming gives Tuned CFP a significant advantage over previous CFP architectures. Past CFP architectures require all miss-dependent instructions to be replayed and renamed again to re-establish dependence links, which is necessary for the reservation stations to re-dispatch the miss-dependent instructions in correct data flow order. In contrast, since the virtual register IDs are permanent from the time the miss-dependent instructions are renamed until they execute and commit, Tuned CFP can perform a partial replay of dependent instructions. That is, if the load miss data is fetched from memory after the load has moved to the waiting buffer but before its dependents have moved, Tuned CFP replays only the load. This saves the significant execution time that would be spent if all the miss-dependent instructions in the reservation stations had to be replayed through the waiting buffer to be renamed again.
Tuned CFP Architecture Overview(3/9) Figure 5. Tuned CFP block diagram. WB: waiting buffer; RAT: register alias table; RRF: retirement register file contexts; RS: reservation station; SRL: store redo log; LSQ: load and store queues; VID: virtual register ID
Tuned CFP Architecture Overview(3/9) Miss Independent Execution: When an L1 data cache load miss occurs, a poison bit is set in the destination reorder buffer entry of the load. Load-dependent instructions in the reservation stations (RS) capture the poison bit from the common writeback data bus. They are then scheduled by the reservation station control logic for pseudo-execution. Pseudo-execution of poisoned instructions does not use any execution units. However, it does consume RS dispatch ports and writeback bus cycles to propagate poison bits through instruction dependences and to identify all instructions in the reservation stations that depend on the load miss data. After pseudo-execution, miss-dependent instructions stay in their reservation stations until they are woken up for real execution when the load miss data arrives, or until they are moved into the waiting buffer because their resources are needed to unblock the execution pipeline and execute miss-independent instructions.
Tuned CFP Architecture Overview(4/9) Replay Loop and Miss Dependent Execution: Figure 5 shows the reduced replay loop in Tuned CFP, consisting of two stages: the reservation stations (RS) and the waiting buffer (WB). The waiting buffer acts as second-level storage for the reservation stations. With virtual register renaming, entries can be freely evicted from the RS to the WB and later loaded back into the RS to be scheduled for execution. Evicting miss-dependent instructions to the WB only when resources are needed significantly reduces the number of replayed instructions, especially for medium-latency load misses, which are those that miss the L1 data cache but hit the on-chip L2 cache. For a load miss to DRAM, the long miss latency often causes the instruction buffers to fill up. When the load miss is serviced, the miss load and its dependents are re-inserted from the waiting buffer into the reservation stations, from which they are scheduled for execution.
Tuned CFP Architecture Overview(5/9) Tuned CFP Reservation Stations: The poison bits propagate the dependences from L1 data cache misses to later instructions in the program to identify instructions that may encounter long data cache miss delays. These instructions are candidates to move to the waiting buffer to avoid pipeline stalls that could occur if any of the reservation station, reorder buffer, load queue, or store queue arrays becomes full. Four conditions are checked to determine whether an instruction should be moved to the waiting buffer: 1) the instruction is at the head of the RS list, 2) the instruction is poisoned, 3) one of the RS, reorder buffer, load queue, or store queue arrays is full, and 4) every source operand of the instruction is either poisoned or ready. The last condition ensures that the miss-dependent instructions carry their non-poisoned input values with them.
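The four eviction conditions listed above can be written as a single predicate. The sketch below is illustrative only: the `RSEntry` record, the string encoding of operand state, and the function name `should_move_to_wb` are assumptions introduced here, not names from the presentation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RSEntry:
    at_rs_head: bool                 # condition 1
    poisoned: bool                   # condition 2
    # State of each source operand: "ready", "poisoned", or "waiting".
    source_states: Tuple[str, ...]   # used for condition 4


def should_move_to_wb(
    entry: RSEntry,
    rs_full: bool,
    rob_full: bool,
    load_queue_full: bool,
    store_queue_full: bool,
) -> bool:
    """Return True only when all four conditions hold."""
    # Condition 3: at least one instruction buffer is full.
    any_buffer_full = rs_full or rob_full or load_queue_full or store_queue_full
    # Condition 4: no source is still waiting on an in-flight, clean producer,
    # so the entry can carry its non-poisoned input values into the WB.
    sources_settled = all(s in ("ready", "poisoned") for s in entry.source_states)
    return (
        entry.at_rs_head
        and entry.poisoned
        and any_buffer_full
        and sources_settled
    )


entry = RSEntry(at_rs_head=True, poisoned=True, source_states=("poisoned", "ready"))
print(should_move_to_wb(entry, False, False, False, False))  # False: nothing is full
print(should_move_to_wb(entry, False, True, False, False))   # True: the ROB is full
```

The first call shows the tuned policy at work: a poisoned instruction at the RS head stays where it is as long as no buffer is full.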
Tuned CFP Architecture Overview(6/9) Waiting Buffer: The waiting buffer is a wide, single-ported SRAM array managed as a circular buffer using head and tail pointers. Miss-dependent RS entries at the head of the RS array move to the tail of the waiting buffer when any of the instruction buffers fills up due to data cache misses. When a data cache miss completes, Tuned CFP replays the miss-dependent entries by loading them back from the head of the waiting buffer to the tail of the RS. These replayed instructions do not need to be renamed again. Their virtual register names are still valid, so the RS can use them to schedule these instructions and to capture their results from the writeback bus into the reservation stations of any dependent instructions, including instructions that have not been replayed but are still waiting in the RS.
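A circular buffer with head and tail pointers, as described for the waiting buffer, can be modeled as follows. This is a software sketch only: the `WaitingBuffer` class, its method names, and the use of strings as entries are hypothetical, and the model ignores the SRAM width and single-port timing.

```python
from typing import List, Optional


class WaitingBuffer:
    """FIFO of evicted RS entries, managed with head and tail pointers."""

    def __init__(self, capacity: int) -> None:
        self._slots: List[Optional[str]] = [None] * capacity
        self._head = 0    # next entry to replay
        self._tail = 0    # next free slot for an evicted entry
        self._count = 0

    def is_full(self) -> bool:
        return self._count == len(self._slots)

    def is_empty(self) -> bool:
        return self._count == 0

    def evict_from_rs(self, entry: str) -> None:
        """Append an evicted RS entry at the tail."""
        if self.is_full():
            raise OverflowError("waiting buffer is full")
        self._slots[self._tail] = entry
        self._tail = (self._tail + 1) % len(self._slots)  # wrap around
        self._count += 1

    def replay_to_rs(self) -> str:
        """Remove and return the entry at the head for re-insertion in the RS."""
        if self.is_empty():
            raise IndexError("waiting buffer is empty")
        entry = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)  # wrap around
        self._count -= 1
        return entry


wb = WaitingBuffer(capacity=2)
wb.evict_from_rs("A")
wb.evict_from_rs("B")
first = wb.replay_to_rs()   # "A": entries replay in eviction order
wb.evict_from_rs("C")       # reuses the slot freed by "A"
print(first, wb.replay_to_rs(), wb.replay_to_rs())  # A B C
```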
Tuned CFP Architecture Overview(7/9) Register File and Results Integration: Tuned CFP has a specialized register file for checkpointing register state at the load miss, for later use in handling miss-dependent branch mispredictions and exceptions. Figure 6 shows the Tuned CFP retirement register file cell with checkpoint flash copy support. Tuned CFP uses a flash copy of the RRF to create checkpoints. In one cycle, every independent RRF state bit (leftmost latch) is shifted into a checkpoint latch within the register cell (center latch). The register file can be restored from the checkpoint in one cycle by asserting RSTR_CLK. The Tuned CFP register file cell also contains one context bit for the dependent RRF state (rightmost latch). To integrate results back into one context, a restore cycle is performed from the dependent context into the independent context. However, not all registers are copied. Figure 6 shows that only poisoned registers are copied, by using the poison bits to enable the clock of the copy operation.
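The checkpoint, restore, and poison-gated integration steps can be mirrored in a small behavioral model. This is a sketch under assumptions: the `RetirementRegisterFile` class and its method names are hypothetical, dictionaries stand in for the latches of Figure 6, and the rule that a miss-independent write clears a register's poison bit is an assumption of this model rather than something stated in the slides.

```python
from typing import Dict, Set


class RetirementRegisterFile:
    """Behavioral model of an RRF with checkpoint and dependent contexts."""

    def __init__(self, initial: Dict[str, int]) -> None:
        self.independent: Dict[str, int] = dict(initial)  # leftmost latch
        self.checkpoint: Dict[str, int] = dict(initial)   # center latch
        self.dependent: Dict[str, int] = {}               # rightmost latch
        self.poison: Set[str] = set()

    def take_checkpoint(self) -> None:
        # Flash copy: every independent bit is copied into its checkpoint latch.
        self.checkpoint = dict(self.independent)

    def restore(self) -> None:
        # Rollback: recover the state captured at the load miss.
        self.independent = dict(self.checkpoint)
        self.dependent.clear()
        self.poison.clear()

    def mark_poisoned(self, reg: str) -> None:
        self.poison.add(reg)

    def retire_independent(self, reg: str, value: int) -> None:
        self.independent[reg] = value
        self.poison.discard(reg)  # assumption: a clean write clears the poison

    def retire_dependent(self, reg: str, value: int) -> None:
        self.dependent[reg] = value

    def integrate(self) -> None:
        # Copy only poisoned registers from the dependent context.
        for reg in self.poison:
            if reg in self.dependent:
                self.independent[reg] = self.dependent[reg]
        self.poison.clear()
        self.dependent.clear()


rrf = RetirementRegisterFile({"r1": 0, "r2": 5})
rrf.take_checkpoint()           # load miss: save the state
rrf.mark_poisoned("r1")         # r1 will be produced by the miss load
rrf.retire_independent("r2", 7)
rrf.retire_dependent("r1", 42)  # replayed load retires its result
rrf.integrate()
print(rrf.independent)          # {'r1': 42, 'r2': 7}
```

Calling `restore()` instead of `integrate()` would return the independent context to `{'r1': 0, 'r2': 5}`, which models the rollback path used for miss-dependent mispredictions and exceptions.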
Tuned CFP Architecture Overview(8/9) Figure 6. RRF cell checkpoint and result integration
Tuned CFP Architecture Overview(9/9) Load and Store Execution in Tuned CFP: To maintain proper memory ordering of loads and stores from independent and dependent instruction execution, Tuned CFP, like previous CFP proposals, uses load and store queues (LSQ), a Store Redo Log (SRL), and a store-set memory dependence predictor. All stores, dependent and independent, are allocated entries (and IDs) in the SRL in program order at the rename stage of the pipeline. Every load, dependent or independent, carries the SRL ID of the last prior store. To support concurrent, speculative execution of dependent and independent loads and stores, the Tuned CFP L1 data cache has two new states: Speculative Independent (Spec_Ind) and Speculative Dependent (Spec_Dep). A block that is not in one of these two states is considered committed and is in one of the states defined by the cache coherence protocol, e.g., the MESI coherence protocol.
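The SRL ID assignment at rename can be illustrated with a short sketch. The function name `assign_srl_ids`, the string encoding of operations, and the use of `None` for a load with no prior store are assumptions made here for illustration; the slides do not specify how that case is encoded.

```python
from typing import List, Optional, Tuple


def assign_srl_ids(ops: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Tag each store with a new SRL ID and each load with the last store's ID.

    `ops` lists operation kinds in program order: "store", "load", or
    anything else for non-memory instructions.
    """
    next_srl_id = 0
    last_store_id: Optional[int] = None
    tagged: List[Tuple[str, Optional[int]]] = []
    for op in ops:
        if op == "store":
            last_store_id = next_srl_id   # allocated in program order
            next_srl_id += 1
            tagged.append((op, last_store_id))
        elif op == "load":
            tagged.append((op, last_store_id))  # ID of the last prior store
        else:
            tagged.append((op, None))           # not tracked by the SRL
    return tagged


print(assign_srl_ids(["load", "store", "alu", "load", "store", "load"]))
# [('load', None), ('store', 0), ('alu', None), ('load', 0), ('store', 1), ('load', 1)]
```

With these tags, a load can tell which stores precede it in program order regardless of whether it executes in the dependent or the independent thread.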
SIMULATION RESULTS(1/7) We built our Tuned CFP architecture model on the SimpleScalar ARM ISA simulation infrastructure (www.simplescalar.com) and used all 14 "C" benchmarks from SPEC 2000 and SPEC 2006 that we succeeded in compiling with the SimpleScalar cross-compiler tool. Table 1 shows the simulated machine configurations. Table 2 and Table 3 show relevant execution statistics of S-CFP and Tuned CFP. Figure 7 shows the speedup of Tuned CFP over S-CFP and over a similarly sized conventional superscalar core.
SIMULATION RESULTS(2/7) Table 1. Simulated machine configurations
SIMULATION RESULTS(3/7) Figure 7. Tuned CFP speedup over baseline and S-CFP
SIMULATION RESULTS(4/7) Table 2. Tuned CFP execution statistics
SIMULATION RESULTS(5/7) Table 3. S-CFP and Tuned CFP replay, rollback and wrong path statistics
SIMULATION RESULTS(6/7) Figure 8. Power increase of S-CFP and Tuned CFP over baseline
SIMULATION RESULTS(7/7) Table 4. S-CFP and Tuned CFP power consumption relative to baseline, averaged over all benchmarks
CONCLUSION This paper presents a Tuned Continual Flow Pipeline architecture that uses virtual register renaming and optimized replay policies to improve performance and to reduce replay loop circuit activity and checkpoint rollback execution compared to previous CFP designs. Our Tuned CFP architecture improves performance and power consumption over previous CFP architectures by ~15% and ~9%, respectively.