Published by Christy Killingbeck. Modified over 9 years ago.
1
IFRA: Instruction Footprint Recording & Analysis for Post-Silicon Bug Localization
Sung-Boem Park, Subhasish Mitra
Robust Systems Group, Departments of Electrical Eng. & Computer Sc., Stanford University
2
Key Message
Post-silicon bug localization: major bottleneck
  Pinpoint from system failure: bug location, exposing stimulus
  Existing schemes: expensive & not scalable
IFRA: new technique for processors
  Eliminates limitations of existing techniques
  96% accuracy
  1% area, ~0% performance impact
3
Outline
  Motivation
  IFRA Overview
  Simulation Results
  Conclusion
4
Microprocessor Development Flow
Pre-silicon: Design, Pre-Silicon Verification
Post-silicon: Post-Silicon Validation, Manufacturing Test
Post-silicon validation costs: 35% of development time, 25% of design resources
"Post-silicon cost & complexity is rising faster than design cost" (S. Yerramilli, VP, Intel, ITC06 Invited Address)
5
Post-Silicon Validation Steps
Detect: run test content in system (e.g., OS, games, functional tests)
Localize: pinpoint from system failure (e.g., crash)
  Bug location: e.g., ALU, decoder, scheduler
  Exposing stimulus: e.g., instruction sequence
  Dominates cost [Josephson DAC06]
Root cause & fix: optical probing, patch / circuit edit / respin
6
Post-Silicon Bug Types [Josephson DAC06]
Functional bugs: incorrect logic implementation (e.g., design errors)
  Short localization time: e.g., hours to days
Electrical bugs / circuit marginalities (our focus)
  e.g., speed-path, noise, races, hold time
  Some voltage / temp / frequency corners
  Long localization time: e.g., days to weeks
7
Existing Post-Silicon Bug Localization Flows
Tester-based: detect in system -> reproduce failure on tester (2 days; not always possible) -> localize on tester (3 days)
System-based: detect in system -> localize failure in system (1 to 4 weeks)
Major problems: failure reproduction, system-level simulation
8
IFRA vs. Existing Techniques
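Techniques compared: trace buffers, clock manipulation, checkpoint + replay, scan techniques, IFRA
Criteria: Intrusive? Failure reproduction? System-level simulation? Area impact?
[Comparison table: most cell values were lost in extraction; surviving entries are "?", "Yes", "No", "Yes", "1%"]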
9
Instruction Footprint Recording & Analysis
Design phase
  Insert recorders inside chip design
Post-Si validation
  Run tests / record special info. in recorders
  Failure detected? No: continue. Yes: scan out recorder contents
  Post-analyze offline
  Localized bug: (location, stimulus)
Properties
  Non-intrusive
  No failure reproduction: single test run sufficient
  No system simulation: self-consistency against test program binary
10
Outline
  Motivation
  IFRA Overview
    Hardware Support
    Automated Post-Analysis Techniques
  Simulation Results
  Conclusion
11
IFRA Hardware in Superscalar Processor
[Figure: Alpha 21264 superscalar pipeline with IFRA hardware added]
Pipeline stages: FETCH, DECODE, DISPATCH, ISSUE, EXECUTE, COMMIT
Blocks shown: Branch Predictor, I-Cache, I-TLB, Fetch Queue, Pipeline Registers, Decoders, Reg Rename, Reg Map, Reg Free, Phys Regfile, Instruction Window, 2xBr, 2xALU, MUL, 2xLSU, FPU, D-Cache, D-TLB, Reorder Buffer
IFRA additions
  ID assignment
  Recorders (part of scan chain)
  Post-trigger generator (slow wire, no at-speed routing)
  Scan chain
12
Recording Operation Example
[Figure: INST1 and INST2 moving through FETCH and DECODE]
FETCH (Branch Predictor, I-TLB, I-Cache, Fetch Queue)
  ID assignment (special ID assignment rule): INST1 gets ID1, INST2 gets ID2
  Recorder 1 stores instruction footprints: (ID1, auxiliary info: PC1), (ID2, auxiliary info: PC2)
DECODE (Pipeline Reg, Decoder)
  IDs travel with the instructions through the pipeline registers
  Recorder 2 stores instruction footprints: (ID1, auxiliary info: decoded bits1), (ID2, auxiliary info: decoded bits2)
13
Special Rule for Instruction ID Assignment
Simplistic ID assignment inadequate
  Speculation + flushes, out-of-order execution
  PC does not work for loops
Special ID assignment rule: formal proof in paper
  ID width: log2(4n) bits, where n = max. instructions in flight
  e.g., 8 bits for Alpha-like processor (n = 64)
No timestamp or global synchronization required
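The ID-width figure on this slide can be checked with a few lines of Python. This is only a sketch: the function name `id_width_bits` is made up for illustration, and rounding up when 4n is not a power of two is an assumption, since the slide only gives the n = 64 case.

```python
def id_width_bits(max_in_flight: int) -> int:
    """Bits needed for an instruction ID under the slide's log2(4n) rule.

    Rounding up for non-power-of-two values is an assumption; the slide
    only states the n = 64 example.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    # ceil(log2(m)) for an integer m >= 1 equals (m - 1).bit_length(),
    # which avoids floating-point rounding.
    return (4 * max_in_flight - 1).bit_length()


print(id_width_bits(64))  # 8, matching the slide's Alpha-like example
```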
14
Instruction Footprint Recorder Design
Recorder contents: instruction ID + auxiliary info.
Dominated by memory
Simple control logic
  Idle cycle compaction
  Circular buffer control
  Serialization
  Stop / start recording
No high-speed global routing
Contents scanned out after failure detection
[Figure: post-trigger signal -> control logic -> circular buffer -> to slow scan chain]
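The recorder behaviour listed on this slide (circular buffer, idle cycle compaction, stop on post-trigger, serialized scan-out) can be modelled in software roughly as follows. This is a toy sketch, not the hardware design from the paper: the class and method names are hypothetical, and stopping immediately on the post-trigger is a simplification.

```python
from collections import deque
from typing import List, NamedTuple, Optional


class Footprint(NamedTuple):
    instr_id: int  # instruction ID assigned at fetch
    aux: int       # stage-specific auxiliary info (e.g., PC or a residue)


class Recorder:
    """Toy software model of one footprint recorder (hypothetical names)."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops its oldest entry when full,
        # which gives circular-buffer behaviour.
        self._buf = deque(maxlen=capacity)
        self._recording = True

    def clock(self, footprint: Optional[Footprint]) -> None:
        """Call once per cycle; None models an idle cycle."""
        if not self._recording or footprint is None:
            return  # idle cycle compaction: idle cycles consume no storage
        self._buf.append(footprint)

    def post_trigger(self) -> None:
        """Stop recording so older footprints are not overwritten."""
        self._recording = False

    def scan_out(self) -> List[Footprint]:
        """Return the contents oldest first, as a serial scan-out would."""
        return list(self._buf)


rec = Recorder(capacity=3)
for fp in [Footprint(1, 0x100), None, Footprint(2, 0x104),
           Footprint(3, 0x108), Footprint(4, 0x10C)]:
    rec.clock(fp)
rec.post_trigger()
rec.clock(Footprint(5, 0x110))  # ignored: recording has stopped
print([fp.instr_id for fp in rec.scan_out()])  # [2, 3, 4]
```

The oldest footprint (ID 1) is overwritten once the buffer wraps, and nothing is stored for the idle cycle or after the post-trigger.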
15
What to Record?
Columns: Pipeline stage | Auxiliary information | Bits per recorder | Number of recorders
[Some numeric cells were lost in extraction; surviving values are listed in the order they appear]
Fetch | PC | 32 | 4
Decode | Decoding results
Dispatch | 2-bit residue of reg. name | 6
Issue | 3-bit residue of operands
Execution (ALU, MUL) | 3-bit residue of result | 3
Execution (Branch) | None | 2
Execution (Load/Store unit) | 32-bit memory address | 35
Commit | Exceptions | ~0
Total required storage for all recorders: 60 KBytes
16
Post-Trigger Generation
[Timeline: code execution time, starting at t = 0]
Error after a billion cycles (e.g., speedpath)
Failure after 2 billion cycles (e.g., crash)
Too much storage overhead to store 1 billion cycles
17
Post-Trigger Generation
[Timeline: code execution time, starting at t = 0]
Error after a billion cycles (e.g., speedpath)
Failure after 2 billion cycles (e.g., crash)
Need to capture in recorder storage
  Early failure detection necessary
Early failure detection techniques (post-triggers)
  Classical error detection: residue, parity
  Deadlock & segfault detection
  Special early warnings to pause recording
  Details in paper
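Residue and parity checks, named on this slide as classical error detection, can be illustrated as below. This is a generic sketch of the two techniques, not the post-trigger circuitry from the paper: the function names are hypothetical, the modulus of 3 is an assumption, and the addition uses unbounded Python integers, so fixed-width overflow is ignored.

```python
def residue(value: int, modulus: int = 3) -> int:
    """Residue code of a value (modulus 3 is an assumed choice)."""
    return value % modulus


def add_passes_residue_check(a: int, b: int, result: int, modulus: int = 3) -> bool:
    """True if `result` is residue-consistent with a + b.

    For a correct sum, residue(a + b) == (residue(a) + residue(b)) mod m.
    """
    expected = (residue(a, modulus) + residue(b, modulus)) % modulus
    return residue(result, modulus) == expected


def parity(value: int) -> int:
    """Even/odd parity bit of a non-negative integer."""
    return bin(value).count("1") % 2


good = 12 + 30            # 42
bad = good ^ (1 << 3)     # single bit-flip in the result: 34

print(add_passes_residue_check(12, 30, good))  # True
print(add_passes_residue_check(12, 30, bad))   # False: error detected
print(parity(good) != parity(bad))             # True: parity also differs
```

A single bit-flip changes a value by plus or minus a power of two, and no power of two is divisible by 3, so a mod-3 residue check catches any single bit-flip in the result.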
18
IFRA Area Impact
1% chip-level area impact
  Synopsys Design Compiler synthesis
  Alpha-like processor: 2MB L2 cache
  TSMC 130nm technology
No global at-speed routing
Area dominated by circular buffers in recorders
  Total recorder storage: 60 KBytes
19
Outline
  Motivation
  IFRA Overview
    Hardware Support
    Post-Analysis Techniques
  Simulation Results
  Conclusion
20
Post-Analysis Overview
Inputs: test program binary, footprints from recorders
Link footprints
(Not covered today: details in paper)
Run high-level analysis
  Control-flow analysis
  Data-dependency analysis
  Decoding analysis
  Load/Store analysis
Run low-level analysis
  Residue consistency check
Output: list of bug location-stimulus pairs
21
Linking Footprints from Recorder Contents
[Figure: over time, entries of the fetch-stage recorder (ID, PC), execution-stage recorder (ID, AUX) and commit-stage recorder (ID, AUX) are linked by instruction ID to instructions INST0 to INST7 (PC0 to PC7) in the test program binary]
Special ID assignment rule ensures:
  Uncommitted instructions uniquely identified
  Relative orders of identical IDs maintained
  Even under flushes & out-of-order execution
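The idea of linking recorder entries by ID, relying on the relative order of identical IDs, can be sketched as below. This is a deliberately simplified illustration and not the linking algorithm from the paper: it assumes every recorder saw the same dynamic instructions and that nothing was flushed, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[int, str]  # (instruction ID, auxiliary info), oldest first


def link_footprints(recorders: Dict[str, List[Entry]]) -> List[Dict[str, str]]:
    """Group entries from several recorders into per-instruction footprints.

    Assumption (not from the slides): all recorders hold the same set of
    dynamic instructions, with no flushes. Entries sharing an ID are
    paired by occurrence order, newest first.
    """
    linked: Dict[Tuple[int, int], Dict[str, str]] = defaultdict(dict)
    for name, entries in recorders.items():
        seen: Dict[int, int] = defaultdict(int)
        for instr_id, aux in reversed(entries):   # newest entry first
            key = (instr_id, seen[instr_id])      # (ID, k-th most recent use)
            linked[key][name] = aux
            seen[instr_id] += 1
    # Keep only instructions for which every recorder has an entry.
    return [fp for fp in linked.values() if len(fp) == len(recorders)]


recorders = {
    "fetch":   [(4, "PC0"), (5, "PC1"), (4, "PC2")],
    # Out-of-order execution: ID 5 executes first, but the two
    # instructions carrying ID 4 keep their relative order.
    "execute": [(5, "AUX11"), (4, "AUX10"), (4, "AUX12")],
}
for footprint in link_footprints(recorders):
    print(footprint)
```

Each printed dictionary pairs one fetch-stage entry with the execute-stage entry of the same dynamic instruction, even though the two recorders saw the instructions in different orders.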
22
Debug Example
Link footprints -> high-level analysis -> low-level analysis -> bug locations + exposing stimulus
[Figure: decision tree of questions ("?") traversed during the analyses]
23
Debug Example – Decision 1
Test program binary + fetch-stage recorder -> serial execution trace:
  …
  R0 ← R1 + R2
  R0 ← R3 + R6
  R5 ← R0 + R6
  …
24
Debug Example – Question 1
Residue of values mismatch?
  Recorded residues compared (issue-stage recorder vs. execute-stage recorder): R0=3 and R0=5
Serial execution trace:
  …
  R0 ← R1 + R2
  R0 ← R3 + R6    (producer of R0)
  R5 ← R0 + R6    (consumer of R0; RAW hazard)
  …
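The producer/consumer pairing used in this question can be sketched as follows: walk the serial execution trace backwards from the consumer to find the most recent writer of the source register, then compare the residues recorded for the two instructions. This is an illustrative sketch only; the names are hypothetical, and the residues 3 and 5 are simply the two values shown on the slide.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: str               # architectural destination register
    srcs: Tuple[str, ...]   # architectural source registers


def find_producer(trace: List[Instr], consumer_idx: int, reg: str) -> Optional[int]:
    """Index of the most recent earlier instruction that writes `reg`."""
    for i in range(consumer_idx - 1, -1, -1):
        if trace[i].dest == reg:
            return i
    return None  # value was produced before the start of the trace


def residues_mismatch(produced: int, consumed: int) -> bool:
    """True if the recorded residues of a RAW pair disagree."""
    return produced != consumed


# Serial execution trace from the slide.
trace = [
    Instr("R0", ("R1", "R2")),
    Instr("R0", ("R3", "R6")),
    Instr("R5", ("R0", "R6")),
]
producer = find_producer(trace, consumer_idx=2, reg="R0")
print(producer)                  # 1: the second write to R0 is the producer
print(residues_mismatch(3, 5))   # True: the two recorded residues differ
```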
25
Debug Example – Question 2
Residue of phys. reg. names mismatch?
  Dispatch-stage recorder: R0=P5 and R0=P2
Serial execution trace:
  …
  R0 ← R1 + R2
  R0 ← R3 + R6    (producer of R0)
  R5 ← R0 + R6    (consumer of R0; RAW hazard)
  …
26
Debug Example – Question 3
Residue of phys. reg. name match with previous producer?
  Dispatch-stage recorder: R0=P5 and R0=P5
Serial execution trace:
  …
  R0 ← R1 + R2    (previous producer)
  R0 ← R3 + R6    (producer of R0)
  R5 ← R0 + R6    (consumer of R0; RAW hazard)
  …
27
Debug Example – Result
[Figure: dispatch-stage datapath with Decoder, Pipeline Register (Arch. Dest. Reg, rest of pipeline reg.), Reg. Mapping (Read Circuit, Write Circuit) and the rest of the modules in the dispatch stage; annotations mark the bug location, the instruction that stimulates the bug, and the path that propagates to failure]
Instruction sequence:
  R0 ← R1 + R2
  R0 ← R3 + R6
  R5 ← R0 + R6
  …
28
Outline
  Motivation
  IFRA Overview
  Simulation Results
  Conclusion
29
Experimental Setup
SimpleScalar architectural simulator
  Alpha configuration
  Augmented with ~1K error injection points
Error model: single bit-flips
  Hard-to-repeat electrical bugs
  Both flip-flops & combinational logic
Stimulus
  SpecInt 2000 benchmarks
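The single bit-flip error model can be illustrated with a small helper. This is a generic sketch, not the injection code used in the experiments: the function names, the 32-bit width and the use of a seeded `random.Random` are all assumptions made for the example.

```python
import random
from typing import Tuple


def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the given bit position inverted."""
    return value ^ (1 << bit)


def inject_single_bit_flip(value: int, width: int,
                           rng: random.Random) -> Tuple[int, int]:
    """Flip one randomly chosen bit of a `width`-bit signal.

    Returns the corrupted value and the bit position that was flipped.
    """
    bit = rng.randrange(width)
    return flip_bit(value, bit), bit


rng = random.Random(0)  # seeded so a run can be repeated
original = 0x0000_1234
corrupted, bit = inject_single_bit_flip(original, width=32, rng=rng)
# Exactly one bit differs between the original and corrupted values.
print(bin(original ^ corrupted).count("1"))  # 1
```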
30
Experimental Flow
[Flowchart]
Steps: warm up for a million cycles -> inject error -> any failure detected? (no: masked / silent error) -> short error latency? -> post-analyze
Outcomes of post-analysis: exact localization, localization with candidates, complete miss
Scale: 100K simulation runs, 800 post-analysis runs
31
IFRA Bug Localization Results
Correct localization (96%)
Exact localization (78%)
Localization with avg. 6 candidates (22%)
Complete miss (4%)
Localization resolution
  Bug exposing stimulus
  One of 200 erroneous design blocks
  Avg. block size: 10K 2-input NAND gates
32
Outline
  Motivation
  IFRA Overview
  Simulation Results
  Conclusion
33
Conclusion
IFRA
  Inexpensive
    1% area, no expensive logic analyzers
    No failure reproduction or system simulation
  Effective
    96% accuracy
  Practical
    Alpha processor demonstration
34
Acknowledgement
  Bob Gottlieb, Intel
  Nagib Hakim, Intel
  Ted Hong, Stanford University
  Doug Josephson, Intel
  Onur Mutlu, Microsoft Research
  Priyadarshan Patra, Intel
  Eric Rentschler, AMD
  Jason Stinson, Intel
35
Debug Example – Decision 4
Did they coexist in reorder buffer?
  (More than n instructions in between)
Serial execution trace:
  …
  R0 ← R1 + R2
  R0 ← R3 + R6    (producer of R0)
  R5 ← R0 + R6    (consumer of R0; RAW hazard)
  …
36
Debug Example – Low Level Analysis
[Figure: low-level analysis of the dispatch stage during code execution of R0 ← R1 + R2; R0 ← R3 + R6; R5 ← R0 + R6. Blocks shown: Decoder, Pipeline Register (Arch. Src. Reg, Arch. Dest. Reg, Rest), Reg. Free List, Reg. Mapping (Read Circuit, Write Circuit; entries shown: R4, P2, R0, P5), Dispatch Stage Recorder (stores residue of phys. reg.). Annotations: Stimulus, Bug Location]