A Billion Cycles a Day: Industrial Verification
Matthew Heath
Presentation to Synthesis & Verification Class
May 8, 2003
Based on “Validating the Intel Pentium 4 Microprocessor” by Bob Bentley, DAC 2001

How do you verify a design with...
- 42 million transistors
- 1 million lines of RTL code
- 600–1000 people working on it
- A 3-year design time
- Daily design changes

How do you verify a design which has bugs like this??
- The FMUL instruction, when the rounding mode is set to “round up”, incorrectly sets the sticky bit when the source operands are:
  $\mathrm{src1}[67:0] = X \cdot 2^{i+15} + 1 \cdot 2^{i}$
  $\mathrm{src2}[67:0] = Y \cdot 2^{j+15} + 1 \cdot 2^{j}$
  where $i + j = 54$ and $\{X, Y\}$ are integers
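To get a feel for how narrow this operand pattern is, the sketch below generates 68-bit operand pairs that satisfy the condition on the slide. It does not model the FMUL hardware or the sticky-bit logic. The function names and the use of a seeded random generator are assumptions made for illustration.

```python
import random

WIDTH = 68      # operands are src[67:0]
GAP = 15        # distance between the lone low bit and the upper field
EXP_SUM = 54    # constraint from the slide: i + j = 54


def make_operand(upper: int, shift: int) -> int:
    """Return upper * 2**(shift + GAP) + 2**shift, checked to fit in WIDTH bits."""
    value = (upper << (shift + GAP)) + (1 << shift)
    if value >> WIDTH:
        raise ValueError("operand does not fit in 68 bits")
    return value


def operand_pairs(i: int, rng: random.Random, count: int):
    """Yield (src1, src2) pairs matching the slide's pattern for a given i."""
    j = EXP_SUM - i
    # Both upper fields need at least one bit of room inside 68 bits.
    if not (0 <= i <= WIDTH - GAP - 1 and 0 <= j <= WIDTH - GAP - 1):
        raise ValueError("i must leave room for both upper fields")
    max_x = (1 << (WIDTH - (i + GAP))) - 1
    max_y = (1 << (WIDTH - (j + GAP))) - 1
    for _ in range(count):
        x = rng.randint(0, max_x)
        y = rng.randint(0, max_y)
        yield make_operand(x, i), make_operand(y, j)


for src1, src2 in operand_pairs(20, random.Random(0), 3):
    print(f"{src1:017x} * {src2:017x}")
```

Multiplying the two forms out shows what the pairs have in common: the product is always $2^{54}$ plus a multiple of $2^{69}$, so it has a single 1 at bit 54, zeros in the 14 bits above it, and zeros everywhere below it. Random operands are very unlikely to land on that shape by chance.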

And the answer is...
- Hire 70+ validation engineers
- Buy several thousand compute servers
- Write 12,000 validation tests
- Run up to 1 billion simulation cycles per day for 200 days
- Check 2,750,000 manually-defined properties
- Find, diagnose, track, and resolve 7,855 bugs
- Apply formal verification with 10,000 proofs to the instruction decoder and FP units
  - This found that obscure FMUL bug!
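The simulation numbers above are easier to judge when converted to time on real silicon. The sketch below does that arithmetic. The 1 GHz clock is an assumption chosen as a round number, not a figure from the slides, and the daily rate is the slide's upper bound.

```python
CYCLES_PER_DAY = 10**9       # "up to 1 billion simulation cycles per day"
DAYS = 200                   # "for 200 days"
ASSUMED_CLOCK_HZ = 10**9     # assumption: a 1 GHz part, for scale only

total_cycles = CYCLES_PER_DAY * DAYS
seconds_of_silicon = total_cycles / ASSUMED_CLOCK_HZ

print(f"total simulated cycles: {total_cycles:.1e}")
print(f"equivalent run time at the assumed clock: {seconds_of_silicon:.0f} s")
```

Under that assumption, the whole pre-silicon simulation effort covers only a few minutes of execution on the finished chip.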

We know why validation is hard for tools. Why is it hard for people who run them?
- To meet an aggressive tapeout schedule, design and validation must occur in parallel without one blocking the other.
- Validation starts before the design is done
- Design changes occur while validation tests are running
- Both design and validation must continue in the presence of known, unfixed bugs

The design team
- 300 designers write RTL code
- Refer to architectural spec, textbooks, research papers, conversations
- Start with basic functionality and progressively add features according to project staging plan
- Do simple self-checks along the way

The validation team
- 100 validators write RTL tests
- Refer to same sources as designers, plus the RTL implementation itself
- Write functional tests to exercise features as they’re implemented
- Run tests on RTL simulator
- Diagnose failures
- File bug reports in central database

The management
- Collect and analyze data
  - Pass/fail status of tests
  - Bug database statistics (counts, priority, age, discovery rate, fix rate, etc.)
  - RTL feature implementation progress
- Compare trends with project schedule
- Respond if necessary
  - Re-allocate resources to high-risk areas
  - Prioritize work

SRTL = “Structural” RTL
- Boolean equations; no behavioral syntax
- State-accurate
  - RTL state maps directly to schematic state
- High-level constructs supported
  - Macros, constants, loops, vectors
- Design hierarchy
  - Full-chip has 6 clusters
  - Each cluster has several units
  - Each unit has tens of functional blocks
  - Each block has $O(10^4)$ transistors
  - Each designer owns several functional blocks

SRTL models
- Cluster and full-chip level
- Full-chip models consume ~1 GB of disk space
  - Compiled, executable SRTL code
  - Source code
- Test environments
  - Include emulation of external logic
  - Direct control over interface signals
  - Pre-defined sets of signals commonly selected for tracing during test debug
  - Library of useful test fragments

Most design work at cluster level
- Decouples cluster and full-chip validation
- Designers “graft” to latest cluster models
  - Check-out and edit selected source files
  - Incremental model build
  - Run validation tests
- Revision control system
  - Designers check-in edited source files
  - Log messages include change descriptions, author, timestamp

Cluster model release process
- Designers periodically turn-in selected checked-in versions of source files
  - Coordinated turn-ins sometimes necessary
- Cluster model builders process turn-ins
  - Merge changes from different versions of the same source file included in multiple turn-ins
  - Compile an executable cluster SRTL model
  - Run tests provided by the validators
  - Report test failures to validators and designers for debug
- Acceptable models released to design team for future grafts

Full-chip model release process
- Same process, different hierarchy
- Cluster model builders don designer’s hat
  - Graft to full-chip model
  - Edit based on changes to recent cluster models
  - Incremental full-chip model build
  - Run full-chip validation tests
  - Debug failures, full-chip turn-in
- Now full-chip model builders take over...
  - Process turn-ins from all clusters
  - Run full-chip validation tests again!
  - Release full-chip models to design team

Netbatch
- $10^9$ simulation cycles/day $= 10\ \mathrm{Hz} \times 10^5\ \mathrm{sec/day} \times 10^3$ computers
- Netbatch manages compute server workload
  - For a given SRTL model and set of tests, create a job file and send it to netbatch
  - Each sub-team has a netbatch allocation
  - Jobs exceeding allocation enter wait queue
  - Wait times of 24+ hrs not uncommon
- Test results
  - Pass/fail statistics
  - Failure time and meaningful error message
  - Traces of user-selected system state
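The throughput estimate on this slide rounds a day to $10^5$ seconds. The sketch below redoes the arithmetic with the exact 86,400 seconds per day and works out how many servers the billion-cycle target would need at the slide's 10 Hz simulation rate. The function name is illustrative, and the model assumes every server simulates around the clock with no queueing or idle time.

```python
SIM_RATE_HZ = 10                  # simulated cycles per second per server (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400, which the slide rounds to 10**5
SERVERS = 10**3                   # server count used in the slide's estimate


def servers_needed(target_cycles_per_day: int) -> int:
    """Servers required to hit a daily cycle target, rounded up."""
    per_server = SIM_RATE_HZ * SECONDS_PER_DAY
    return -(-target_cycles_per_day // per_server)  # integer ceiling division


rounded_estimate = SIM_RATE_HZ * 10**5 * SERVERS
exact_estimate = SIM_RATE_HZ * SECONDS_PER_DAY * SERVERS

print(f"slide estimate: {rounded_estimate:,} cycles/day")
print(f"with 86,400 s/day: {exact_estimate:,} cycles/day")
print(f"servers for 10**9 cycles/day: {servers_needed(10**9):,}")
```

The exact figure comes out somewhat below a billion for 1,000 servers, which fits the slide's order-of-magnitude framing.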

Efficiency improvements
- An SRTL change made by a designer...
  - Appears in a cluster model 1 week later
  - Appears in a full-chip model 2 weeks later
- Validators find bugs in released models which the designer has already fixed
- “Onion peeling” vs. “whack-a-mole” debug
  - Temporarily disabling failing properties
  - Releasing models which fail some tests
- System state capture and restore
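One way to picture "temporarily disabling failing properties" is a checker that skips properties tied to a known, already-filed bug, so that a test keeps reporting only new failures. The sketch below is a hypothetical illustration of that idea and is not the tooling described in the talk. The class, the property names, and the bug ID are all made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]  # hypothetical snapshot of named signal values


@dataclass
class PropertyChecker:
    properties: Dict[str, Callable[[State], bool]]
    waived: Dict[str, str] = field(default_factory=dict)  # property name -> bug id

    def waive(self, name: str, bug_id: str) -> None:
        """Disable a property while the bug that makes it fail is still open."""
        if name not in self.properties:
            raise KeyError(name)
        self.waived[name] = bug_id

    def check(self, state: State) -> List[str]:
        """Return the names of enabled properties that fail on this state."""
        return [
            name
            for name, holds in self.properties.items()
            if name not in self.waived and not holds(state)
        ]


checker = PropertyChecker({
    "grant_is_onehot": lambda s: bin(s["grant"]).count("1") <= 1,
    "count_in_range": lambda s: s["count"] <= 8,
})
state = {"grant": 0b11, "count": 9}

before = checker.check(state)            # both properties fail
checker.waive("grant_is_onehot", "BUG-0001")
after = checker.check(state)             # only the un-waived failure remains
```

Recording the bug ID with each waiver makes it straightforward to re-enable the property once the fix reaches a released model.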

Central bug database
- Released model version
- Failing validation test & symptoms
- Root cause
- Requested design change
- Priority
- Log of discussion among designers, validators, and managers
- Status / disposition
  - New, ETA, test fixed, design fixed (& version), validated, dropped
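The fields listed above map naturally onto a record type. The sketch below is one possible shape for such a record, with a helper that produces the kind of per-status counts management tracks. Field names, types, and the sample entries are assumptions; the slides do not describe the real database schema or which status transitions are allowed.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    # Values follow the status list on the slide.
    NEW = "new"
    ETA = "ETA"
    TEST_FIXED = "test fixed"
    DESIGN_FIXED = "design fixed"
    VALIDATED = "validated"
    DROPPED = "dropped"


@dataclass
class BugReport:
    model_version: str                      # released model the bug was seen in
    failing_test: str
    symptoms: str
    priority: int                           # assumption: smaller number = more urgent
    root_cause: Optional[str] = None
    requested_change: Optional[str] = None
    discussion: List[str] = field(default_factory=list)
    status: Status = Status.NEW
    fixed_in_version: Optional[str] = None  # set when status is DESIGN_FIXED


def status_counts(bugs: List[BugReport]) -> Counter:
    """Count bugs per status, e.g. for a weekly trend report."""
    return Counter(bug.status for bug in bugs)


# Made-up sample entries, for illustration only.
bugs = [
    BugReport("cluster-A.17", "fp_round_042", "sticky bit mismatch", priority=1),
    BugReport("cluster-A.17", "decode_007", "test expects wrong opcode", priority=3,
              status=Status.TEST_FIXED),
]
counts = status_counts(bugs)
```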

Bug root causes

Schematic formal verification
- Use formal techniques because schematic simulation takes too long
- Schematic design starts long before SRTL design is done
- Bottom-up
  - Verify SRTL macros vs. library cells first
  - Black-box macrocells & verify block
- Because SRTL is state-accurate, verification is combinational only!
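The last bullet is the key simplification: if every SRTL state element has a matching schematic state element, it is enough to compare the logic feeding each pair, one function at a time. The sketch below shows that idea on a single flop input, with exhaustive enumeration standing in for whatever engine a real equivalence checker uses. The gate-level "schematic" function is a made-up example.

```python
from itertools import product
from typing import Callable, Optional, Tuple


def srtl_flop_input(x: int, y: int) -> int:
    """SRTL view of the logic feeding a flop: Z = X & Y."""
    return x & y


def schematic_flop_input(x: int, y: int) -> int:
    """Hypothetical schematic view: a NAND gate followed by an inverter."""
    nand = 1 - (x & y)
    return 1 - nand


def find_mismatch(f: Callable[..., int], g: Callable[..., int],
                  n_inputs: int) -> Optional[Tuple[int, ...]]:
    """Return an input assignment where f and g differ, or None if none exists."""
    for bits in product((0, 1), repeat=n_inputs):
        if f(*bits) != g(*bits):
            return bits
    return None


print(find_mismatch(srtl_flop_input, schematic_flop_input, 2))  # no mismatch
print(find_mismatch(srtl_flop_input, lambda x, y: x | y, 2))    # a counterexample
```

No reasoning across clock cycles is needed here, which is what keeps the problem tractable at the block level.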

One SRTL state may map to multiple functionally equivalent schematic states
  Z = X & Y
  MSFF (Z, W, CLK)
[Figure: schematic variants of the flop stage above, labeled with signals X, Y, Z (Z1, Z2), W (W1, W2), and CLK]

Retiming must be back-annotated into SRTL
- Exception: Inverters
  Z = X & Y
  MSFF (Z, W, CLK)

  MSFF (X, Y, CLK)
  Z = ~Y
[Figure: schematics for the SRTL fragments above, labeled with signals X, Y, Z (Z1, Z2), W, and CLK]
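A plausible reading of the inverter exception is that moving an inverter across a flop only complements the stored bit, so SRTL and schematic state still correspond one-to-one. The sketch below checks that reading on the second fragment, `MSFF (X, Y, CLK); Z = ~Y`, against a version with the inverter moved ahead of the flop. The retimed model is an assumption, since the original figure did not survive extraction.

```python
from itertools import product
from typing import List, Sequence


def run_srtl(xs: Sequence[int], init: int) -> List[int]:
    """MSFF (X, Y, CLK); Z = ~Y: the flop stores X and the output inverts it."""
    y, out = init, []
    for x in xs:
        out.append(1 - y)   # Z = ~Y
        y = x               # flop captures X on the clock edge
    return out


def run_retimed(xs: Sequence[int], init_inverted: int) -> List[int]:
    """Assumed schematic: inverter ahead of the flop, so the flop stores ~X."""
    q, out = init_inverted, []
    for x in xs:
        out.append(q)       # Z is the flop output directly
        q = 1 - x           # flop captures ~X on the clock edge
    return out


def outputs_match(length: int = 4) -> bool:
    """Compare outputs for all input sequences, mapping state by inversion."""
    for init in (0, 1):
        for xs in product((0, 1), repeat=length):
            if run_srtl(xs, init) != run_retimed(xs, 1 - init):
                return False
    return True


print(outputs_match())
```

The two models agree only when the initial flop values are complements of each other, which is the state mapping a combinational checker would have to account for.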

Conclusion
- Efficient verification of large-scale designs is a daunting management challenge
- Design and validation are concurrent, not iterative
- Possible with adequate resources and powerful tools to use the resources efficiently
- Methodology constraints keep the problem tractable
- Clear communication among team
- Careful documentation
- Progress tracking is key to staying on schedule
- Motto: “If it hasn’t been verified, it doesn’t work.”

How NOT to do verification...
Arnold was unhappily aware that the complete Jurassic Park program contained more than half a million lines of code, most of it undocumented, without explanation...
“What are you doing, John?”
“Checking the code.”
“By inspection? That’ll take forever.”
- Michael Crichton, Jurassic Park
