On characterizing hardware platforms Ganesh Gopalakrishnan Lecture of 2-9-09, Week 5, CS 5966/6966
2 Efficient Multiprocessors must have Efficient Shared Memory Systems * Hide the cost of memory operations by postponing updates * Increasingly important because CPU speeds are growing faster than memory-system speeds
3 Responsibilities of engineers (software or hardware) Design things to work Strive to break your own creation (before others do so!) Importance of this message in the context of Collier’s work Hardware platforms are TOO complex We do not really understand what a platform does once it is all built We need ways to register things against DESIGN INTENT ARCHTEST does it by running test programs on the raw hardware Test model-checking does what ARCHTEST does, except during model checking MPEC (our execution checker) does it by obtaining executions from somewhere and validating them (much like ARCHTEST’s simulation mode) Post-Silicon Verification: how silicon debugging is getting harder!
4 How to build Efficient Shared-memory Multiprocessor Systems? Employ weak memory models –They permit global state updates to be postponed Employ aggressive shared memory consistency protocols –Weak memory models permit shared memory consistency protocols to be aggressive without undue complexity (no speculation, etc.) The focus of this talk is on weak memory models
5 Weak memory models allow multiple executions... Two CPUs share one memory. CPU 1: st c,1 ; st d,2 CPU 2: ld d ; ld c One possible execution: ld d, 2 ; ld c, 1 (possible under SC and under Itanium) Another execution: ld d, 2 ; ld c, 0 (impossible under SC, possible under Itanium)
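The claim that the second execution is impossible under SC can be checked by brute force. The sketch below enumerates every interleaving of the two programs that keeps each CPU's own instruction order and collects the values CPU 2 can load. It assumes c and d both start at 0; the function names are illustrative and not part of any tool mentioned in the talk.

```python
from itertools import combinations

# Litmus test from the slide: CPU 1 stores, CPU 2 loads.
# Assumption: c and d both start at 0.
CPU1 = [("st", "c", 1), ("st", "d", 2)]
CPU2 = [("ld", "d"), ("ld", "c")]


def interleavings(p, q):
    """Yield every merge of p and q that keeps each program's own order."""
    n = len(p) + len(q)
    for slots in combinations(range(n), len(p)):
        p_iter, q_iter = iter(p), iter(q)
        yield [next(p_iter) if k in slots else next(q_iter) for k in range(n)]


def sc_outcomes(p, q):
    """Return the set of (d, c) values the loads can observe under SC."""
    outcomes = set()
    for schedule in interleavings(p, q):
        memory = {"c": 0, "d": 0}
        loaded = {}
        for op in schedule:
            if op[0] == "st":
                memory[op[1]] = op[2]          # store: update shared memory
            else:
                loaded[op[1]] = memory[op[1]]  # load: read current value
        outcomes.add((loaded["d"], loaded["c"]))
    return outcomes


outcomes = sc_outcomes(CPU1, CPU2)
print(sorted(outcomes))     # every (d, c) pair reachable under SC
print((2, 0) in outcomes)   # is "ld d, 2 ; ld c, 0" reachable?
```

Only six interleavings exist for this test, so exhaustive enumeration is cheap. The same approach grows combinatorially with program length, which is part of why larger executions need a different technique.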
6 Problems with Weak Memory Models Hard to understand (easy to misunderstand) P: st [x] = 1 ; mf ; ld r1 = [y] Q: st.rel [y] = 1 R: ld.acq r2 = [y] ; ld r3 = [x] Is this legal under Itanium? (no)
7 Post-Si verification of MP Orderings today (oversimplified) New MP System: assembly program 1 ... assembly program n yield assembly execution 1 ... assembly execution n Run repeatedly to catch one interleaving that might reveal a bug Check every execution against ordering rules for compliance * This is done ad hoc * How to make this formal and efficient? * How to capitalize on repeated re-runs?
8 Explanation of Illegal Executions (p 31 of Itanium App Note – search ) P: us: st [x] = 1 ; mf: mf ; ul1: ld r1 = [y] Q: sr: st.rel [y] = 1 R: la: ld.acq r2 = [y] ; ul2: ld r3 = [x] US >> MF ; hence RVr(US) F(MF) MF >> UL1 ; hence F(MF) R(UL1) …many reasons… hence R(UL1) RVp(SR) If RVr(SR) R(UL1) and RVr(SR) UL1 RVp(SR), WB release atomicity of SR is violated, thus R(UL1) RVr(SR) …five lines of reasons Hence RVr(SR) R(LA) Since LA >> UL2, R(LA) R(UL2) Another para of reasons LV(Sr2) R(UL2) LV(SR1) RVp(SR1) RVq(SR1) F(MF1) R(UL1) RVq(SR2) RVp(SR2). But can’t allow due to atomicity of SR.
9 Checking Executions and Providing Explanations (present approach) P: st [x] = 1 ; mf ; ld r1 = [y] Q: st.rel [y] = 1 R: ld.acq r2 = [y] ; ld r3 = [x] Published approaches are very labor-intensive paper-and-pencil proofs Clearly this can’t scale (a 6-instruction MP program takes one page of detailed mathematical proof) What about the combinatorics of reasoning about 200 instructions? Approaches actually used within the industry involve the use of “checkers” Details of these checkers are unknown (How complete? How scalable?)
10 Our Approach Itanium ordering rules written in Higher Order Logic → Mechanical Program Derivation → Checker Program MP execution to be checked (P: st [x] = 1 ; mf ; ld r1 = [y] Q: st.rel [y] = 1 R: ld.acq r2 = [y] ; ld r3 = [x]) → Checker Program → Satisfiability Problem with Clauses carrying annotations → Sat Solver If Sat: Explanation in the form of one possible interleaving If Unsat: Unsat Core Extraction using Zcore → Find Offending Clauses → Trace their annotations → Determine “ordering cycle” Instruction listing shown on the slide: st [x] = 1 ; mf ; ld r1 = [y] ; ld.acq r2 = [y] ; ld r3 = [x] ; st.rel [y] = 1
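The last step of this flow, determining an "ordering cycle", can be pictured as cycle detection in a directed graph whose edges carry the rule annotation that justified them. The sketch below is not the authors' tool: it is a plain depth-first search, the edge list assumes an execution in which r1 = 0, r2 = 1 and r3 = 0 (the slide does not give the values), and the rule labels are informal descriptions rather than names from the Itanium specification.

```python
def find_cycle(edges):
    """Return one cycle as a list of (src, dst, rule) edges, or None.

    edges: iterable of (src, dst, rule) triples, where rule is the
    annotation explaining why src must be ordered before dst.
    """
    graph = {}
    for src, dst, rule in edges:
        graph.setdefault(src, []).append((dst, rule))
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []                      # edges on the current DFS path

    def visit(node):
        color[node] = GREY
        for dst, rule in graph[node]:
            if color[dst] == GREY:
                # Back edge: the cycle is the path suffix that starts at dst.
                cycle = path + [(node, dst, rule)]
                start = next(i for i, e in enumerate(cycle) if e[0] == dst)
                return cycle[start:]
            if color[dst] == WHITE:
                path.append((node, dst, rule))
                found = visit(dst)
                if found:
                    return found
                path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical ordering edges for the six-instruction example
# (labels us, mf, ul1, sr, la, ul2 as on the explanation slide).
EDGES = [
    ("us", "mf", "program order into fence"),
    ("mf", "ul1", "fence before later load"),
    ("ul1", "sr", "ul1 read the old value of y"),
    ("sr", "la", "la read the value written by sr"),
    ("la", "ul2", "acquire before later load"),
    ("ul2", "us", "ul2 read the old value of x"),
]

cycle = find_cycle(EDGES)
for src, dst, rule in cycle:
    print(f"{src} -> {dst}  ({rule})")
```

Keeping the annotation on each edge is what lets a report name the rules that put the instructions into a cycle, instead of only reporting that the execution is illegal.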
11 Largest example tried to date (courtesy S. Zeisset, Intel) Proc 1: st8 [12ca20] = 7f869af546f2f14c ld r25 =  … 58 more instructions… st2 [7c2a00] = 4bca Proc 2: ld4 r24 = [733a74] st4.rel  = 96ab4e1f … 67 more instructions… ld8 r87 =  Initially the tool gave a trivial violation Diagnosed to be forgotten memory initialization Added a method to incorporate memory initialization in our tool Our tool found the exact same cycle as pointed out by the author of the test Sat generation and Sat solving times need improving Cycle found through our tool: st.rel (line 18, P1) → ld (line 22, P2) → mf → ld (line 30, P2) → st (line 11, P1)
12 Statistics Pertaining to Case Study Proc 1 st8 [12ca20] = 7f869af546f2f14c ld r25 =  … 58 more instructions… st2 [7c2a00] = 4bca Proc 2 ld4 r24 = [733a74] st4.rel  = 96ab4e1f … 67 more instructions… ld8 r87 =  All runs were on a GHz 1GB Redhat Linux V9 Athlon ~2 minutes to generate Sat instance 14,053,390 clauses 117,823 variables ~1 minute to solve Sat problem - found Unsat Unsat Core generation runs fast – gave 23 clauses! - 23 of the 14M clauses were causing the problem to be Unsat - Sat time for these 23 clauses … under a second Unsat Core’s annotations were traced back to offending instructions and the memory ordering rules that situated them in a “cycle”
13 The rest of the talk Our focus in this talk is on ARCHTEST and Post-Silicon Verification Some basic definitions How ARCHTEST does its job How ARCHTEST can be ported to model checking How Post-Silicon verification looms large in multi-core…
14 One standard way of specifying atomicity: All other events “e” are strictly before or strictly after the atomic set Another standard way of specifying atomicity: If some event “e” is between two events in the atomic set, then “e” also belongs to the atomic set
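Both definitions can be written as predicates over a totally ordered trace, which makes it easy to see that they agree. The sketch below assumes the trace is a sequence of distinct event names and the atomic set is a set of those names; the function names are made up for this illustration.

```python
from itertools import permutations


def atomic_by_exclusion(trace, atomic):
    """First definition: every event outside the atomic set is strictly
    before all atomic events or strictly after all of them."""
    positions = [i for i, e in enumerate(trace) if e in atomic]
    if not positions:
        return True
    first, last = positions[0], positions[-1]
    return all(i < first or i > last
               for i, e in enumerate(trace) if e not in atomic)


def atomic_by_closure(trace, atomic):
    """Second definition: any event lying between two atomic events
    is itself a member of the atomic set."""
    positions = [i for i, e in enumerate(trace) if e in atomic]
    if not positions:
        return True
    return all(e in atomic for e in trace[positions[0]:positions[-1] + 1])


ATOMIC = {"a1", "a2"}
good = ["e1", "a1", "a2", "e2"]   # nothing intrudes between a1 and a2
bad = ["a1", "e1", "a2", "e2"]    # e1 sits inside the atomic set

print(atomic_by_exclusion(good, ATOMIC), atomic_by_closure(good, ATOMIC))
print(atomic_by_exclusion(bad, ATOMIC), atomic_by_closure(bad, ATOMIC))

# The two predicates agree on every ordering of these four events.
agree = all(atomic_by_exclusion(p, ATOMIC) == atomic_by_closure(p, ATOMIC)
            for p in permutations(["a1", "a2", "e1", "e2"]))
print(agree)
```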
15 On ARCHTEST
16 Why test platforms? Need an “X-Ray machine” for the real hardware Provides uncanny insights into what is going on inside the memory subsystem of multiprocessors
17 How does ARCHTEST define memory models? MEM Model = A(CMP, X, Y, Z, …) CMP is Computational Ordering: the ordering guaranteed by a correct uniprocessor PER ADDRESS (hazard behavior OK) operating under a correct cache coherence protocol X, Y, Z are other aspects of ordering, but for DIFFERENT memory addresses
18 How does ARCHTEST define memory models? A(CMP, PO) = a machine that guarantees program ordering An execution violating A(CMP,PO) –Initially (A,B,U,V) = (0,0,0,0) –Terminally (A,B,U,V) = (1,1,0,0) –Program: P1: A=1; U=B; P2: B=1; V=A; Generalizing this to testing is achieved by not assuming a synchronized beginning…
19 How does ARCHTEST define memory models? P1: A=1; Y=B; A=2; Y=B; … P2: B=1; X=A; B=2; X=A; … Violates PO iff one of these two holds: 1) X[j] < i /\ Y[i] < j, OR 2) X[j] > i /\ Y[i] > j Easy to see why by drawing a circuit of event dependencies…
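The two conditions translate directly into a check over the recorded values. The sketch below reads the slide's notation as follows, which is an assumption: Y[i] is the value of B that P1 read right after writing A=i, and X[j] is the value of A that P2 read right after writing B=j, with 0 standing for the initial value. It is an illustration only and not ARCHTEST's own code.

```python
def po_violations(X, Y):
    """Return (i, j, case) triples that satisfy one of the two conditions.

    X: dict mapping round j to the value of A that P2 read after B=j.
    Y: dict mapping round i to the value of B that P1 read after A=i.
    case is 1 or 2, matching the numbering of the conditions.
    """
    found = []
    for i, y in Y.items():
        for j, x in X.items():
            if x < i and y < j:
                # P2 missed A=i although P1 had not yet seen B=j.
                found.append((i, j, 1))
            elif x > i and y > j:
                # Each processor saw a write the other issues only later.
                found.append((i, j, 2))
    return found


# Both processors read the initial value after their first write:
# the violating execution with terminal (A,B,U,V) = (1,1,0,0).
print(po_violations({1: 0}, {1: 0}))

# P1 runs both rounds before P2 starts: consistent with program order.
print(po_violations({1: 2, 2: 2}, {1: 0, 2: 0}))
```

Because the comparison is between round numbers and observed values, the test does not need the two processors to start at the same moment, which matches the remark about not assuming a synchronized beginning.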
20 How does ARCHTEST define memory models? For all intents and purposes, –Sequential consistency = A(CMP, PO, WA) –WA = write atomicity Using ARCHTEST we can run many tests at once It plots the degree of deviation in terms of the “lag” of data with respect to different addresses (data written by one processor; seen how late and how much out of order by the other…?) See an example in class…
21 Post-Silicon Verification under Limited Observability Ganesh Gopalakrishnan School of Computing, University of Utah, Salt Lake City, UT Ching Tsun Chou Intel Corporation, 3600 Juliette Lane, Santa Clara, CA Supported in part by NSF award CCR
22 Why Post-Silicon Verification? Why verify the silicon? Isn’t doing FV enough? (!) –FV cannot be applied to entire MP systems yet MP systems contain several CPUs and several “chip-sets” We cannot verify the silicon exhaustively - so why bother? –Formal analysis applied to particular executions can yield far more insights than ad hoc criteria applied to executions e.g. “Runtime Verification” of software (Havelund, Rosu, Lee,..)
23 Why Post-Silicon Verification? Runtime verification can cover more! –1 GHz in silicon instead of 100 Hz during simulation –With well-designed “stress tests” one often finds out a lot
24 Where Post-Si Verification fits in the Hardware Verification Flow Specification Validation Design Verification Testing for Fabrication Faults Post-Silicon Verification product Does functionality match designed behavior? Pre-manufacture Post-manufacture Spec
25 More Facts about Post-Silicon Verification Post-Si Verification can be for uniprocessor functionality.. or to determine if MP Orderings are being obeyed... or to check if cache coherence protocols are behaving Directly impacts the time to market The industry spends huge amounts of effort in this area Great opportunities to apply FV
26 How Formal Methods can enhance Post-Si Verification Reduces manual effort Helps in test-case selection Helps analyze execution results comprehensively
27 Overview of the talk How the paradigm for post-Si verification must change How Limited Observability impacts post-Si verification The use of Constraints A paper design for a Post-Si verification system based on constraints - based on actual experience developing prototypes in an industrial context Concluding Remarks
28 Post-Si Verification for Cache Protocol Execution PRESENT-DAY Assume there is a “front-side bus” Record bus transactions in response to test programs Generate detailed cache states from bus transactions See if behavior matches cache coherence protocol that was supposedly realized cpu …. mem “Front-side Bus”
29 Post-Si Verification for Cache Protocol Execution Future CANNOT Assume there is a “front-side bus” CANNOT Record all link traffic CAN ONLY Generate sets of possible cache states HOW BEST can one match against designed behavior? cpu Invisible “miss” traffic Visible “miss” traffic
30 Potential Carry-over of Techniques Runtime verification of distributed embedded systems Hundreds of processors, FPGAs, SoCs,... interacting Cannot assume system will work correctly on its own Must detect onset of crashes, intrusions,... EARLY Cannot easily observe all the nodes Even if observable, information corrupts - bandwidth limitations (need to compress / discard) - time uncertainties
31 Back to our specific problem domain... Verify the operation of systems at runtime when we can’t see all transactions Could also be offline analysis of a partial log of activities a b x y c d a x c d y b …
32 Possible Outcomes of Post-Si Verification Observed Behavior is “Definitely wrong” “Potentially dangerous” (rely on statistics to give this verdict?) “Worth noting” (based on past experience and bug logs?) ….. “Totally benign” (not even worth noting event) Caveat: we are partially observing a potentially incorrect system
33 Concrete example: Coherence Protocol Verification Requester Home Potential Owners …. req sreq sresp Retries or Completion Direct Supply of Data
34 Packet encodings, and example trace-file Req Home Users …. req sreq resp req / sreq packet fields: Pkt_type, mid, tid, sender, dest, addr, data resp packet fields: Pkt_type, mid, tid, sender, dest, data All the packets pertaining to a transaction share the same mid and tid Address not shipped with responses A transaction and various packets it may involve: req, first-snoop-req, subseq-snoop-reqs, subseq-snoop-resps, Data, Completion
35 The actual trace-file is an interleaving of the packets of all active transactions: The actual trace-file analyzed looks something like this: The transactions may pertain to the same address (or not); many of the shown events may be missing… Individual transactions and their possible temporal overlap
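Since all packets of a transaction share the same mid and tid, a first pass over an interleaved trace can split it back into transactions and recover each transaction's address from a packet that carries one. The sketch below uses the field names from the packet layout; the packet-type strings and all values in the sample trace are invented for illustration.

```python
from collections import defaultdict, namedtuple

# Fields follow the packet layout; addr is None for responses,
# because the address is not shipped with them.
Packet = namedtuple("Packet", "pkt_type mid tid sender dest addr data")


def group_transactions(trace):
    """Split an interleaved trace into per-transaction packet lists."""
    transactions = defaultdict(list)
    for pkt in trace:
        transactions[(pkt.mid, pkt.tid)].append(pkt)
    return dict(transactions)


def transaction_address(packets):
    """Recover a transaction's address from any packet that carries one."""
    for pkt in packets:
        if pkt.addr is not None:
            return pkt.addr
    return None  # every address-carrying packet was on an invisible link


# Hypothetical trace: two transactions whose packets interleave.
trace = [
    Packet("req", 1, 7, "cpu0", "home", 0x40, None),
    Packet("req", 2, 3, "cpu1", "home", 0x80, None),
    Packet("sreq", 1, 7, "home", "cpu1", 0x40, None),
    Packet("resp", 2, 3, "home", "cpu1", None, 0xAB),
    Packet("resp", 1, 7, "cpu1", "cpu0", None, 0xCD),
]

by_txn = group_transactions(trace)
for key, packets in by_txn.items():
    print(key, [p.pkt_type for p in packets], transaction_address(packets))
```

When only responses of a transaction are visible, `transaction_address` returns `None`, which is one concrete way limited observability shows up in the analysis.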
36 Transaction (packet) semantics: Requester Potential Owners …. p p p p Each packet “p” can only be issued under certain cache-line states After issuing it, the cache-line state often changes After receiving a packet, the cache-line state changes These details are VERY complex, and often need to be extracted from cache protocol tables...
37 Verification consists of abstract interpretation driven by transaction history: (figure: caches c1, c2, c3, c4 shown at successive steps) Knowing transaction (packet) semantics, we can compute the sets of possible states that each cache line can be in after each packet goes by... (well, during offline analysis). An error is flagged when an inconsistency is noted in the sets of cache states.
38 General approach: Know all possible communication patterns of various transactions, and how to record progress along a particular pattern; use constraints to bridge gap. Communication patterns State within comm. pattern
39 How many of the packets can be invisible? At first cut (and based on some practical experience) having one missing in any “causal loop” seems tolerable – more than one appears TOO under-constrained. OK Not OK
40 General statements pertaining to invisibility OR In a “fork/join” situation, how many responses can be invisible? Generally there are invariants governing the responses (e.g., “at most one supplier of the value”) If one response is invisible, we can assume it met the invariant -- and remember this to cross-check against future behavior If more than one response is invisible, we will have to increase the space of assumptions If we do not see a response, we have to delay “closing out” the transaction till another pertinent event involving the same address occurs
41 Verification of Mutual Exclusion of Resource Usage (proper arbitration): Possible idea: Assume that the “first snoop request” tells who won the arbitration Snoop of Check: Transaction 1 must “close-out” before transactions 2 and 3 are found to make progress Tr 1 Tr 2 Tr 3 Expected overlap of transactions under proper arbitration Problem: What if the first snoop request was on an invisible link?
42 Approach initially tried Wrote a prototype in OCaml to analyze a given cache protocol execution trace For each new packet read, its corresponding communication pattern and state within the communication pattern were determined For each packet, we obtained WP and SP –WP : Weakest Precondition (in a sense) –The most general set of cache states under which the packet could be generated –SP : Strongest Postcondition (in a sense) –The tightest set of states the cache could be in after the packet is sent –Many transaction-types and “conflict situations” made state maintenance and update highly unstructured (about 8 versions of the code were written, with each version soon becoming ugly)
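The WP/SP idea can be sketched as set operations on the possible states of one cache line. This is a toy in Python and not the OCaml prototype: the state names (M, E, S, I), the packet types and every table entry below are hypothetical stand-ins for what would really come from the cache protocol tables.

```python
ALL_STATES = frozenset("MESI")

# Hypothetical rules: packet type -> (WP, SP) for the sender's cache line.
# WP: states from which the packet could be generated.
# SP: states the line could be in after the packet is sent.
RULES = {
    "read_req": (frozenset("I"), frozenset("I")),      # miss, still waiting
    "data_resp": (frozenset("MES"), frozenset("SI")),  # supplier downgrades
    "writeback": (frozenset("M"), frozenset("I")),     # dirty line evicted
}


def first_inconsistency(packet_types, start=ALL_STATES):
    """Replay packets sent by one cache for one line.

    Returns the index of the first packet that no possible state allows,
    or None if the whole sequence is consistent with the rules.
    """
    possible = start
    for index, pkt_type in enumerate(packet_types):
        wp, sp = RULES[pkt_type]
        if not (possible & wp):
            return index   # packet could not have been generated here
        possible = sp      # tightest set of states after sending
    return None


print(first_inconsistency(["writeback", "read_req"]))  # consistent sequence
print(first_inconsistency(["read_req", "writeback"]))  # flags the writeback
```

Starting from `ALL_STATES` models a line about which nothing is known yet; each observed packet narrows the set, and an empty intersection with a packet's WP is the inconsistency that gets flagged.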
43 A Conflict Scenario (for example) Requester Home Potential Owners …. req sreq sresp Retries or Completion Direct Supply of Data Requester issues “flush” packet Arbitration conflict at home Packet sent back for re-issue Meanwhile another request gets past home Home sends new request to requester New request “hijacks” flush-line away! Transaction never gets reissued
44 Constraints to the rescue.... but.... Constraint-programming was viewed as a possible solution – Would permit local behavior to be expressed in terms of constraints –Constraint formalisms can “solve” for missing information But traditional constraint frameworks were found inadequate – After extensive search, we could not find a constraint paradigm that can deal with interacting automata –What we need is a method for back-tracing precursors to observed actions –When multiple observations trace back to the same precursor, we can ‘vote the precursor up or down’ – Conditional probabilities of events are involved in guiding search
45 Approach being planned for implementation Given a packet, determine comm pattern and state within comm pattern Trace precursors along comm pattern till we reach origin of transaction (which is at a cache where the transaction missed and issued) Determine the cache state for the particular transaction using the WP rule for the packet
46 Approach being planned for implementation If cache state not previously determined, mark it speculative If cache state previously determined and present WP determines a compatible cache state, convert “speculative” to committed If previously determined cache state is being contradicted by present WP, mark cache state unknown and trigger backtracing (cancel this precursor computation path and explore another)
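The three marking rules can be sketched as a small ledger keyed by cache. Several details below are assumptions, since the slides describe a plan and not an implementation: "compatible" is taken to mean that the two state sets intersect, a committed entry keeps the intersection, an entry marked unknown stays unknown until the caller resets it, and the backtracing step itself is left out. The state names are hypothetical.

```python
SPECULATIVE, COMMITTED, UNKNOWN = "speculative", "committed", "unknown"


class CacheStateLedger:
    """Tracks, per cache, the state set inferred so far and its status."""

    def __init__(self):
        self.entries = {}  # cache id -> (frozenset of states, status)

    def record(self, cache, wp_states):
        """Fold one WP-derived state set into the ledger.

        Returns the resulting status; UNKNOWN tells the caller to cancel
        this precursor path and explore another (not modelled here).
        """
        wp_states = frozenset(wp_states)
        if cache not in self.entries:
            # Not previously determined: deposit a speculative state.
            self.entries[cache] = (wp_states, SPECULATIVE)
            return SPECULATIVE
        states, status = self.entries[cache]
        if status == UNKNOWN:
            return UNKNOWN  # assumption: stays unknown until reset
        common = states & wp_states
        if common:
            # Compatible with the earlier inference: vote it up.
            self.entries[cache] = (common, COMMITTED)
            return COMMITTED
        # Contradiction: vote it down and signal backtracing.
        self.entries[cache] = (frozenset(), UNKNOWN)
        return UNKNOWN


ledger = CacheStateLedger()
print(ledger.record("c1", {"S", "E"}))  # first sighting
print(ledger.record("c1", {"S"}))       # second path agrees
print(ledger.record("c1", {"M"}))       # third path contradicts
```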
47 Cache Agent that was a “responder” for one transaction may be “originator” for another.... Responder to two different transactions How two precursor computations may lead back in time to a common node and how we will have to “vote” its cache state (red deposits a speculative state - purple votes it up or down...)
48 Why today’s constraint approaches don’t give these capabilities readily.. Today’s constraint solving approaches (“CSP”) appear to be about “static” situations Various algorithms based on arc consistency and propagators can be found in the literature Temporal Concurrent Constraint Programming is in its infancy (I also don’t know much about these areas... tell me if I’m wrong! But I’ve not seen very much despite intense literature searches...) Constraint Solving in the context of Coupled Reactive Processes can have multiple uses Environments such as Comet (van Hentenryck) may offer a powerful way to organize such a constraint-based system
49 Constraint Languages Surveyed (and some evaluated...) GNU Prolog, SICStus Prolog, Mozart / Oz, Erlang, FaCiLe.. or even Murphi perhaps? Reading List (Books / Papers...) Stuckey’s book on Constraint Logic Programming, Dechter’s book on Constraints, Modeler++ / Localizer++ / Comet Ultimately will roll our own constraint system
50 Concluding Remarks Limited Observability is going to be a central concern in future system verification Plenty of opportunities for formal methods, constraint-solving methods, and abstract interpretation methods to work in concert Formal Methods communities must talk to other communities to significantly enhance the scope and relevance of what they are doing –testing communities –diagnosis communities