1 Verifying remote executions: from wild implausibility to near practicality Michael Walfish NYU and UT Austin

2 Acknowledgment Andrew J. Blumberg (UT), Benjamin Braun (UT), Ariel Feldman (UPenn), Richard McPherson (UT), Nikhil Panpalia (Amazon), Bryan Parno (MSR), Zuocheng Ren (UT), Srinath Setty (UT), and Victor Vu (UT).

3 Problem statement: verifiable computation. The client sends “f” and x to the server; the server returns y (plus auxiliary information); the client checks whether y = f(x) without computing f(x). The motivation is 3rd-party computing: cloud, volunteers, etc. We want this to be: 1. unconditional, meaning no assumptions about the server; 2. general-purpose, meaning arbitrary f; 3. practical, or at least conceivably practical soon.

4 Theory can help. Consider the theory of Probabilistically Checkable Proofs (PCPs) [ ALMSS 92, AS 92 ]: the client sends “f” and x, the server returns y together with a proof, and the client accepts or rejects. But the constants are outrageous: under a naive PCP implementation, verifying multiplication of 500×500 matrices would cost 500+ trillion CPU-years, and the proof does not save work for the client.

5 This research area is thriving. Several strands of theory have been refined [ CMT ITCS 12, TRMP HOTCLOUD 12, BCGT ITCS 13, GGPR EUROCRYPT 13, PGHR OAKLAND 13, Thaler CRYPTO 13, BCGTV CRYPTO 13, … ]. We have reduced the costs of a PCP-based argument system [ IKO CCC 07 ] by 20 orders of magnitude, and we have implemented the refinements [ HOTOS 11, NDSS 12, SECURITY 12, EUROSYS 13, OAKLAND 13, SOSP 13 ]. We predict that PCP-based machinery will be a key tool for building secure systems.

6 (1) Zaatar: a PCP-based efficient argument [ NDSS 12, SECURITY 12, EUROSYS 13 ] (2) Pantry: extending verifiability to stateful computations [ SOSP 13 ] (3) Landscape and outlook

7 Zaatar incorporates PCPs, but not like this: the client sends “f” and x, the server returns y together with a PCP, and the client accepts or rejects. (The proof is not drawn to scale: it is far too long to be transferred.) Even the asymptotically short PCPs [ BGHSV CCC 05, BGHSV SIJC 06, Dinur JACM 07, Ben-Sasson & Sudan SIJC 08 ] seem to have high constants. We move out of the PCP setting: we make computational assumptions. (And we allow the number of query bits to be superconstant.)

8 Zaatar uses an efficient argument [ Kilian CRYPTO 92, 95; IKO CCC 07 ]: instead of transferring the PCP, the server commits to a proof vector w (commit request / commit response). The client then sends PCP queries q1, q2, q3, …, and the server answers each query q with the dot product q·w (conceptually, PCPQuery(q) { return q·w; }). The client runs the efficient PCP checks [ ALMSS 92 ] and outputs ACCEPT / REJECT.

9 The server’s vector w encodes an execution trace of f(x) [ ALMSS 92 ]. What is in w? (1) An entry for each wire of the circuit (the inputs x0, x1, …, xn, the internal wires, and the outputs y0, y1, …); and (2) an entry for the product of each pair of wires.
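The following is a minimal sketch (not Zaatar’s code) of this “linear PCP” structure: the proof is the linear function π(q) = ⟨q, w⟩, where w packs the wire values z together with all pairwise products; the field size and wire values below are illustrative.

```python
import itertools, random

p = 2**61 - 1   # illustrative prime field; real systems fix a cryptographically chosen field

def build_proof_vector(z):
    """w = (z, z tensor z): one entry per wire, plus one per pair of wires."""
    pairwise = [(zi * zj) % p for zi, zj in itertools.product(z, z)]
    return z + pairwise

def pcp_query(w, q):
    """The server's answer to a query is just a dot product over F_p."""
    return sum(qi * wi for qi, wi in zip(q, w)) % p

z = [3, 5, 15]                           # e.g., wires of a single multiplication gate: 3 * 5 = 15
w = build_proof_vector(z)
q = [random.randrange(p) for _ in w]     # the verifier's queries are vectors over F_p
print(pcp_query(w, q))
```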

10 The full interaction [ Kilian CRYPTO 92, 95; IKO CCC 07; ALMSS 92 ] (commit request, commit response, queries q1, q2, q3, …, responses q1·w, q2·w, q3·w, …, efficient checks, ACCEPT / REJECT) is still too costly (by a factor of 10^23), but it is promising.

11 Zaatar incorporates refinements to [ IKO CCC 07 ], with proof. The client sends “f” and x; the server computes y and the proof vector w; after the commit request / commit response exchange, the client sends query vectors q1, q2, q3, …, the server returns the response scalars q1·w, q2·w, …, and the client’s checks output ACCEPT / REJECT.

12 The client amortizes its overhead by reusing the query vectors q1, q2, q3, … over multiple runs, with proof vectors w(1), w(2), w(3), …. Each run has the same f but a different input x.

13 For run j: the client sends “f” and x(j); the server computes y(j) and the proof vector w(j); after the commit request / commit response, the client reuses the query vectors q1, q2, q3, …, the server returns the response scalars q1·w(j), q2·w(j), …, and the client’s checks output ACCEPT / REJECT.

14 The computational model moves from Boolean circuits to arithmetic circuits to arithmetic circuits with concise gates. Unfortunately, this computational model does not really handle fractions, comparisons, logical operations, etc.

15 Programs compile to constraints over a finite field (F_p). For example, dec-by-three.c, f(X) { Y = X − 3; return Y; }, compiles to the constraints 0 = Z − X, 0 = Z − 3 − Y. As an example, suppose X = 7. If Y = 4, the constraints become 0 = Z − 7, 0 = Z − 3 − 4, and there is a solution; if Y = 5, they become 0 = Z − 7, 0 = Z − 3 − 5, and there is no solution. The input/output pair is correct if and only if the constraints are satisfiable.
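Below is a minimal sketch (illustrative field size, not the compiler’s output format) of checking satisfiability for this example: the first constraint forces Z = X, so only the second constraint needs testing.

```python
p = 2**61 - 1   # illustrative prime; the real system fixes a specific large field

def dec_by_three_satisfiable(x, y):
    """Constraints: 0 = Z - X and 0 = Z - 3 - Y over F_p."""
    z = x % p                      # 0 = Z - X forces Z = X
    return (z - 3 - y) % p == 0    # 0 = Z - 3 - Y

print(dec_by_three_satisfiable(7, 4))   # True: (x=7, y=4) is a correct input/output pair
print(dec_by_three_satisfiable(7, 5))   # False: no assignment to Z satisfies both constraints
```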

16 Our compiler is derived from Fairplay [ MNPS SECURITY 04 ]; it turns the program into a list of assignments (SSA), with loops unrolled. We replaced the back-end (now it is constraints), and later the front-end (now it is C, inspired by [ Parno et al. OAKLAND 13 ]). How concise are constraints? Z3 ← (Z1 != Z2) becomes the two constraints 0 = (Z1 − Z2)·M − Z3, 0 = (1 − Z3)·(Z1 − Z2); “Z1 < Z2” takes about log |F_p| constraints.
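Here is a minimal sketch (illustrative, not the compiler’s code) of the inequality gadget above: the prover supplies the auxiliary variable M, which is the field inverse of Z1 − Z2 when the operands differ.

```python
p = 2**61 - 1   # illustrative prime

def prove_neq(z1, z2):
    """Prover's side: choose Z3 and the auxiliary variable M."""
    if z1 == z2:
        return 0, 0                        # Z3 = 0; any M satisfies the constraints
    diff = (z1 - z2) % p
    return 1, pow(diff, p - 2, p)          # Z3 = 1, M = (Z1 - Z2)^(-1) mod p

def check_neq(z1, z2, z3, m):
    """Both constraints must hold over F_p."""
    return ((z1 - z2) * m - z3) % p == 0 and ((1 - z3) * (z1 - z2)) % p == 0

z3, m = prove_neq(10, 7)
print(z3, check_neq(10, 7, z3, m))   # 1 True
z3, m = prove_neq(7, 7)
print(z3, check_neq(7, 7, z3, m))    # 0 True
```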

17 The conditional if (X1 < X2) Y ← 3 else Y ← 4 compiles to: M·{C<}, 0 = M·(Y − 3), (1 − M)·{C≥}, 0 = (1 − M)·(Y − 4), where C< is the constraint set 0 = B0·(1 − B0), 0 = B1·(2 − B1), …, 0 = B_{N−2}·(2^{N−2} − B_{N−2}), X1 − X2 = (p − 2^{N−1}) + (B0 + … + B_{N−2}). Recall that the constraints are over F_p. C< is satisfiable iff X1 − X2 is in the “negative” range [p − 2^{N−1}, p).
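A minimal sketch (illustrative prime and bit width) of the C< gadget: the prover decomposes (X1 − X2) − (p − 2^{N−1}) into scaled bits Bi ∈ {0, 2^i}, and the checks mirror the constraints above.

```python
p = 2**61 - 1   # illustrative prime
N = 32          # illustrative bit width of the compared values

def prove_lt(x1, x2):
    """Prover: witness that X1 - X2 lands in the 'negative' range [p - 2^(N-1), p)."""
    d = (x1 - x2 - (p - 2**(N - 1))) % p
    assert d < 2**(N - 1), "X1 - X2 is not in the negative range, so X1 < X2 is false"
    return [d & (1 << i) for i in range(N - 1)]    # B_i is either 0 or 2^i

def check_lt(x1, x2, bits):
    """Constraint checks for C<: each B_i in {0, 2^i}, and the sum matches X1 - X2."""
    scaled_ok = all(b * (2**i - b) % p == 0 for i, b in enumerate(bits))
    sum_ok = (x1 - x2) % p == ((p - 2**(N - 1)) + sum(bits)) % p
    return scaled_ok and sum_ok

bits = prove_lt(5, 9)
print(check_lt(5, 9, bits))   # True: 5 < 9
```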

18 More program excerpts and their constraints:
  Z1 ← Z2 becomes 0 = Z1 − Z2
  Z1 ← Z2 + Z3·Z4 becomes 0 = Z1 − Z2 − Z3·Z4
  Y ← (X1 != X2) becomes 0 = (X1 − X2)·M − Y, 0 = (1 − Y)·(X1 − X2)
  Y ← X1 OR X2 becomes 0 = (1 − X1)·(X2 − 1) + 1 − Y
  Y ← X1 NAND X2 becomes 0 = 1 − X1·X2 − Y
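As a sanity check, this minimal sketch (illustrative field) verifies exhaustively over {0, 1} inputs that the OR and NAND constraint templates above compute the intended boolean values.

```python
p = 2**61 - 1   # illustrative prime

def or_constraint_holds(x1, x2, y):
    return ((1 - x1) * (x2 - 1) + 1 - y) % p == 0

def nand_constraint_holds(x1, x2, y):
    return (1 - x1 * x2 - y) % p == 0

for x1 in (0, 1):
    for x2 in (0, 1):
        assert or_constraint_holds(x1, x2, x1 | x2)
        assert nand_constraint_holds(x1, x2, 1 - (x1 & x2))
print("OR and NAND constraint templates agree with boolean semantics")
```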

19 The proof vector now encodes the field-element assignment that satisfies the constraints (e.g., Z1 = 23, Z2 = 187, …, satisfying constraints such as 1 = (Z1 − Z2)·M, 0 = Z3 − Z4, 0 = Z3·Z5 + Z6 − 5), rather than a bit-level trace. The savings from the change are enormous.

20 (The protocol figure as before: the client sends “f” and x(j), the server commits to w(j), the client sends the query vectors q1, q2, q3, …, the server returns the response scalars q1·w(j), q2·w(j), …, and the client’s checks output ACCEPT / REJECT, now with the constraint-based proof vector.)

21 We (mostly) eliminate the server’s PCP-based overhead in the proof vector w. Before: the number of entries was quadratic in the computation size; after: the number of entries is linear in the computation size. Now, the server’s overhead is mainly in the cryptographic machinery and the constraint model itself. The client and server reap tremendous benefit from this change.

22 This resolves a conjecture of Ishai et al. [ IKO CCC 07 ]: any computation has a linear PCP whose proof vector is (quasi)linear in the computation size. (Also shown by [ BCIOP TCC 13 ].) Before, the proof was π(·) = ⟨·, (z, z ⊗ z)⟩ with |w| = |Z|^2, and the PCP verifier ran a linearity test, a quadratic-correction test, and a circuit test. After, using [ GGPR Eurocrypt 2013 ], the proof is π(·) = ⟨·, (z, h)⟩ with |w| = |Z| + |C|, and the verifier runs a new quadratic test. In both cases the protocol is commit request, commit response, queries q1, q2, …, qu, responses π(q1), …, π(qu).

23 (The protocol figure as before, now also incorporating the linear-size proof vector from the previous slide.)

24 We strengthen the linear commitment primitive of [ IKO CCC 07 ]. Before, the commitment was per query: the client sends Enc(ri), receives Enc(π(ri)), sends (qi, ti) with ti = ri + αi·qi, receives (π(qi), π(ti)), and checks π(ti) = π(ri) + αi·π(qi)?. After, one commitment covers all queries: the client sends Enc(r), receives Enc(π(r)), sends (q1, …, qu, t) with t = r + α1·q1 + … + αu·qu, receives (π(q1), …, π(qu), π(t)), and checks π(t) = π(r) + α1·π(q1) + … + αu·π(qu)?. The responses π(q1), …, π(qu) then feed the PCP verifier’s tests. This saves orders of magnitude in cryptographic costs.
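The algebra behind the consistency check can be seen in this minimal sketch (toy parameters, and without the homomorphic encryption of r that the real protocol requires): if the server answers every query with one fixed linear function π(q) = ⟨q, w⟩, then π(t) necessarily equals π(r) + Σ αi·π(qi).

```python
import random

p = 2**61 - 1                                   # illustrative prime
n = 8                                           # illustrative proof-vector length
w = [random.randrange(p) for _ in range(n)]     # the server's proof vector

def pi(q):
    return sum(qi * wi for qi, wi in zip(q, w)) % p

r = [random.randrange(p) for _ in range(n)]     # client's secret random vector
qs = [[random.randrange(p) for _ in range(n)] for _ in range(3)]
alphas = [random.randrange(p) for _ in qs]

# t = r + alpha_1*q_1 + ... + alpha_u*q_u, computed componentwise
t = [(ri + sum(a * q[i] for a, q in zip(alphas, qs))) % p for i, ri in enumerate(r)]

lhs = pi(t)
rhs = (pi(r) + sum(a * pi(q) for a, q in zip(alphas, qs))) % p
print(lhs == rhs)   # True whenever the server uses a single consistent linear function
```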

25 (The protocol figure as before, now also incorporating the strengthened commitment.)

26 Our implementation of the server is massively parallel; it is threaded, distributed, and accelerated with GPUs. Some details of our evaluation platform: it uses a cluster at the Texas Advanced Computing Center (TACC), and each machine runs Linux on an Intel Xeon 2.53 GHz with 48 GB of RAM.

27 Amortized costs for multiplication of 256×256 matrices:

                   under the theory, naively applied   under Zaatar
  client CPU time  >100 trillion years                 1.2 seconds
  server CPU time  >100 trillion years                 1 hour

However, this assumes a (fairly large) batch.

28 1. What are the cross-over points? 2. What is the server’s overhead versus native execution? 3. At the cross-over points, what is the server’s latency?

29 (Graph: verification cost in minutes of CPU time versus instances of 150×150 matrix multiplication; native execution has slope 50 ms/instance, Zaatar verification has slope 33 ms/instance.) The cross-over point is high but not totally ridiculous.

30 The server’s costs are unfortunately very high.

31 Cross-over points and costs for four benchmarks: matrix multiplication (m=150), Floyd-Warshall (m=25), root finding (m=256, L=8), and PAM clustering (m=20, d=128).

(1) If verification work is performed on a CPU:
              mat. mult.    Floyd-Warshall   root finding   PAM clustering
  cross-over  25,000 inst.  43,000 inst.     210 inst.      22,000 inst.
  client CPU  21 mins.      5.9 mins.        2.7 mins.      4.5 mins.
  server CPU  12 months     8.9 months       22 hours       4.2 months

(2) If we had free crypto hardware for verification:
              mat. mult.    Floyd-Warshall   root finding   PAM clustering
  cross-over  4,900 inst.   8,000 inst.      40 inst.       5,000 inst.
  client CPU  4 mins.       1.1 mins.        31 secs.       61 secs.
  server CPU  2 months      1.7 months       4.2 hours      29 days

32 (Graph: server speedup with 4, 20, and 60 cores, against the ideal 60-core speedup, for matrix multiplication (m=150), Floyd-Warshall (m=25), root finding (m=256, L=8), and PAM clustering (m=20, d=128).) Parallelizing the server results in near-linear speedup.

33 Zaatar is encouraging, but it has limitations: (1) the server’s burden is still too high; (2) the client requires batching to break even; (3) the computational model is stateless (and does not allow external inputs or outputs!).

34 (1) Zaatar: a PCP-based efficient argument [ NDSS 12, SECURITY 12, EUROSYS 13 ] (2) Pantry: extending verifiability to stateful computations [ SOSP 13 ] (3) Landscape and outlook

35 Pantry creates verifiability for real-world computations. Before: the client C sends F and x to the server S and gets back y; C supplies all inputs, F is pure (no side effects), and all outputs are shipped back. After: the computation can interact with state, for example a server with RAM or a database (C sends a query and a digest and gets back a result), or a MapReduce job (C sends map(), reduce(), and input filenames to servers Si and gets back output filenames).

36 (The Zaatar protocol figure again: for run j, the client sends “f” and x(j), the server commits to w(j), the client sends the query vectors, the server returns the response scalars q1·w(j), q2·w(j), …, and the client’s checks output ACCEPT / REJECT.)

37 The compiler pipeline decomposes into two phases. First, a program F() { [subset of C] } compiles to constraints (E), e.g., 0 = X + Z1, 0 = Y + Z2, 0 = Z1·Z3 − Z2, …. Second, the constraints compile, via the GGPR encoding, to an arithmetic circuit and the argument protocol between client and server. The client sends F and x, the server returns y, and the proof establishes that “E(X=x, Y=y) has a satisfying assignment”; by construction, “if E(X=x, Y=y) is satisfiable, the computation is done right.” Design question: what can we put in the constraints so that satisfiability implies correct storage interaction?

38 How can we represent storage operations? Straw man: variables M0, …, Msize contain the state of memory, and B = load(A) becomes B = M0 + (A − 0)·F0, B = M1 + (A − 1)·F1, B = M2 + (A − 2)·F2, …, B = Msize + (A − size)·Fsize. Representing “load(addr)” explicitly this way would be horrifically expensive: it requires two variables for every possible memory address!
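This minimal sketch (toy memory, illustrative field) shows why the straw-man constraints are satisfiable exactly when B holds the value at address A, and also why they are so costly: the prover must supply an (Mi, Fi) pair for every address.

```python
p = 2**61 - 1                 # illustrative prime
memory = [10, 20, 30, 40]     # M_0 ... M_size (toy memory contents)

def prove_load(addr):
    """Prover: B is the loaded value; for i != addr, F_i = (B - M_i) / (A - i) mod p."""
    b = memory[addr]
    f = []
    for i, m in enumerate(memory):
        if i == addr:
            f.append(0)       # the term (A - i) * F_i vanishes, pinning B to M_addr
        else:
            f.append(((b - m) * pow(addr - i, p - 2, p)) % p)
    return b, f

def check_load(addr, b, f):
    """Check B = M_i + (A - i) * F_i for every address i."""
    return all((b - m - (addr - i) * fi) % p == 0
               for i, (m, fi) in enumerate(zip(memory, f)))

b, f = prove_load(2)
print(b, check_load(2, b, f))   # 30 True; note the per-address variables M_i, F_i
```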

39 How can we represent storage operations? (Take 2.) Consider content hash blocks [ Merkle CRYPTO 87, Fu et al. OSDI 00, Mazières & Shasha PODC 02, Li et al. OSDI 04 ]: the client holds a digest, the server holds the block, and the check is hash(block) = digest?. They bind references to values, and they provide a substrate for verifiable RAM, file systems, …. Key idea: encode the hash checks in constraints; this can be done (reasonably) efficiently. Folklore: “this should be doable.” (Pantry’s contribution: “it is.”)

40 We augment the subset of C with the semantics of untrusted storage: block = vget(digest) retrieves a block that must hash to digest; hash(block) = vput(block) stores block and names it with its hash. For example, add_indirect(digest d, value x) { value z = vget(d); y = z + x; return y; } compiles to constraints including d = hash(Z) and y = Z + x, so the server is obliged to supply the “correct” Z (meaning something that hashes to d).
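Here is a minimal sketch (not Pantry’s implementation) of the vput/vget interface over an untrusted content-addressed store; SHA-256 stands in for the constraint-friendly collision-resistant hash the real system would use.

```python
import hashlib

class UntrustedStore:
    """Stands in for the server's storage; it could return anything."""
    def __init__(self):
        self.blocks = {}
    def put(self, digest, block):
        self.blocks[digest] = block
    def get(self, digest):
        return self.blocks.get(digest)

store = UntrustedStore()

def vput(block: bytes) -> str:
    """Store a block and name it by its hash; the caller keeps only the digest."""
    digest = hashlib.sha256(block).hexdigest()
    store.put(digest, block)
    return digest

def vget(digest: str) -> bytes:
    """Retrieve a block; the check below is what Pantry encodes in constraints."""
    block = store.get(digest)
    assert block is not None and hashlib.sha256(block).hexdigest() == digest
    return block

d = vput(b"42")
print(vget(d))   # b'42'; a tampered or substituted block would fail the hash check
```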

41 Putting the pieces together: programs written in C with RAM and search trees, or as map() and reduce(), compile to the subset of C + {vput, vget}, then to constraints (E), then to a QAP / arithmetic circuit; the hash checks follow [ Merkle 87 ]. The client sends F and x, and the server returns y. Recall that the proof establishes “I know a satisfying assignment to E(X=x, Y=y)”. A satisfying assignment implies that the checks-of-hashes pass; the checks-of-hashes passing implies that the storage interaction is correct; and storage abstractions can be built from {vput(), vget()}.

42 How can we represent storage operations? (Hint: consider content hash blocks: blocks named by a cryptographic hash, or digest, of their contents.) Srinath will tell you how.

43 The client is assured that a MapReduce job was performed correctly, without ever touching the data. The client sends map(), reduce(), and in_digests to the mappers and reducers, and receives out_digests. The two phases are handled separately. Each mapper Mi runs: in = vget(in_digest); out = map(in); for r = 1, …, R: d[r] = vput(out[r]). Each reducer Ri runs: for m = 1, …, M: in[m] = vget(e[m]); out = reduce(in); out_digest = vput(out).
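The structure is sketched below (a toy word-count job, reusing the illustrative vput/vget functions from the earlier sketch; the partitioning and reduce function are made up for the example): every phase touches data only through digests, so the client never handles the data itself.

```python
R = 2   # illustrative number of reducers

def run_mapper(in_digest):
    """in = vget(in_digest); out = map(in); d[r] = vput(out[r]) for each reducer r."""
    words = vget(in_digest).split()
    buckets = [[w for w in words if hash(w) % R == r] for r in range(R)]
    return [vput(b" ".join(bucket)) for bucket in buckets]

def run_reducer(digests):
    """in[m] = vget(e[m]); out = reduce(in); out_digest = vput(out)."""
    words = [w for d in digests for w in vget(d).split()]
    return vput(str(len(words)).encode())      # toy reduce(): count the words

in_digests = [vput(b"to verify or not to verify"), vput(b"trust but verify")]
mapper_out = [run_mapper(d) for d in in_digests]                      # M x R digests
out_digests = [run_reducer([row[r] for row in mapper_out]) for r in range(R)]
print([vget(d) for d in out_digests])                                 # per-reducer counts
```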


45 (Graph: CPU time in minutes for Pantry versus the baseline, against the number of nucleotides in the input dataset, in billions.) Example: for a DNA subsequence search, the client saves work relative to performing the computation locally. A mapper gets 600k nucleotides and outputs matching locations; there is one reducer per 10 mappers; the graph is an extrapolation.

46 Pantry applies fairly widely. Our implementation works with Zaatar and Pinocchio [ Parno et al. OAKLAND 13 ]. Our implemented applications include: privacy-preserving facial recognition, and verifiable queries in a (highly restricted) subset of SQL (the client sends a query and a digest to the server holding the DB and gets back a result).

47 (1) Zaatar: a PCP-based efficient argument [ NDSS 12, SECURITY 12, EUROSYS 13 ] (2) Pantry: extending verifiability to stateful computations [ SOSP 13 ] (3) Landscape and outlook

48 We describe the landscape in terms of our three goals. Approaches that give up being unconditional or general-purpose: replication [ Castro & Liskov TOCS 02 ], trusted hardware [ Chiesa & Tromer ICS 10, Sadeghi et al. TRUST 10 ], auditing [ Haeberlen et al. SOSP 07, Monrose et al. NDSS 99 ], and special-purpose protocols [ Freivalds MFCS 79, Golle & Mironov RSA 01, Sion VLDB 05, Michalakis et al. NSDI 07, Benabbas et al. CRYPTO 11, Boneh & Freeman EUROCRYPT 11 ]. Approaches that are unconditional and general-purpose but not geared toward practice: fully homomorphic encryption [ Gennaro et al., Chung et al. CRYPTO 10 ] and proof-based verifiable computation [ GMR 85, Ben-Or et al. STOC 88, BFLS 91, Kilian STOC 92, ALMSS 92, AS 92, GKR STOC 08, Ben-Sasson et al. STOC 13, Bitansky et al. STOC 13, Bitansky et al. ITCS 12 ].

49 Experimental results are now available from four projects: (1) Pepper, Ginger, Zaatar, Allspice, and Pantry [ HOTOS 11, NDSS 12, SECURITY 12, EUROSYS 13, OAKLAND 13, SOSP 13 ]; (2) CMT and Thaler [ CMT ITCS 12, Thaler et al. HOTCLOUD 12, Thaler CRYPTO 13 ]; (3) Pinocchio [ GGPR EUROCRYPT 13, Parno et al. OAKLAND 13 ]; (4) BCGTV [ BCGT ITCS 13, BCIOP TCC 13, BCGTV CRYPTO 13 ].

50 A key trade-off is performance versus expressiveness. The rows below order systems by setup cost; the parenthetical notes give the applicable computations (“regular” structure; straightline; pure, no RAM; stateful, RAM; general loops). Moving toward the more expressive end also brings better crypto properties (ZK, non-interactive, etc.); moving the other way brings lower cost and less crypto.

  none (fast prover):  Thaler [ CRYPTO 13 ] (“regular”)
  none:                CMT, TRMP [ ITCS, Hotcloud 12 ] (“regular”)
  low:                 Allspice [ Oakland 13 ] (straightline)
  medium:              Pepper [ NDSS 12 ], Ginger [ Security 12 ], Zaatar [ Eurosys 13 ] (pure, no RAM); Pantry [ SOSP 13 ] (stateful, RAM)
  high:                Pinocchio [ Oakland 13 ] (pure, no RAM); Pantry [ SOSP 13 ] (stateful, RAM)
  very high:           BCGTV [ CRYPTO 13 ] (stateful, RAM; general loops)

51 Quick performance comparison. Data are from our re-implementations and match or exceed published results. All experiments are run on the same machines (2.7 GHz, 32 GB RAM); we average 3 runs (experimental variation is minor). Benchmarks: 150×150 matrix multiplication and a clustering algorithm.

52 The cross-over points can sometimes improve, at the cost of expressiveness.

53 The server’s costs are high across the board.

54 Summary of performance in this area: none of the systems is at true practicality; the server’s costs are still a disaster (though there has been lots of progress); the client approaches practicality, at the cost of generality; otherwise, there are setup costs that must be amortized. (We focused on CPU; network costs are similar.)

55 Research questions: Can we design more efficient constraints or circuits? Can we apply cryptographic and complexity-theoretic machinery that does not require a setup cost? Can we provide comprehensive secrecy guarantees? Can we extend the machinery to handle multi-user databases (and a system of real scale)?

56 Summary and take-aways: We have reduced the costs of a PCP-based argument system [ Ishai et al. CCC 07 ] by 20 orders of magnitude. We broaden the computational model, handle stateful computations (MapReduce, etc.), and include a compiler. There is a lot of exciting activity in this research area, and it is a great research opportunity: there are still lots of problems (prover overhead, setup costs, the computational model), and the potential is large, going far beyond cloud computing.

57 Appendix Slides

58 PERFORMANCE COMPARISON

59 Experimental setup and ground rules: A system is included iff it has published experimental results. Data are from our re-implementations and match or exceed published results. All experiments are run on the same machines (2.7 GHz, 32 GB RAM); we average 3 runs (experimental variation is minor). For a few systems, we extrapolate from detailed microbenchmarks. Measured systems: general-purpose: IKO, Pepper, Ginger, Zaatar, Pinocchio; special-purpose: CMT, Pepper-tailored, Ginger-tailored, Allspice. Benchmarks: 150×150 matrix multiplication and a clustering algorithm (others in our papers).

60 (Graph: verification cost in ms of CPU time, log scale from 10^2 to 10^26, for 150×150 matrix multiplication, comparing Ishai et al.’s PCP-based efficient argument with Pepper, Ginger, Zaatar, and Pinocchio; reference lines at 50 ms and 5 ms.) Verification cost sometimes beats (unoptimized) native execution.

61 (Graph: cross-over points in instances of 150×150 matrix multiplication; one annotation reads 1.6 days.) Some of the general-purpose protocols have reasonable cross-over points.

62 (Bar chart: cross-over points for matrix multiplication (m=150) and PAM clustering (m=20, d=128) under Ginger, Zaatar, Pinocchio, Ginger-tailored, Allspice, and CMT; the values range from 71 instances up to 1.2 billion, and CMT is marked N/A for PAM clustering.) The cross-over points can sometimes improve with special-purpose protocols.

63 When Allspice is applicable, it has low cross-over points:

            mat. mult. (m=128)   poly. eval (m=512)   root finding (m=256, L=8)
  Zaatar    4400                 180                  210
  Allspice  47                   7                    15

But, of our benchmarks, CMT-improved does not apply to: PAM clustering, longest common subsequence, or Floyd-Warshall.

64 (Bar chart: server’s cost normalized to native C, log scale up to 10^11, for matrix multiplication (m=150) and PAM clustering (m=20, d=128), across Pepper, Ginger, Zaatar, Pinocchio, Pepper-tailored, Ginger-tailored, CMT, Allspice, and native C; one bar is N/A for PAM clustering.) The server’s costs are pretty much preposterous.

65 (The same chart with Ishai et al. (IKO) included and the scale extended to 10^23.) The server’s costs are pretty much preposterous.

