Presentation is loading. Please wait.

Presentation is loading. Please wait.

Fighting Spam May Be Easier Than You Think Cynthia Dwork Microsoft Research SVC.

Similar presentations

Presentation on theme: "Fighting Spam May Be Easier Than You Think Cynthia Dwork Microsoft Research SVC."— Presentation transcript:

1 Fighting Spam May Be Easier Than You Think Cynthia Dwork Microsoft Research SVC

2 2 Why? Huge problem Industry: costs in worker attention, infrastructure Individuals: increased ISP fees Hotmail: huge storage costs, (was?) 65-85% FTC: fraud, confidence crimes Ruining e-mail, devaluing the Internet

3 3 Principal Techniques Filtering Everyone: text-based Brightmail: decoys; rules updates Microsoft Research: (seeded) trainable filters SpamCloud: collaborative filtering SpamCop, Osirusoft, etc: IP addresses, proxies, … Make Sender Pay Computation [Dwork-Naor92; Back97] Human Attention [Naor96, SRC patent] Money [Gates96] and Bonds [Vanquish]

4 4 Outline The Computational Approach (puzzles) Overview; economic considerations Burrows suggestion: memory-bound functions Social and Technical issues Turing Tests Architectures Point-to-Point & Amanuensis Centralized Ticket Server Deeper Discussion of Memory-Bound Functions 3 Approaches and Small-Space Cryptanalyses New Proposal (joint with A. Goldberg and M. Naor)

5 5 Computational Approach If I dont know you: Prove you spent ten seconds CPU time, just for me, and just for this message User Experience: Automatically and in the background Checking proof extremely easy All unsolicited mail treated equally

6 6 Economics (80,000 s/day) / (10s/message) = 8,000 msgs/day Hotmails billion daily spams: 125,000 CPUs Up front capital cost just for HM: circa $150,000,000 The spammers cant afford it.

7 7 Who Are the Spammers? "Most of the spammers are not wealthy people," said Stephen Kline, a lawyer for the New York State attorney general's office.

8 8

9 9 Who Are the Spammers? Spamhaus ROKSO 100/90% Circa 300 people total; very top few spammers make a few million/year (F. Krueger, SMN; also, see the recent articles about Alan Ralsky) Comparison: FastClick, with 30% of popunder market, has profit of $2 mil/yr (income of $4 mil/yr) Biggest publicly traded company: eUniverse ($6 mil profit); list of 10 8 good names Hit ratio: 10 -6 to 10 -5

10 10 Cryptographic Puzzles Hard to compute; f(S,R,t,nonce) cant be amortized lots of work for the sender Easy to check z = f(S,R,t,nonce) little work for receiver Parameterized to scale with Moore's Law easy to exponentially increase computational cost, while barely increasing checking cost Can be based on (carefully) weakened signature schemes, hash collisions Can arrange a shortcut for post office

11 11 Burrows Suggestion CPU speeds vary widely across machines, but memory latencies vary much less (10+ vs 4) So: design a puzzle leading to a large number of cache misses

12 12 Social Issues Who chooses f? One global f? Who sets the price? Autonomously chosen fs? How is f distributed (ultimately)? Global f built into all mail clients? (1-pass) Directory? Query-Response? (3-pass)

13 13 Technical Issues Distribution Lists (!) Awkward Introductory Period Old versions of mail programs; bounces Very Slow/Small-Memory Machines Can implement post office (CPU), but: Who gets to be the Post Office? Trust? Cache Thrashing (memory-bound) The Subverters

14 14 Turing Tests: Payment in Kind [N96;SRC] CAPTCHAs (Completely Automated Public T. test for telling Computers and Humans Apart) Defeat automated account generation 5-10% drop in subscription rate Teams of conjectured-humans (8-hour shifts) Yes: Distorted images of simple word No: ``Find 3 words in image with multiple overlapping images of words'' Others: subject classification, audio M. Blum: people have done preprocessing

15 15 Social and Technical Issues Social (especially in enterprise setting) ADA, S.508 (blind, dyslexic) Not ``professional'' Productivity cost: context switch (may be mitigated in architecture supporting pre-computation) Wrong direction: we should offload crap onto machines Technical No theory; if broken, cant scale. Idrive, AltaVista, broken [J] EZ Gimpy (Yahoo!) broken [M+]

16 16 Turing Test Economic Issues Human time in poor country has roughly same cost as computer time (even if 8/5 wk) Also need computer, but it can be slow, cheap 10s same ballpark as some Turing tests Suppose were wrong about cost (a few hundredths of a cent), and really the price should be 1 cent 1-cent challenge costs 2 minutes. No one not being paid would dream of submitting to this.

17 17 Cycle Stealing Stealing cycles from humans: Pornography companies require random users to solve a CAPTCHA before seeing the next image [vABL] Worse for computational challenges Economics: lots of currency means currency is devalued. Prices go up. There are lots of cycles, but anyone can buy them. Can go into business brokering cycles.

18 18 Point-to-Point Architecture (Ideal Message Flow) Single-pass send-and-forget Can augment with amanuensis to handle slow machines Can add post office / pricing authority to handle money payments Time mostly used as nonce for avoiding replays (cache tags, discard duplicates; time controls size of cache) Sender client S Sender client S Recipient client R Recipient client R m, f(S,R,t,nonce)

19 19 Here to There (and There ) Three e-mail messages Rs mail client caches m, S, R, t, nonce Bounce (viral marketing opportunity) html attachment with Java Script for f contains parameters for f (S, R, t,nonce) clicking on link causes computation, sending e-mail (optional) link for download of client software Ignorant Sender S Ignorant Sender S Spam-Protected Recipient R Spam-Protected Recipient R m bounce release m

20 20 Point-to-Point: Issues Social Senders trust code (transition period)? (Who gets to be a pricing authority?) Technical No pre-computation possible (slow machines) Function update requires new download Senders browser must be configured for sending e-mail (transition period)

21 21 Remark: Web-Based Mail Computation done by client (applet or JavaScript) Verification and safelist checking done by server

22 22 (Ideal Message Flow) Ticket kit = (#, puzzle) Ticket = (#, response) Any payment method Tickets may be accumulated in advance (pre-computation). Useful if price is high. Refunds (not shown) Centralization eases updates Ticket Server [ABBW] Sender Recipient Server Ticket Server Get Ticket Kit HTTP MSG + Ticket SMTP HTTP Ticket OK? 1,2 4,5 3

23 23 Social and Technical Issues Social Who gets to be a ticket server? Trust? Privacy? Federation? Trust code (transition period) Technical Complex; 5 flows (7 with refund) T may be on different continent than S, R Target of choice for subverters

24 24 Memory-Bound Functions [ABMW02] Devilish model! assume cache size (of spammers big machine) is half the size of memory (of weakest sending machine) must avoid exploitation of locality, cache lines (factors of 64 or even 16 matter) watch out for low-space crypto attacks

25 25 Meet in Middle (eg, using DES) Input: x, y partially specified K 1, K 2 Output: K 1, K 2 S i = {keys consistent with K i } Create table T with all E K1 (x), (so |T| = |S 1 |) 8 K 2 2 S 2 : compute z=(E K2 ) -1 (y) check if z 2 T Optimism: need lots of space to store T; wont fit in cache; hard to generate the zs in useful order DES x y K1K1 K2K2

26 26 Analysis of Meet-in-the-Middle Tradeoff: cache size c < |S 1 | yields time proportional to |S 1 |¢|S 2 |/c with no cache misses Sequentiality attacks [AB]: Once T is prepared, cache-sized blocks can be loaded sequentially. Luck: challenge may be resolved in cache. Example: if c = |S 1 |/4 then there is a ¼ chance of resolution with no further cache misses once the cache has been loaded with the first block of T.

27 27 Subset Sum Input: Random 2k-bit numbers a 1, …, a 2k, T Output: subset of a i s summing to T mod 2 2k Optimism: LLL lattice basis reduction dynamic programming: simultaneously O(2 k ) time and space Analysis: killed by Schroeppel and Shamir, Simultaneous T=O(2 k ), S=O(2 k/2 )

28 28 Easy Functions [ABMW02] (simplified construction) f: n bits to n bits, easy Given x k 2 range(f (k) ), find a pre-image with certain properties Hope: best solved by building table for f -1 and working back from x k Choose n=22 so f -1 fits in small memory, but not in cache Optimism: x k is root of tree of expected size k 2

29 29 Analysis of Easy Functions Vulnerable to T/S tradeoffs for inverting functions [Hellman; Fiat-Naor] TS 2 = O(N 2 )cpu f Cost of inverting f without memory accesses is only O(4) times cost of computing f (S=N/2); no cache misses! As clock speeds increase, must increase cpu f

30 30 Interpretation : Easy-Function challenges are cpu-bound, but their structure (the cpu-avoiding, table-lookup method) dampens the effect of Moores Law. Verifiers costs increase with Moores Law k ¢ cpu f cycles

31 31 Our Approach The sender makes a sequence of random memory accesses and sends a proof of having done so to the receiver, who needs a small number of accesses for verification. The memory access pattern leads to many cache misses, making the computation memory-bound. How to generate random-looking access pattern? What is an easily verifiable proof?

32 32 High Level Description Shared array of random numbers T A path is a sequence of inherently sequential accesses to T Sender must search a set S of paths to find a desired path Process is message/recipient dependent Sender sends a proof of finding desired path to receiver, who must verify Ratio of sender/receiver work is S

33 33 Incompressibility and RC4 Big (2 22 ) Fixed-Forever Truly Random T Model walk on prg of RC4 stream cipher Intuition: Use the pr bit stream (seeded by message) to choose entries of T to access. These, in turn, affect the stream (defending against pre- processing to order T so as to exploit locality).

34 34 RC4 (Partial Description) Initialize A to pseudo- random permutation of {1, 2, …, 256} using a seed K Initialize Indices: i = 0; j = 0 PR generation loop: i = i + 1 j = j + A[i] Swap (A[i], A[j]); Output z = A[A[i] + A[j]]

35 35 DGN: Initializing A and Basic Step Fixed-Forever truly random A of size 256; 32-bit words Given m and trial number i, compute strong cryptographic 256-word mask M. Set A = M © A. c = A mod 2 22 Initialize Indices: d=0; e=0 Walk: d = d + 1 e = e + A[d] A[d] = A[d] © T[c] Swap (A[d], A[e]) c = T[c] © A[A[e] + A[d]]

36 36 Side by Side Generate pseudo- random permutation of {1, 2, …, 256} using a seed K Fixed-Forever truly random A of size 256; 32-bit words Given m and trial number i, compute strong cryptographic 256-word mask M. Set A = M © A. c = A mod 2 22

37 37 Side by Side Initialize Indices: i = 0; j = 0 PR generation loop: i = i + 1 j = j + A[i] Swap (A[i], A[j]) Output z = A[A[i]+A[j]] Initialize Indices: d=0; e=0 Walk: d = d + 1 e = e + A[d] A[d] = A[d] © T[c] Swap (A[d], A[e]) c = T[c] © A[A[e] + A[d]]

38 38 Full DGN E: factor by which computation cost exceeds verification L: length of walk Trial i succeeds if hash of A after ith walk ends in 1+log 2 E zeroes (expected number of trials is E). Verification: trace the walk (L memory accesses)

39 39 Parallel Attack [ABMW] Process many paths in parallel; do memory accesses in sorted order to take advantage of cache lines Say attack succeeds if the number of occupied cache lines is ¾ the number of parallel paths (ie, reduce #lines by 1/3) Probabilistic analysis shows attack needs over 2 17 paths, but only 2 14 different As fit into 16MB cache

40 40 Parameters Want ELt=10s, where L = path length t = memory latency = 0.2 microseconds Reasonable choices: E = 24,000, L = 2048

41 41 Intuition for Why it Works Fight tradeoffs using incompressible T Fight preprocessing of T by using message+trial dependent A in computing walk Fight preprocessing of A by modifying using T Fight locality: choice of E and trial-dependence of A Good statistics for walks by analogy with statistics for heavily-studied RC4 (?)

42 42 Lower Bound for Abstract Version General formal model Abstraction of our algorithm using idealized hash functions Proof that amortized number of cache misses is linear in desired number

43 43 Summary Discussed computational approach, Turing tests, economics, cycle-stealing Briefly mentioned two architectures Examined difficulties of constructing memory-bound pricing functions and proposed a new one designed to avoid these difficulties Lower bounds for idealization.

44 44 Future Work and Open Questions Stanford undergrads just completed simple Outlook proxy and Unix filters; web- based implementation in progress; next steps: handle bounces, add signatures Distribution Lists Good MB functions with short descriptions (will subset product work)?

Download ppt "Fighting Spam May Be Easier Than You Think Cynthia Dwork Microsoft Research SVC."

Similar presentations

Ads by Google