
1 A Highly Legible CAPTCHA that Resists Segmentation Attacks
Henry S. Baird, Michael A. Moll, Sui-Yu Wang
Pattern Recognition Research Lab (D. Lopresti & H. S. Baird)

2 Some Typical CAPTCHAs
AltaVista, eBay/PayPal, Yahoo!, PARC’s PessimalPrint

3 All These Are Vulnerable to Segment-then-Recognize Attack
Effective strategy of attack:
- Segment the image into characters
- Apply aggressive OCR to the isolated characters
- If it’s known (or guessed) that the word is ‘spellable’ (e.g. legal English), use the lexicon to constrain interpretations
Patrice Simard (MS Research) et al. report that this breaks many widely used CAPTCHAs

4 We try to generate word-images that will be hard to segment into characters
- Slice characters up: vertical cuts, then horizontal cuts
- Set the size of cuts to be constant within a word
- Choose the positions of cuts randomly
- Force pieces to drift apart: ‘scatter’ horizontally & vertically
- Change the intercharacter space
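The cut-and-scatter steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `scatter_glyph`, the pixel-set representation, and the parameter names are hypothetical, and the expansion fraction and intercharacter spacing are not modeled.

```python
import random

def scatter_glyph(pixels, cut_size, h_mean, v_mean, sigma, rng):
    """Cut one glyph into rectangular blocks and displace each block.

    pixels         : set of (x, y) ink coordinates for one character (assumed format)
    cut_size       : block side length, held constant within a word
    h_mean, v_mean : mean horizontal / vertical displacement magnitude
    sigma          : standard deviation of the displacement magnitude
    rng            : a random.Random instance
    """
    # Random phase so the cut positions differ from glyph to glyph.
    x0 = rng.randrange(cut_size)
    y0 = rng.randrange(cut_size)

    # Group pixels by the block they fall in (vertical cuts, then horizontal cuts).
    blocks = {}
    for x, y in pixels:
        key = ((x + x0) // cut_size, (y + y0) // cut_size)
        blocks.setdefault(key, []).append((x, y))

    # Move every block as a rigid piece by its own random offset.
    scattered = set()
    for key in sorted(blocks):
        dx = round(rng.choice((-1, 1)) * rng.gauss(h_mean, sigma))
        dy = round(rng.choice((-1, 1)) * rng.gauss(v_mean, sigma))
        for x, y in blocks[key]:
            scattered.add((x + dx, y + dy))
    return scattered

# Usage: a solid 6x8 rectangle stands in for a rendered character.
glyph = {(x, y) for x in range(6) for y in range(8)}
fragments = scatter_glyph(glyph, 3, 2.0, 1.0, 0.5, random.Random(1))
```

Because displaced blocks may land on top of one another, the output can contain fewer distinct pixels than the input, which loosely mirrors the way fragments interpenetrate.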

5 Character fragments can interpenetrate
Not only is it hard to segment the word into characters, it can also be hard to recombine characters’ fragments into characters

7 Nonsense Words
- We use nonsense (but English-like) words (as in BaffleText): generated pseudorandomly by a stochastic variable-length character n-gram model trained on the Brown corpus. This protects against lexicon-driven attacks.
- Why not use random strings? We want to help human readers feel confident they have made a plausible choice, so they’ll put up with severe image degradations (cf. research in the psychophysics of reading).
M. Chew & H. S. Baird, “BaffleText: a Human Interactive Proof,” Proc., 10th SPIE/IS&T Document Recognition and Retrieval Conf. (DRR2003), Santa Clara, CA, January 2003.
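To make the idea of n-gram word generation concrete, here is a minimal sketch. It deliberately simplifies: it uses a fixed-order character n-gram model rather than the variable-length model the slide describes, and it trains on a tiny made-up word list rather than the Brown corpus. The function names and the `^` / `$` padding symbols are assumptions for illustration.

```python
import random
from collections import defaultdict

def train_char_ngrams(words, order=2):
    """Map each `order`-character context to the characters seen after it."""
    model = defaultdict(list)
    for word in words:
        padded = "^" * order + word + "$"  # ^ = start padding, $ = end marker
        for i in range(order, len(padded)):
            model[padded[i - order:i]].append(padded[i])
    return model

def generate_word(model, order, rng, max_len=10):
    """Sample characters one at a time until the end marker or max_len."""
    context, out = "^" * order, []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = (context + nxt)[-order:]
    return "".join(out)

# Usage with a toy training list (hypothetical; a real model needs a large corpus).
toy_words = ["campaign", "nation", "would", "over", "bother", "acquire"]
model = train_char_ngrams(toy_words, order=2)
rng = random.Random(0)
samples = [generate_word(model, 2, rng) for _ in range(5)]
```

Storing every observed successor in a list means `rng.choice` samples in proportion to training frequency, so common letter sequences are reproduced more often and the output tends to look pronounceable.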

8 How Well Can People Read These?
We carried out a human legibility trial with the help of ~60 volunteers: students, faculty, & staff at Lehigh Univ. plus colleagues at Avaya Labs Research

9 Subjects were told whether they got a challenge right or wrong only after they had rated its ‘difficulty’

10 Subjective difficulty ratings were correlated with objective difficulty
- People often know when they’ve done well
- This can be used to ensure that challenges aren’t too hard (frustrating, angering)
[Table: number of challenges and percent answered correctly, by subjective difficulty level from 1 (Easy) to 5 (Impossible) and for all levels combined; the values are not preserved in this transcript]
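The kind of tabulation this slide describes (percent answered correctly at each subjective difficulty level) could be computed along these lines. The function name and record format are assumptions, and the sample records are made up purely to exercise the code; they are not the trial's data.

```python
from collections import defaultdict

def percent_correct_by_rating(trials):
    """trials: iterable of (rating, correct), rating in 1..5, correct a bool."""
    totals = defaultdict(int)
    right = defaultdict(int)
    for rating, correct in trials:
        totals[rating] += 1
        right[rating] += int(correct)
    # Percent answered correctly for each rating that actually occurred.
    return {r: 100.0 * right[r] / totals[r] for r in sorted(totals)}

# Made-up records for illustration only: (subjective rating, answered correctly?)
toy_trials = [(1, True), (1, True), (3, True), (3, False), (5, False)]
summary = percent_correct_by_rating(toy_trials)
```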

11 The same data, graphically
[Figure legend: Right / Wrong; difficulty scale from 1 (Easy) to 5 (Impossible)]

12 People Rated These “Easy” (1/5)
aferatic, memmari, heiwho, nampaign

13 Rated “Medium Hard” (3/5)
overch / ovorch, wouwould, atlager / adager, weland / wejund

14 Rated “Impossible” (5/5)
acchown / echaeva, gualing / gealthas, bothere / beadave, caquired / engaberse

15 Why is ScatterType legible?
- Does it surprise you that this is legible?
- I speculate that we can read it because we exploit typeface consistency: the evidence is small details of local shape, and this ability seems largely unconscious

16 Ensuring that ScatterType is Legible
We mapped the domain of legibility as a function of engineering choices:
- typefaces
- characters in the alphabet
- cutting & scattering parameters: cut fraction, expansion fraction, horizontal scatter mean, vertical scatter mean, h & v scatter variance, character separation

17 Some typefaces remain legible while others degrade quickly

18 Raising Legibility by Pruning Typefaces

19 Some Characters Quickly Become Confusable
Example: “overch”, with ‘o’ / ‘e’ / ‘c’ confusions

20 Raising Accuracy by Omitting Characters

21 Ensuring Legibility
- Pruning characters & typefaces raised legibility in the top two difficulty levels to ~90%
- Next step: restrict the range of cutting & scatter parameters

22 Mean Horizontal Scatter vs Mean Vertical Scatter
[Figure legend: Right / Wrong; difficulty scale from 1 (Easy) to 5 (Impossible)]
Mirage: data analysis tool, Tin Kam Ho, Bell Labs.

23 Cut Fraction Histogram
[Figure legend: Right / Wrong; difficulty scale from 1 (Easy) to 5 (Impossible)]

24 Character Separation Histogram
[Figure legend: Right / Wrong; difficulty scale from 1 (Easy) to 5 (Impossible)]

25 Finding Parameter Ranges for High Legibility
d = Euclidean distance from the origin in the plane of Mean Horizontal Scatter vs Mean Vertical Scatter
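As a small worked illustration of the distance d defined above, the following sketch computes it from the two scatter means. The function and argument names are assumptions, and the values in the usage line are arbitrary.

```python
import math

def scatter_distance(mean_h_scatter, mean_v_scatter):
    """Euclidean distance from the origin in the (horizontal, vertical) scatter plane."""
    return math.hypot(mean_h_scatter, mean_v_scatter)

# Usage: arbitrary example values, not taken from the trial.
d = scatter_distance(3.0, 4.0)
```

Collapsing the two scatter means into a single d gives one number against which legibility can be compared.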

26 Guided by this Analysis, We Can Define Legibility Regimes
- Trivial: large cut fraction and small expansion
- Simple: character separation also decreases
- Easy: in the original trial, read correctly 81% of the time
- Medium Hard: larger scatter distances degrade legibility noticeably

27 Other Examples - “Easy”
- “wexped”: difficult to segment ‘e’, ‘x’ and ‘p’. Shows the difficulty of achieving 100% legibility.
- “veral”: same parameters as above but a different font. Not as difficult to segment.

28 Other Examples - “Too Hard”
- “thern”: difficult to read, but easier than most with the same parameter values. Font makes a big difference.
- “wezre”: satisfactorily illegible, though probably segmentable.

29 Next Steps
- By judicious restrictions on engineering parameters, attempt to ensure human legibility better than 99.5%
- Similarly, attempt to ensure 90% of challenges have low subjective difficulty ratings (e.g. 1-3 out of 5)
- You are welcome to try out ScatterType: arcturus.cse.lehigh.edu/CAPTCHAs
- Also, we invite you to attack it: we’ll send you large batches, with ground truth. Try to train a classifier to break it!

30 Future Work
- We have exhausted the experimental data from the 1st trial
- How can we automatically create images with a given difficulty?
- We have generated many images that seem difficult to segment automatically, but we don’t understand how to guarantee this
- We need to understand the effects of typefaces on ScatterType legibility
- We want to study character-confusion pairs more
- Attacking ScatterType: test on the best OCR systems; invite attacks from other researchers. Is it credible if we attack it ourselves, and fail?

31 Contacts
Henry S. Baird, Michael Moll

