CAPTCHA Solving
Tianhui Cai, Period 3

CAPTCHAs
- Completely Automated Public Turing test to tell Computers and Humans Apart
- Decides whether the user is a human or a machine
- Prevents spam on registration pages
- Come in audio and visual forms
- Visual CAPTCHAs contain noise and distortions:
  - rotation
  - translation
  - scaling
  - noise
  - warping

Goal
- Solve a CAPTCHA automatically, i.e. pass as a human
- Read the image and figure out what it says
- This has been done before
- Show the weaknesses of visual CAPTCHAs

Procedure
- Acquire image (from the internet)
- Remove background clutter
- Segmentation (separating letters)
- Generate training/testing data set
- Letter identification (next section)

Procedure (cont'd)
- Train on image data
- Test
- Review / analyze

Implementation
- Java / Ruby
- Acquire images: captchas.net, which has a formula for getting the actual text of each image
- Remove background clutter: median filter, etc.
- Segmentation: flood fill
- Letter identification: neural network
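
The median filter step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes a grayscale image stored as an `int[row][col]` array, a 3x3 window, and clamped borders. The class name `MedianFilter` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: 3x3 median filter on a grayscale image (int[row][col]). */
final class MedianFilter {
    static int[][] apply(int[][] img) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[h][w];
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // Assumption: clamp at the borders so edge pixels reuse their nearest neighbor.
                        int yy = Math.min(h - 1, Math.max(0, y + dy));
                        int xx = Math.min(w - 1, Math.max(0, x + dx));
                        window[n++] = img[yy][xx];
                    }
                }
                Arrays.sort(window);
                out[y][x] = window[4]; // middle of the 9 sorted values
            }
        }
        return out;
    }
}
```

Isolated specks are replaced by the value of their surroundings, while larger strokes such as letters mostly survive, which is why this kind of filter suits background clutter made of small dots.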

First quarter
- Three-layer backpropagation neural network
- Neural networks are good for classification
- Often used for image recognition
- Artificial neurons convert input to output
- Backpropagation lets the neural network learn
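
A three-layer network trained by backpropagation might look roughly like the sketch below. The slides do not give the activation function, error measure, or update schedule, so sigmoid units, squared error, and one weight update per example are all assumptions here. `TinyNet` and its method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical sketch: input, hidden, and output layers with sigmoid units. */
final class TinyNet {
    final int nIn, nHid, nOut;
    final double[][] w1; // [nHid][nIn + 1], last column is the bias
    final double[][] w2; // [nOut][nHid + 1], last column is the bias
    final double[] hidden, output;

    TinyNet(int nIn, int nHid, int nOut, long seed) {
        this.nIn = nIn; this.nHid = nHid; this.nOut = nOut;
        Random rng = new Random(seed);
        w1 = new double[nHid][nIn + 1];
        w2 = new double[nOut][nHid + 1];
        for (double[] row : w1) for (int i = 0; i < row.length; i++) row[i] = rng.nextDouble() - 0.5;
        for (double[] row : w2) for (int i = 0; i < row.length; i++) row[i] = rng.nextDouble() - 0.5;
        hidden = new double[nHid];
        output = new double[nOut];
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    /** Each neuron converts its weighted inputs to an output between 0 and 1. */
    double[] forward(double[] x) {
        for (int j = 0; j < nHid; j++) {
            double z = w1[j][nIn];
            for (int i = 0; i < nIn; i++) z += w1[j][i] * x[i];
            hidden[j] = sigmoid(z);
        }
        for (int k = 0; k < nOut; k++) {
            double z = w2[k][nHid];
            for (int j = 0; j < nHid; j++) z += w2[k][j] * hidden[j];
            output[k] = sigmoid(z);
        }
        return output.clone();
    }

    /** One backpropagation step on one example; returns the squared error before the update. */
    double train(double[] x, double[] target, double rate) {
        forward(x);
        double err = 0;
        double[] dOut = new double[nOut];
        for (int k = 0; k < nOut; k++) {
            double e = target[k] - output[k];
            err += e * e;
            dOut[k] = e * output[k] * (1 - output[k]);
        }
        // Propagate the output deltas back to the hidden layer before changing w2.
        double[] dHid = new double[nHid];
        for (int j = 0; j < nHid; j++) {
            double s = 0;
            for (int k = 0; k < nOut; k++) s += dOut[k] * w2[k][j];
            dHid[j] = s * hidden[j] * (1 - hidden[j]);
        }
        for (int k = 0; k < nOut; k++) {
            for (int j = 0; j < nHid; j++) w2[k][j] += rate * dOut[k] * hidden[j];
            w2[k][nHid] += rate * dOut[k];
        }
        for (int j = 0; j < nHid; j++) {
            for (int i = 0; i < nIn; i++) w1[j][i] += rate * dHid[j] * x[i];
            w1[j][nIn] += rate * dHid[j];
        }
        return err;
    }
}
```

For letter identification, the input would be one value per pixel and the output one unit per possible character, with the largest output taken as the guess.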

Second quarter
- Image processing: Java ImageIO
- Noise removal
- Segmentation
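
Flood-fill segmentation can be illustrated with the sketch below. It assumes the image has already been thresholded to a `boolean[row][col]` ink mask and that letters do not touch, so each 4-connected blob is one letter. The class name `FloodFillSegmenter` and the bounding-box layout are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds one bounding box per 4-connected blob of ink. */
final class FloodFillSegmenter {
    /** Returns boxes as {minX, minY, maxX, maxY}, sorted left to right. */
    static List<int[]> segment(boolean[][] ink) {
        int h = ink.length, w = ink[0].length;
        boolean[][] seen = new boolean[h][w];
        List<int[]> boxes = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!ink[y][x] || seen[y][x]) continue;
                int[] box = {x, y, x, y};
                ArrayDeque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {x, y});
                seen[y][x] = true;
                // Explicit stack instead of recursion, to avoid deep call chains on large blobs.
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    box[0] = Math.min(box[0], p[0]);
                    box[1] = Math.min(box[1], p[1]);
                    box[2] = Math.max(box[2], p[0]);
                    box[3] = Math.max(box[3], p[1]);
                    int[][] next = {{p[0] + 1, p[1]}, {p[0] - 1, p[1]}, {p[0], p[1] + 1}, {p[0], p[1] - 1}};
                    for (int[] q : next) {
                        if (q[0] < 0 || q[1] < 0 || q[0] >= w || q[1] >= h) continue;
                        if (!ink[q[1]][q[0]] || seen[q[1]][q[0]]) continue;
                        seen[q[1]][q[0]] = true;
                        stack.push(q);
                    }
                }
                boxes.add(box);
            }
        }
        // Sort by left edge so the boxes come out in reading order.
        boxes.sort((a, b) -> Integer.compare(a[0], b[0]));
        return boxes;
    }
}
```

Leftover noise specks would also show up as blobs, so in practice very small boxes would probably need to be discarded before training.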

Third quarter
- Neural network: added save / load
  - saved to a text file
  - a neural network can be trained multiple times
- Downloaded the necessary images (Ruby)
  - from captchas.net
  - the filename is what the image says
- Analyzed the output images from the image processing
- Cropped and centered segmented letters
  - uniform letter size
  - centered around the bounding box
  - uniformity is good for training
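
Cropping and centering a segmented letter could be done along these lines. The sketch assumes a `boolean[row][col]` ink mask and a fixed canvas size chosen by the caller. The class name `LetterNormalizer` is hypothetical, and the slides do not say how oversized letters were handled, so this version simply rejects them.

```java
/** Hypothetical helper: crops a letter to its bounding box and centers it on a fixed canvas. */
final class LetterNormalizer {
    static boolean[][] cropAndCenter(boolean[][] ink, int canvasH, int canvasW) {
        int h = ink.length, w = ink[0].length;
        int minX = w, minY = h, maxX = -1, maxY = -1;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!ink[y][x]) continue;
                minX = Math.min(minX, x);
                maxX = Math.max(maxX, x);
                minY = Math.min(minY, y);
                maxY = Math.max(maxY, y);
            }
        }
        boolean[][] out = new boolean[canvasH][canvasW];
        if (maxX < 0) return out; // no ink at all: return a blank canvas
        int bw = maxX - minX + 1, bh = maxY - minY + 1;
        if (bw > canvasW || bh > canvasH) {
            throw new IllegalArgumentException("letter is larger than the canvas");
        }
        // Center the bounding box; any odd leftover pixel goes to the right and bottom.
        int offX = (canvasW - bw) / 2, offY = (canvasH - bh) / 2;
        for (int y = 0; y < bh; y++) {
            for (int x = 0; x < bw; x++) {
                out[offY + y][offX + x] = ink[minY + y][minX + x];
            }
        }
        return out;
    }
}
```

After this step, every letter has the same dimensions and sits in the same place, so the network sees the same pixel positions for the same strokes.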

Fourth quarter
- Train and test
- Problem: it didn't work
  - resized images to 11x9, and then it worked
- Gather data / analyze
  - learning rate
  - number of training iterations
- Best result: around 92% success rate (when training data and testing data are separate)
- Higher when testing data is part of the training data
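
One way to shrink a letter image to a small fixed grid is area averaging, sketched below. The slides do not say which resizing method was used or which of 11 and 9 is the width, so both the method and the parameter order are assumptions. `Downsampler` is a hypothetical name.

```java
/** Hypothetical helper: area-averages an ink mask down to a small grid of network inputs. */
final class Downsampler {
    /** Returns outH * outW values in [0, 1], row by row: the fraction of ink in each cell. */
    static double[] toFeatures(boolean[][] ink, int outH, int outW) {
        int h = ink.length, w = ink[0].length;
        double[] features = new double[outH * outW];
        for (int oy = 0; oy < outH; oy++) {
            int y0 = oy * h / outH;
            int y1 = Math.max(y0 + 1, (oy + 1) * h / outH); // each cell covers at least one row
            for (int ox = 0; ox < outW; ox++) {
                int x0 = ox * w / outW;
                int x1 = Math.max(x0 + 1, (ox + 1) * w / outW); // and at least one column
                int count = 0;
                for (int y = y0; y < y1; y++) {
                    for (int x = x0; x < x1; x++) {
                        if (ink[y][x]) count++;
                    }
                }
                features[oy * outW + ox] = (double) count / ((y1 - y0) * (x1 - x0));
            }
        }
        return features;
    }
}
```

An 11x9 grid gives the network 99 inputs per letter, far fewer weights to learn than a full-resolution image, which may be part of why training started to work after the resize.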

Future?
- Segment letters that stick together (flood fill won't work there)
  - using vertical divides
  - shape context
- Conquer more complicated distortions and noise
- Reduce blobbiness / better noise removal
- Neural network optimizations (structure, number of nodes)
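
The vertical-divide idea is listed as future work, so the following is only one possible reading of it, not something the project implemented: pick the column with the least ink near the middle of a merged blob and cut there. The class name `VerticalSplitter` and the "middle half" search range are assumptions.

```java
/** Hypothetical helper: chooses a column at which to cut a blob holding two touching letters. */
final class VerticalSplitter {
    /** Returns the column with the fewest ink pixels, searching only the middle half of the blob. */
    static int bestSplitColumn(boolean[][] blob) {
        int h = blob.length, w = blob[0].length;
        int lo = w / 4;                          // first candidate column
        int hi = Math.max(lo + 1, w - w / 4);    // one past the last candidate column
        int best = lo, bestCount = Integer.MAX_VALUE;
        for (int x = lo; x < hi; x++) {
            int count = 0;
            for (int y = 0; y < h; y++) {
                if (blob[y][x]) count++;
            }
            // Strict comparison keeps the leftmost column when several tie.
            if (count < bestCount) {
                bestCount = count;
                best = x;
            }
        }
        return best;
    }
}
```

A straight vertical cut would likely fail on rotated or warped letters, which is presumably where an approach such as shape context would come in.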