Stylometry Project May 4, 2007 Pace’s Research Day.

Presentation on theme: "Stylometry Project May 4, 2007 Pace’s Research Day."— Presentation transcript:

1 Stylometry Project May 4, 2007 Pace’s Research Day

2 TEAM MEMBERS
Rob Goodman, Programmer
–Currently working at KPMG
–Completing MS in Computer Science in December 2008
Matt Hahn, Quality Assurance
–Currently working at Affiliated Computer Services, Inc.
–Completing MS in Information Technologies in May 2007
Madhuri Marella, Programmer
–Completing MS in Computer Science in May 2007
Chris Ojar, Team Leader
–Currently working at Pace’s Evening Support Office in Pleasantville
–Completing MS in Internet Technologies in May 2007

3 WHAT IS STYLOMETRY?
The study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship
Used to attribute authorship to anonymous or disputed documents; it has legal as well as academic and literary applications
Uses statistical analysis, pattern recognition, and artificial intelligence techniques
For features, stylometry typically analyzes word frequencies and identifies patterns in common parts of speech

4 THE PROGRAM
A pattern recognition system to identify the author of arbitrary email using stylometry features
Phase 1 – Data Collection
–Raw data from Keystroke Biometric Project
–Plain text emails
Phase 2 – Feature Extraction
–Measurements of punctuation, content format, and keystrokes [when applicable]
–Normalize features to 0-1 range
Phase 3 – Classification
–k-Nearest-Neighbor using Euclidean distance
–Defaulted to 10

5 RAW DATA EXAMPLES
File Name: Goodman-email.txt
Dear Ms. Sanderson: I enjoyed our conversation on February 18th at the Family and Child Development seminar on teaching young children and appreciated your personal input about helping children attend school for the first time. This letter is to follow-up about the Fourth Grade Teacher position as discussed at the seminar. I will be completing my Bachelor of Science Degree in Family and Child Development with a concentration in Early Childhood Education at Pace in May of 2007, and will be available for employment at that time…
File Name: Sandy-biometrics.txt

6 DIRTY DATA EXAMPLE I'm on my second take and I'm still writing about the same book : " A Million Little Pieces. " I'm not sure if I am supposed to be typing the same this ng I typed on submit ssion #1 as I am on sb ubmission #2, but since my sister is skiing in Vermont, I'll just continued. In any event, as a soon as I found out the book was not true, I couldn't pick it up for a few days. Then, it got the best of me. It is tu a fact that James Frey is a great ri writer. He holds your interest and attention a so I go t b past the fact the at he lied, and continued on. I have to say I endj joyed the book a lot better as a non-fiction book than I did as a fiction novel.

7 CLEAN DATA EXAMPLE I'm on my second take and I'm still writing about the same book: "A Million Little Pieces." I'm not sure if I am supposed to be typing the same thing I typed on submission #1 as I am on submission #2, but since my sister is skiing in Vermont, I'll just continue. In any event, as soon as I found out the book was not true, I couldn't pick it up for a few days. Then, it got the best of me. It is a fact that James Frey is a great writer. He holds your interest and attention so I got past the fact that he lied, and continued on. I have to say I enjoyed the book a lot better as a non-fiction book than I did as a fiction novel.

8 THE PROGRAM
A pattern recognition system to identify the author of arbitrary email using stylometry features
Phase 1 – Data Collection
–Raw data from Keystroke Biometric Project
–Plain text emails
Phase 2 – Feature Extraction
–Measurements of punctuation, content format, and keystrokes [when applicable]
–Normalize features to 0-1 range
Phase 3 – Classification
–k-Nearest-Neighbor using Euclidean distance
–Defaulted to 10

9 LIST OF 62 FEATURES MEASURED
1. Number of sentences beginning with lower case
2. Number of White spaces
3. Number of exclamation points
4. Number of Number signs
5. Number of Dollar signs
6. Number of percent signs
7. Number of Ampersands
8. Number of Single quotes
9. Number of Left parentheses
10. Number of Right parentheses
11. Number of Asterisks
12. Number of Plus signs
13. Number of Commas
14. Number of Dashes
15. Number of Periods
16. Number of Forward slashes
17. Number of Colons
18. Number of Semi-colons
19. Number of Less than signs
20. Number of Equal signs
21. Number of Greater than signs
22. Number of Question marks
23. Number of multiple question marks
24. Number of multiple exclamation marks
25. Number of ellipsis
26. Number of At signs
27. Number of Left square brackets
28. Number of Back slashes
29. Number of Right square brackets
30. Number of Caret signs
31. Number of Underscores
32. Number of Accents
33. Number of Left curly braces
34. Number of Right curly braces
35. Number of Vertical lines
36. Number of Tildes
37. Number of Windows keys
38. Number of Up keys
39. Number of Left Shift keys
40. Number of Right Shift keys
41. Number of Page Down keys
42. Number of Insert keys
43. Number of Home keys
44. Number of End keys
45. Number of Down keys
46. Number of Ctrl keys
47. Number of Context menu keys
48. Number of Caps Lock keys
49. Number of Alt keys
50. Number of F12 keys
51. Number of Right keys
52. Number of Backspace keys
53. Number of Enter keys
54. Number of Delete keys
55. Number of Tab keys
56. Number of words
57. Number of sentences
58. Average words per sentence
59. Number of paragraphs
60. Average words per paragraph
61. Average word length
62. Number of sentences beginning with upper case
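A few of the punctuation and content-format features in the list above could be computed from plain text roughly as follows. This is only a sketch and not the project's code: the feature names, the `PUNCTUATION` table, the `extract_features` function, and the rules for splitting sentences and paragraphs are all assumptions made for illustration. The keystroke features (Backspace, Shift, and so on) need the raw keystroke log and are left out.

```python
import re

# Hypothetical subset of the 62 features; the names are illustrative only.
PUNCTUATION = {
    "exclamation_points": "!",
    "commas": ",",
    "periods": ".",
    "colons": ":",
    "semicolons": ";",
    "question_marks": "?",
    "at_signs": "@",
}


def extract_features(text: str) -> dict[str, float]:
    """Count a few punctuation and content-format features in plain text."""
    features = {name: float(text.count(ch)) for name, ch in PUNCTUATION.items()}
    features["white_spaces"] = float(sum(ch.isspace() for ch in text))

    # Assumption: a word is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Assumption: a sentence ends at a run of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]

    features["words"] = float(len(words))
    features["sentences"] = float(len(sentences))
    features["paragraphs"] = float(len(paragraphs))
    features["avg_words_per_sentence"] = (
        len(words) / len(sentences) if sentences else 0.0
    )
    features["avg_word_length"] = (
        sum(len(w) for w in words) / len(words) if words else 0.0
    )
    return features
```

Calling `extract_features(open("Goodman-email.txt").read())` on one of the plain-text emails would give one feature dictionary per document; the values are raw counts and averages, so they still need rescaling before any distance is computed.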

10 THE PROGRAM
A pattern recognition system to identify the author of arbitrary email using stylometry features
Phase 1 – Data Collection
–Raw data from Keystroke Biometric Project
–Plain text emails
Phase 2 – Feature Extraction
–Measurements of punctuation, content format, and keystrokes [when applicable]
–Normalize features to 0-1 range
Phase 3 – Classification
–k-Nearest-Neighbor using Euclidean distance
–Defaulted to 10
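The slides do not say how features are normalized to the 0-1 range. One common choice is min-max scaling, where each feature column is shifted and divided by its own range across the data set. The sketch below assumes that choice; `min_max_normalize` is a hypothetical helper, not a function from the project.

```python
def min_max_normalize(rows: list[list[float]]) -> list[list[float]]:
    """Rescale each feature column to [0, 1] across all samples."""
    if not rows:
        return []
    columns = list(zip(*rows))  # transpose: one tuple per feature
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # Assumption: a constant column (high == low) maps to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized
```

Without this step, a feature with large raw values (number of white spaces, say) would dominate the Euclidean distance over a feature with small values (number of semi-colons). With min-max scaling, a test case should be rescaled with the same minimums and maximums as the data set it is compared against.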

11 k-NEAREST NEIGHBOR USING EUCLIDEAN DISTANCE
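The slide's figure did not survive in the transcript. As a rough illustration of the technique it names: the Euclidean distance between two feature vectors a and b is the square root of the sum of (a_i - b_i)^2 over all features, and a k-nearest-neighbor classifier assigns the test case to the author who appears most often among the k closest samples. The sketch below assumes that "Defaulted to 10" refers to k; the function names and the `(vector, author)` data layout are hypothetical.

```python
import math
from collections import Counter


def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(samples, test_vector, k=10):
    """Return the most common author among the k nearest samples.

    samples: list of (feature_vector, author) pairs, already normalized.
    k: assumed default of 10, read from the slides' "Defaulted to 10".
    """
    if not samples:
        raise ValueError("samples must not be empty")
    # Sort all samples by distance to the test case, nearest first.
    ranked = sorted(samples, key=lambda s: euclidean(s[0], test_vector))
    votes = Counter(author for _, author in ranked[:k])
    # On a tie, most_common keeps first-encountered order, so the tied
    # author with the nearest sample wins.
    return votes.most_common(1)[0][0]
```

For example, `knn_classify(samples, test_vector, k=3)` looks at the three closest samples only. If the data set has fewer than k samples, the slice simply uses all of them.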

12 CLASSIFICATION PHASE

13 DESIGN MODEL
[Flowchart; box labels in extracted order, connections not recoverable]
–START
–READ RAW DATA
–Single Raw Data File?
–Email Reconstructed, Dirty File and Feature Stats Generated
–Emails Reconstructed, Dirty Files and Feature Stats Generated in One File with File Name Saved as Batch.year-month-day and military time
–Run Compare
–END
–Do You Accept the Program’s Result?
–Save Test Case to Data Set?
–Select & Convert Base Data Files to…
–READ TEST CASE
–…DATA SET FILE
–Base Data Files of Email reconstructed, Dirty File and Feature Stats Generated with File Name Saved with the Extension of “- Clean Original.”
–Enter Author of Test Case
–Compare to Test Case?
–Branch labels: Yes / No

14 ANALYSIS MODEL
[Diagram; box labels in extracted order, connections not recoverable]
–Feature Extraction
–Feature Statistics
–Normalized Feature Statistics
–K Nearest Neighbor Classifier
–K Nearest Neighbor Identification
–START
–READ RAW DATA
–TEST CASE

15 PROJECT HOME PAGE http://utopia.csis.pace.edu/cs615/2006-2007/team2/

16 QUESTIONS Contact cojar@pace.edu or ctappert@pace.edu for more information, or visit http://utopia.csis.pace.edu/cs615/2006-2007/team2

