Published by Allie Pruiett. Modified over 3 years ago.

1
Automatic Authorship Identification
Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis

2
Acknowledgements
Support
– U.S. National Science Foundation Knowledge Discovery and Dissemination Program
Disclaimer
– The views expressed in this talk are those of the authors, and not of any other individuals or organizations.

3
The Authorship Problem
Given:
– A piece of text with unknown author
– A list of possible authors
– A sample of their writing
Problem:
– Can we automatically determine which person wrote the text?
Approach:
– Use style markers to identify the author

5
Motivation and Applications
Forensics
– Unabomber
Arts
– Shakespeare

8
Motivation and Applications
History
– Federalist Papers: 85 total, 12 disputed
E-mail

14
Motivation and Applications
Counter-Terrorism
– Osama Bin Laden

16
Previous Work: Mosteller and Wallace (1984)
Function Words
Upon              Also           An
By                Of             On
There             This           To
Although          Both           Enough
While             Whilst         Always
Though            Commonly       Consequently
Considerable(ly)  According      Apt
Direction         Innovation(s)  Language
Vigor(ous)        Kind           Matter(s)
Particularly      Probability    Work(s)
w_k = number of times word k appears in the text
T = (w_1, w_2, …, w_30)
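The feature vector T can be sketched in Python as below. This is only an illustration, not Mosteller and Wallace's own procedure: the tokenizer, the name `function_word_vector`, and the choice to count variants such as "considerably" or "works" under their base entry are assumptions made for the example.

```python
import re
from collections import Counter

# The 30 marker words from the slide. Each tuple lists the surface forms
# counted under one entry (folding variants together is an assumption).
MARKER_FORMS = [
    ("upon",), ("also",), ("an",),
    ("by",), ("of",), ("on",),
    ("there",), ("this",), ("to",),
    ("although",), ("both",), ("enough",),
    ("while",), ("whilst",), ("always",),
    ("though",), ("commonly",), ("consequently",),
    ("considerable", "considerably"), ("according",), ("apt",),
    ("direction",), ("innovation", "innovations"), ("language",),
    ("vigor", "vigorous"), ("kind",), ("matter", "matters"),
    ("particularly",), ("probability",), ("work", "works"),
]


def function_word_vector(text):
    """Return T = (w_1, ..., w_30), the count of each marker word in text."""
    # Lowercase and keep alphabetic runs only; a deliberately simple tokenizer.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return tuple(sum(counts[form] for form in forms) for forms in MARKER_FORMS)


# Made-up sample sentence, for illustration only.
sample = "Upon reflection, there is enough work to do, and the works of others matter."
T = function_word_vector(sample)
print(T)
```

Each position in T lines up with one entry of the word list, so vectors computed from different texts can be compared entry by entry.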

19
Previous Work: Mosteller and Wallace (1984)
Bayesian Inference
Odds(1, 2 | x) = (p_1 / p_2) [f_1(x) / f_2(x)]
Final odds = (initial odds) × (likelihood ratio)
Here p_i is the initial (prior) probability of author i and f_i(x) is the likelihood of the observed data x under author i.
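A small worked example of the odds formula, as a sketch under stated assumptions: the Poisson likelihood, the two rates, and the observed count are all hypothetical values chosen for illustration, not figures from the Federalist study.

```python
import math


def poisson_pmf(count, rate):
    """Probability of seeing `count` occurrences when `rate` are expected."""
    return math.exp(-rate) * rate ** count / math.factorial(count)


def posterior_odds(prior_1, prior_2, likelihood_1, likelihood_2):
    """Odds(1, 2 | x) = (p_1 / p_2) * (f_1(x) / f_2(x))."""
    return (prior_1 / prior_2) * (likelihood_1 / likelihood_2)


# Hypothetical numbers: expected occurrences of one marker word in a text
# of fixed length under author 1 and under author 2.
rate_1, rate_2 = 3.0, 0.5
observed = 4  # times the word appears in the text of unknown authorship

# Equal priors give initial odds of 1, so the final odds equal the
# likelihood ratio.
odds = posterior_odds(0.5, 0.5,
                      poisson_pmf(observed, rate_1),
                      poisson_pmf(observed, rate_2))
print(f"odds in favor of author 1: {odds:.1f} to 1")
```

With several marker words treated as independent, the likelihood ratios would multiply, which is how a handful of modest per-word ratios can add up to strong final odds.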

21
Previous Work: Mosteller and Wallace (1984)
Experiment
– Use 18 Hamilton and 14 Madison papers to gather information
– Test: known Hamilton papers, disputed papers
Results
– Strong odds in favor of Hamilton for other known Hamilton papers
– Strong odds in favor of Madison for all disputed papers

24
Previous Work: Corney (2003)
Analyzed email data to determine:
– minimum message length
– minimum number of messages needed to model an author's style
– which stylometric features can be used to determine authorship

25
Previous Work: Corney (2003)
Stylometric features
– Proportion of white-space
– Punctuation patterns
– Function word frequencies
– Frequency of 2-grams
– Email-specific features: greetings, signatures, HTML tags
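Two of these features are easy to sketch in Python. The function names are made up for this example, and treating 2-grams as pairs of adjacent characters is an assumption; the slide does not say whether character or word 2-grams were meant.

```python
from collections import Counter


def whitespace_proportion(text):
    """Fraction of the characters in text that are white-space."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def char_bigram_frequencies(text):
    """Relative frequency of each pair of adjacent characters in text."""
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(bigrams.values())
    if total == 0:
        return {}
    return {gram: n / total for gram, n in bigrams.items()}


# Made-up message, for illustration only.
message = "Hi Bob,\nsee you at noon."
print(whitespace_proportion(message))
print(char_bigram_frequencies(message))
```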

26
Previous Work: Corney (2003)
Conclusions:
– Authorship attribution can be successfully performed
– 200-250 words is enough
– 20 data points is enough for training
– Best feature: function words
– Not so great: 2-grams

27
Our Work: Trials with the Federalist Papers
Wrote scripts in Perl and Python to compute
– Sentence length frequencies
– Word length frequencies
– Ratios of 3-letter words to 2-letter words
Analyzed our data with graphing and statistics software.
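The word-length features might be computed along these lines. This is a minimal sketch, not the authors' scripts: the function names and the rule that a word is a run of ASCII letters are assumptions.

```python
import re
from collections import Counter


def word_length_counts(text):
    """Map word length -> number of words of that length in text."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(word) for word in words)


def three_to_two_ratio(text):
    """Ratio of 3-letter words to 2-letter words, or None if undefined."""
    counts = word_length_counts(text)
    if counts[2] == 0:
        return None  # no 2-letter words, so the ratio has no value
    return counts[3] / counts[2]


# Made-up sample text, for illustration only.
sample = "It is to be the law of the land"
print(dict(word_length_counts(sample)))
print(three_to_two_ratio(sample))
```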

28
Sentence Length Frequencies
Step 1: Parsing the text
– What constitutes a sentence?
“Mrs. Jones has been working on her Ph.D. for 8.5 years.”
“I said no.”
“Take the no. 7 bus downtown.”
“What are you talking about ?!?!?!?!!”
“Sometimes….I just feel…anxious.”
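To see why the question is hard, here is a deliberately naive splitter in Python. The rule (break wherever '.', '!' or '?' is followed by white-space) and the name `naive_sentences` are assumptions made for this illustration; it is a sketch of the problem, not the parser used in this work.

```python
import re


def naive_sentences(text):
    """Split text wherever '.', '!' or '?' is followed by white-space."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


examples = [
    "Mrs. Jones has been working on her Ph.D. for 8.5 years.",
    "I said no.",
    "Take the no. 7 bus downtown.",
    "What are you talking about ?!?!?!?!!",
    "Sometimes….I just feel…anxious.",
]

for example in examples:
    pieces = naive_sentences(example)
    print(len(pieces), pieces)
```

Under this rule the abbreviations "Mrs.", "Ph.D." and "no." are each taken for a sentence end, so the first and third examples are cut into several pieces even though each is a single sentence.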

29
Sentence Length Frequencies
Step 2: Obtain sentence length data
  i     M     H
  1     1     0
  2     0     0
  3     0     0
  4     1     0
  5     9     2
  6     6     6
  7    14     7
  8    22     6
  9    16    14
 10    19    21
 11    15    20
  …     …     …
 30    26    21
 31    28    16
 32    26    28
  …     …     …
173     0     1
201     1     0
i - sentence length
M - number of length-i sentences in known Madison papers (1139 sentences)
H - number of length-i sentences in known Hamilton papers (1142 sentences)
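A table of this shape can be tallied as in the sketch below. It assumes sentence length is measured in white-space-separated words, which the slide does not state, and the two short sentence lists are hypothetical stand-ins for the parsed Madison and Hamilton papers.

```python
from collections import Counter


def sentence_length_counts(sentences):
    """Map sentence length in words -> number of sentences of that length."""
    return Counter(len(sentence.split()) for sentence in sentences)


# Hypothetical stand-ins for the sentences of each author's known papers.
madison_sentences = [
    "The powers are few and defined.",
    "They will be exercised principally on external objects.",
]
hamilton_sentences = [
    "The result is plain.",
    "Energy in the executive is a leading character.",
]

M = sentence_length_counts(madison_sentences)
H = sentence_length_counts(hamilton_sentences)

# One row per sentence length i that occurs for either author.
print("i", "M", "H")
for i in sorted(set(M) | set(H)):
    print(i, M[i], H[i])
```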

30
Sentence Length Frequencies
Step 3: Graph the data

32
Sentence Length Distributions
Step 4: Does the data show a difference between Madison and Hamilton?
– View sentence lengths as sample data taken from two distributions
– Apply the Kolmogorov-Smirnov test

33
Kolmogorov-Smirnov Test
Input:
– Two vectors of data values, taken from a continuous distribution.
Method:
– Examines maximal vertical distance between empirical cumulative distribution curves
Output:
– p-value
Example frequencies:
 A    B
 1    4
 4    6
 3    2
 8    7
 5    1
Cumulative sums:
 A    B
 1    4
 5   10
 8   12
16   19
21   20
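The method step can be sketched in Python using the A and B frequencies from this slide. The function name `ks_statistic` is made up for the example, and the sketch computes only the distance D; turning D into a p-value needs the Kolmogorov distribution, which a library routine such as `scipy.stats.ks_2samp` handles when given the raw samples rather than binned counts.

```python
from itertools import accumulate


def ks_statistic(freq_a, freq_b):
    """Largest vertical gap between two empirical CDFs.

    freq_a[i] and freq_b[i] are the counts observed in the same i-th bin.
    """
    total_a, total_b = sum(freq_a), sum(freq_b)
    cdf_a = [running / total_a for running in accumulate(freq_a)]
    cdf_b = [running / total_b for running in accumulate(freq_b)]
    return max(abs(a - b) for a, b in zip(cdf_a, cdf_b))


A = [1, 4, 3, 8, 5]  # frequencies from the slide's example
B = [4, 6, 2, 7, 1]

print(list(accumulate(A)))  # running totals of A
print(list(accumulate(B)))  # running totals of B

D = ks_statistic(A, B)
print(f"D = {D:.4f}")
```

Dividing each running total by its column total (21 for A, 20 for B) puts the two curves on the same 0-to-1 scale, which is what makes the vertical gap comparable between samples of different sizes.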

34
Kolmogorov-Smirnov Test
Results of step 4:
– p-value for sentence length frequency data is… 0.5121
Not too helpful…but there is hope!
– Try more features
– Try different features

38
Future Work
– Examine email data
– Build our own authorship-identification tool
– Test new stylometric features for distinguishing ability
