Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis
Acknowledgements Support –U.S. National Science Foundation DIMACS REU 2004 Knowledge Discovery and Dissemination Program Disclaimer –The views expressed in this talk are those of the authors, and not of any other individuals or organizations.
Outline I. Recap II. New Federalist Paper Results III. New Data Results IV. Conclusions and Future Work
The Authorship Problem Given: –A piece of text with unknown author –A list of possible authors –A sample of their writing Problem: –Can we automatically determine which person wrote the text? Approach: –Use style markers to identify the author
The Federalist Papers 85 Total 12 Disputed
Previous Work: Mosteller and Wallace (1964) Function Words: Upon, Also, An, By, Of, On, There, This, To, Although, Both, Enough, While, Whilst, Always, Though, Commonly, Consequently, Considerable(ly), According, Apt, Direction, Innovation(s), Language, Vigor(ous), Kind, Matter(s), Particularly, Probability, Work(s)
Our Previous Work: Trials with the Federalist Papers Wrote scripts in Perl and Python to compute –Sentence length frequencies –Word length frequencies –Ratios of 3-letter words to 2-letter words Analyzed our data with graphing and statistics software.
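The three style markers listed above can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used for the talk: the function names, the letters-only tokenization, and the rule of splitting sentences on ".", "!" and "?" are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, so "don't" counts as "don" and "t".
WORD = re.compile(r"[A-Za-z]+")

def word_length_counts(text):
    """Count how many words of each length appear in the text."""
    return Counter(len(w) for w in WORD.findall(text))

def sentence_length_counts(text):
    """Count sentences by their length in words (sentences end at ., ! or ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return Counter(len(WORD.findall(s)) for s in sentences)

def ratio_3_to_2(text):
    """Ratio of 3-letter words to 2-letter words, or None if there are no 2-letter words."""
    counts = word_length_counts(text)
    if counts[2] == 0:
        return None
    return counts[3] / counts[2]

# Made-up sample text, only to show the calls.
sample = "It is to be the end. We can not say so."
print(word_length_counts(sample))
print(sentence_length_counts(sample))
print(ratio_3_to_2(sample))
```

Dividing each count by the total number of words or sentences turns the counts into frequencies that can be compared across papers of different lengths.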
Previous Conclusions Not too helpful…but there is hope! –Try more features –Try different features
Feature Selection Which features work best? One way to rank features: –Make a 2×2 contingency table for each feature F (columns: Madison, Hamilton; row F: a, b; row Not F: c, d) –Compute abs ( log ( ad / bc ) ), the absolute log odds ratio –Rank the features by this value
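The ranking step can be sketched as follows. The slide does not say exactly what the cells count, so this example assumes a and b are the numbers of Madison and Hamilton papers in which feature F occurs, and c and d the numbers in which it does not. The `smoothing` parameter is an addition for the sketch (it keeps the logarithm finite when a cell is zero), and all counts below are invented for illustration.

```python
import math

def abs_log_odds_ratio(a, b, c, d, smoothing=0.5):
    """Return |log(ad / bc)| for a 2x2 table with cells a, b (row F) and c, d (row Not F).

    Assumption: `smoothing` is added to every cell so that a zero count
    does not produce log(0) or a division by zero.
    """
    a, b, c, d = (v + smoothing for v in (a, b, c, d))
    return abs(math.log((a * d) / (b * c)))

def rank_features(tables, smoothing=0.5):
    """Order feature names from most to least discriminating."""
    scored = [
        (abs_log_odds_ratio(*cells, smoothing=smoothing), name)
        for name, cells in tables.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical counts (a, b, c, d), not taken from the Federalist Papers.
tables = {
    "upon": (1, 9, 9, 1),
    "of": (5, 5, 5, 5),
    "there": (3, 7, 7, 3),
}
print(rank_features(tables))
```

A value near zero means the feature is about equally common for both authors; a large value means it strongly favors one of them.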
49 Ranked Features
Linear Discriminant Analysis A technique for classifying data Available in R (the lda function in the MASS package) Input: –Table of training data –Table of test data Output: –Classification of test data
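Linear Discriminant Analysis: example Input training data: a table with columns upon, 2-letter, 3-letter and one row per paper of known author (5 labeled M, 5 labeled H; the numeric values are not preserved here) Input test data: a table with the same columns upon, 2-letter, 3-letter Output: m m m m h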
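To make the input and output of LDA concrete, here is a small pure-Python sketch of Fisher's two-class linear discriminant with a pooled covariance matrix and equal priors. It is not R's lda function (which by default weights classes by their training proportions), and the feature values are invented for illustration; only the shape of the data (a training table with labels, a test table without) follows the slides.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def pooled_covariance(groups, means):
    """Covariance matrix pooled over all classes."""
    dim = len(means[0])
    cov = [[0.0] * dim for _ in range(dim)]
    for rows, mu in zip(groups, means):
        for row in rows:
            diff = [x - m for x, m in zip(row, mu)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += diff[i] * diff[j]
    denom = sum(len(g) for g in groups) - len(groups)
    return [[v / denom for v in r] for r in cov]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular.
    """
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def train_lda(madison_rows, hamilton_rows):
    """Return (weights, threshold) for the two-class discriminant."""
    mu_m, mu_h = mean_vector(madison_rows), mean_vector(hamilton_rows)
    cov = pooled_covariance([madison_rows, hamilton_rows], [mu_m, mu_h])
    w = solve(cov, [m - h for m, h in zip(mu_m, mu_h)])
    midpoint = [(m + h) / 2 for m, h in zip(mu_m, mu_h)]
    return w, sum(wi * xi for wi, xi in zip(w, midpoint))

def classify(model, row):
    """Label a test row 'm' (Madison) or 'h' (Hamilton)."""
    w, threshold = model
    score = sum(wi * xi for wi, xi in zip(w, row))
    return "m" if score > threshold else "h"

# Invented training data: [rate of "upon" per 1000 words, fraction of 2-letter words].
madison = [[0.2, 0.20], [0.1, 0.22], [0.3, 0.21], [0.0, 0.19]]
hamilton = [[3.0, 0.18], [3.4, 0.20], [2.8, 0.19], [3.2, 0.17]]
model = train_lda(madison, hamilton)
print([classify(model, row) for row in [[0.1, 0.20], [3.1, 0.19]]])
```

The output is one lowercase label per test row, in the same style as the "m m m m h" strings shown on the slides.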
Some more LDA results 12 to Madison: –upon, 1-letter, 2-letter –upon, enough, there –upon, there 11 to Madison: –upon, 2-letter, 3-letter Fewer than 6 to Madison: –2-letter, 3-letter –there, 1-letter, 2-letter
Some more LDA results
Class   Output of lda              Features tested
12 M    m m m m m m m m m m m m    upon, apt
12 M    m m m m m m m m m m m m    to, upon
11 M    m m m m m m h m m m m m    on, there
11 M    h m m m m m m m m m m m    an, by
10 M    m m m m m m h m m m h m    particularly, probability
 8 M    m m m m m m h h h m h m    also, of
 8 M    m m m h m m h h m m h m    always, of
 7 M    h m m h m h h m h m m m    of, work
 6 M    m m h m m m h h m h h h    there, language
 5 M    m h m h h m h h h m m h    consequently, direction
Feature Selection Part II Which combinations of features are best for LDA? Are the features independent? We did some random sampling: –Choose features a, b, c, d –Compute x = log a + log b + log c + log d –Compute y = log (a+b+c+d) –Plot x versus y
Selecting more features What happens when more than 4 features are used for the lda? Greedy approach –Add features one at a time from two lists –Perform lda on all features chosen so far Is overfitting a problem?
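The greedy loop can be sketched generically. The slide does not say how features are drawn from the two lists, so this example assumes they are taken alternately. `evaluate` is a hypothetical callback standing in for "run LDA on the features chosen so far and report the result"; any function that accepts a tuple of feature names will work.

```python
from itertools import chain, zip_longest

def interleave(list_a, list_b):
    """Alternate between two feature lists, skipping names already seen.

    Assumption: alternating order; the original procedure may have differed.
    """
    seen = set()
    order = []
    for name in chain.from_iterable(zip_longest(list_a, list_b)):
        if name is not None and name not in seen:
            seen.add(name)
            order.append(name)
    return order

def greedy_runs(list_a, list_b, evaluate):
    """Add one feature at a time and evaluate the cumulative feature set.

    Returns a list of (feature_added, result) pairs, one per iteration.
    """
    chosen = []
    history = []
    for name in interleave(list_a, list_b):
        chosen.append(name)
        history.append((name, evaluate(tuple(chosen))))
    return history

# Usage with a stand-in evaluator that just reports how many features are in use.
word_lengths = ["2-letter words", "1-letter words"]
function_words = ["upon", "there", "enough"]
for feature, result in greedy_runs(word_lengths, function_words, len):
    print(feature, result)
```

Recording the result after every addition makes it easy to see whether accuracy on held-out papers stops improving, or gets worse, as features accumulate, which is the overfitting question raised above.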
First few greedy iterations
Class        Output of lda              Feature added
 6 M  6 H    h m h h m h m m h m h m    2-letter words
12 M  0 H    m m m m m m m m m m m m    upon
12 M  0 H    m m m m m m m m m m m m    1-letter words
12 M  0 H    m m m m m m m m m m m m    5-letter words
11 M  1 H    m m m m m h m m m m m m    4-letter words
12 M  0 H    m m m m m m m m m m m m    there
12 M  0 H    m m m m m m m m m m m m    enough
11 M  1 H    m m m m m m h m m m m m    whilst
12 M  0 H    m m m m m m m m m m m m    3-letter words
11 M  1 H    m m m m m m h m m m m m    15-letter words
Listserv Data 70 Listserv archives Over 1 million messages Data was gathered by Andrei Anghelescu –
Our Data One Listserv, “CINEMA-L” 992 authors, messages We look at 3 authors –sstone 1077 messages –thea –jmiles_2 1481
Frustration
Feature Selection How do we find “good” features?
More Frustration
A Measure of Variance
Summary of LDA Results Ran LDA using “I”, “is”, and “think” Trained on 80%, tested on 20% Correctly classified 122/186 documents
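The bookkeeping behind this summary can be sketched as below. The helper names, the case-sensitive matching of the marker words, the shuffled 80/20 split, and the fixed seed are all assumptions for the example; the slide does not say how the messages were tokenized or partitioned.

```python
import random
import re

def marker_rates(message, markers=("I", "is", "think")):
    """Relative frequency of each marker word in one message.

    Assumption: matching is case-sensitive, so "I" is not confused with "i".
    """
    words = re.findall(r"[A-Za-z']+", message)
    total = len(words) or 1  # avoid dividing by zero on empty messages
    return [sum(w == m for w in words) / total for m in markers]

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(marker_rates("I think it is fine"))
print(f"{122 / 186:.1%}")  # the reported result as a percentage: 65.6%
```

With three candidate authors, the reported 122/186 works out to roughly two test documents in three being attributed correctly.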
Future Work Finish our 3-author experiment Use more and different features –Structural – specific features Analyze the relationships among features Other authorship identification problems –Many authors –Odd-man-out
Thanks!!!