Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis.

Presentation transcript:

Automatic Authorship Identification (Part II) Diana Michalek, Ross T. Sowell, Paul Kantor, Alex Genkin, David Madigan, Fred Roberts, and David D. Lewis

Acknowledgements
Support
–U.S. National Science Foundation DIMACS REU 2004 Knowledge Discovery and Dissemination Program
Disclaimer
–The views expressed in this talk are those of the authors, and not of any other individuals or organizations.

Outline
I. Recap
II. New Federalist Paper Results
III. New Data Results
IV. Conclusions and Future Work

The Authorship Problem
Given:
–A piece of text with unknown author
–A list of possible authors
–A sample of their writing
Problem:
–Can we automatically determine which person wrote the text?

The Authorship Problem
Given:
–A piece of text
–A list of possible authors
–A sample of their writing
Problem:
–Can we automatically determine which person wrote the text?
Approach:
–Use style markers to identify the author

The Federalist Papers
85 Total
12 Disputed

Previous Work: Mosteller and Wallace (1964)
Function Words
Upon              Also            An
By                Of              On
There             This            To
Although          Both            Enough
While             Whilst          Always
Though            Commonly        Consequently
Considerable(ly)  According       Apt
Direction         Innovation(s)   Language
Vigor(ous)        Kind            Matter(s)
Particularly      Probability     Work(s)

Our Previous Work: Trials with the Federalist Papers
Wrote scripts in Perl and Python to compute
–Sentence length frequencies
–Word length frequencies
–Ratios of 3-letter words to 2-letter words
Analyzed our data with graphing and statistics software.
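The three style markers listed above could be computed roughly as follows. This is only a sketch, not the authors' scripts: the tokenization rules (words are runs of ASCII letters, sentences end at ".", "!" or "?") and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")  # assumed definition of a "word"

def word_length_frequencies(text):
    """Relative frequency of each word length in the text."""
    counts = Counter(len(w) for w in WORD.findall(text))
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

def sentence_length_frequencies(text):
    """Relative frequency of each sentence length, measured in words."""
    # Assumed sentence boundary: any run of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if WORD.search(s)]
    counts = Counter(len(WORD.findall(s)) for s in sentences)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

def three_to_two_letter_ratio(text):
    """Number of 3-letter words divided by number of 2-letter words."""
    lengths = Counter(len(w) for w in WORD.findall(text))
    if lengths[2] == 0:
        return float("nan")  # ratio is undefined without 2-letter words
    return lengths[3] / lengths[2]

if __name__ == "__main__":
    sample = "It is an old law. We do not see it."
    print(word_length_frequencies(sample))
    print(sentence_length_frequencies(sample))
    print(three_to_two_letter_ratio(sample))
```

Each function returns one feature (or one family of features) per document, so a paper can be summarized as a row of numbers for later analysis.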

Previous Conclusions
Not too helpful…but there is hope!
–Try more features
–Try different features


Feature Selection
Which features work best?
One way to rank features:
–Make a contingency table for each feature F
–Compute abs ( log ( ad / bc ) )
–Rank the log values

         Madison   Hamilton
F        a         b
Not F    c         d
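The ranking step can be sketched in a few lines of Python. Several things here are assumptions rather than details from the slides: the cells a, b, c, d are taken to be counts of papers (Madison with F, Hamilton with F, Madison without F, Hamilton without F), a smoothing constant of 0.5 is added to every cell so that an empty cell does not cause a division by zero, and the example counts are invented.

```python
import math

def log_odds_score(a, b, c, d, smoothing=0.5):
    """abs(log(ad / bc)) for one 2x2 contingency table.

    The smoothing term is an assumption added here to keep the
    ratio finite when a cell is zero; it is not from the slides.
    """
    a, b, c, d = (x + smoothing for x in (a, b, c, d))
    return abs(math.log((a * d) / (b * c)))

def rank_features(tables):
    """Return feature names sorted from most to least discriminating.

    tables maps a feature name to its (a, b, c, d) counts.
    """
    return sorted(tables, key=lambda f: log_odds_score(*tables[f]), reverse=True)

if __name__ == "__main__":
    # Invented counts, for illustration only.
    example = {
        "feature_1": (1, 9, 9, 1),
        "feature_2": (5, 5, 5, 5),
    }
    for name in rank_features(example):
        print(name, round(log_odds_score(*example[name]), 3))
```

A score near zero means the feature occurs about equally often for both authors; a large score means it occurs mostly for one of them, in either direction, which is why the absolute value is taken.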

49 Ranked Features

Linear Discriminant Analysis
A technique for classifying data
Available in the R statistics package
Input:
–Table of training data
–Table of test data
Output:
–Classification of test data
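To make the input and output concrete, here is a minimal two-class linear discriminant written in plain Python. It is a stand-in for illustration, not the R routine used in the talk: it assumes equal prior probabilities for both authors, a non-singular pooled covariance matrix, and invented feature values. All function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def pooled_covariance(groups):
    """Pooled within-class covariance matrix over all groups."""
    dim = len(groups[0][0])
    cov = [[0.0] * dim for _ in range(dim)]
    dof = 0
    for rows in groups:
        mu = mean_vector(rows)
        for row in rows:
            diff = [x - m for x, m in zip(row, mu)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += diff[i] * diff[j]
        dof += len(rows) - 1
    return [[v / dof for v in row] for row in cov]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def train_lda(rows_m, rows_h):
    """Return (weights, threshold) separating class M from class H."""
    mu_m, mu_h = mean_vector(rows_m), mean_vector(rows_h)
    cov = pooled_covariance([rows_m, rows_h])
    weights = solve(cov, [a - b for a, b in zip(mu_m, mu_h)])
    midpoint = [(a + b) / 2 for a, b in zip(mu_m, mu_h)]
    threshold = sum(w * m for w, m in zip(weights, midpoint))  # equal priors assumed
    return weights, threshold

def classify(model, row):
    """Label a test row 'm' or 'h' by which side of the threshold it falls on."""
    weights, threshold = model
    score = sum(w * x for w, x in zip(weights, row))
    return "m" if score > threshold else "h"

if __name__ == "__main__":
    # Invented two-feature training rows, for illustration only.
    madison = [[1, 0], [0, 1], [-1, 0], [0, -1]]
    hamilton = [[11, 0], [10, 1], [9, 0], [10, -1]]
    model = train_lda(madison, hamilton)
    print([classify(model, row) for row in [[1, 1], [9, 3]]])
```

The training table corresponds to the two lists of labeled rows, the test table to the rows passed to `classify`, and the output is one lowercase label per test row, in the same style as the lda output shown on the slides.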

Linear Discriminant Analysis: example
Input training data: columns upon, 2-letter, 3-letter; rows labeled M M M M M H H H H H
Input test data: columns upon, 2-letter, 3-letter
Output: m m m m h

Some more LDA results
12 to Madison:
–upon, 1-letter, 2-letter
–upon, enough, there
–upon, there
11 to Madison:
–upon, 2-letter, 3-letter
< 6 to Madison:
–2-letter, 3-letter
–there, 1-letter, 2-letter

Some more LDA results
Class   Output of lda              Features tested
12 M    m m m m m m                upon, apt
12 M    m m m m m m                to, upon
11 M    m m m m m m h m m m m m    on, there
11 M    h m m m m m m m m m m m    an, by
10 M    m m m m m m h m m m h m    particularly, probability
8 M     m m m m m m h h h m h m    also, of
8 M     m m m h m m h h m m h m    always, of
7 M     h m m h m h h m h m m m    of, work
6 M     m m h m m m h h m h h h    there, language
5 M     m h m h h m h h h m m h    consequently, direction

Feature Selection Part II
Which combinations of features are best for LDA?
Are the features independent?
We did some random sampling:
–Choose features a, b, c, d
–Compute x = log a + log b + log c + log d
–Compute y = log (a+b+c+d)
–Plot x versus y

Selecting more features
What happens when more than 4 features are used for the lda?
Greedy approach
–Add features one at a time from two lists
–Perform lda on all features chosen so far
Is overfitting a problem?
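The greedy loop can be sketched as below. The slides do not say how the two lists were merged or how each run was scored, so the sketch takes a single pre-merged sequence of candidate features and a caller-supplied `run_lda` function; both names are hypothetical placeholders.

```python
def greedy_iterations(candidates, run_lda):
    """Add one feature per iteration and rerun the classifier each time.

    candidates: features in the order they should be added (how the two
                ranked lists are merged is left to the caller).
    run_lda:    hypothetical callback that takes the list of features
                chosen so far and returns that run's result.
    Yields (features_so_far, result) after every addition.
    """
    chosen = []
    for feature in candidates:
        if feature in chosen:
            continue  # a feature may appear in both lists; use it once
        chosen.append(feature)
        yield list(chosen), run_lda(list(chosen))

if __name__ == "__main__":
    # Placeholder scorer: reports only how many features were used.
    order = ["2-letter words", "upon", "1-letter words", "upon"]
    for features, result in greedy_iterations(order, run_lda=len):
        print(result, features)
```

Recording the result after every addition, rather than only at the end, is what makes it possible to see whether extra features stop helping, which is one way to look at the overfitting question raised above.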

First few greedy iterations
Class       Output of lda              Feature added
6 M 6 H     h m h h m h m m h m h m    2-letter words
12 M 0 H    m m m m m m                upon
12 M 0 H    m m m m m m                1-letter words
12 M 0 H    m m m m m m                5-letter words
11 M 1 H    m m m m m h m m m m m m    4-letter words
12 M 0 H    m m m m m m                there
12 M 0 H    m m m m m m                enough
11 M 1 H    m m m m m m h m m m m m    whilst
12 M 0 H    m m m m m m                3-letter words
11 M 1 H    m m m m m m h m m m m m    15-letter words

Listserv Data
70 Listserv archives
Over 1 million messages
Data was gathered by Andrei Anghelescu
–

Our Data
One Listserv, “CINEMA-L”
992 authors, messages
We look at 3 authors
–sstone 1077 messages
–thea
–jmiles_2 1481

Frustration

Feature Selection How do we find “good” features?

More Frustration

A Measure of Variance

Summary of LDA Results
Ran LDA using “I”, “is”, and “think”
Trained on 80%, tested on 20%
Correctly classified 122/186 documents
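For reference, 122 of 186 is an accuracy of about 65.6%. The evaluation protocol can be sketched as below; the random shuffle and the fixed seed are assumptions, since the slides do not say how the 80% and 20% portions were chosen, and the function names are made up for this example.

```python
import random

def split_80_20(documents, seed=0):
    """Shuffle the documents and return (training 80%, test 20%)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # assumed: a random split
    cut = len(docs) * 4 // 5           # integer arithmetic avoids float rounding
    return docs[:cut], docs[cut:]

def accuracy(predicted, actual):
    """Fraction of test documents whose predicted author is correct."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

if __name__ == "__main__":
    train, test = split_80_20(range(10))
    print(len(train), len(test))
    # The figure reported on the slide:
    print(round(122 / 186, 3))
```

With three authors, always guessing the most frequent one gives a baseline that depends on how the messages are distributed among them, so the raw accuracy is best read against that baseline.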

Future Work
Finish our 3 author experiment
Use more and different features
–Structural
– specific features
Analyze the relationship among features
Other authorship id problems
–Many authors
–Odd-man-out

Thanks!!!