Authorship Attribution CS533 – Information Retrieval Systems Metin KOÇ Metin TEKKALMAZ Yiğithan DEDEOĞLU 7 April 2006.


1 Authorship Attribution CS533 – Information Retrieval Systems Metin KOÇ Metin TEKKALMAZ Yiğithan DEDEOĞLU 7 April 2006

2 Outline: Overview (What is Authorship Attribution?, Brief History, Where and How to use it?); Stylometry (Style Markers); Classification Methods (Naïve Bayes, Support Vector Machine, k-Nearest Neighbor)

3 What is Authorship Attribution? Authorship attribution is the task of determining who wrote a text when its authorship is unclear. It is useful when two or more people claim to have written something, or when no one is willing (or able) to say that (s)he wrote the piece. In a typical scenario, a set of documents with known authorship is used for training; the problem is then to identify which of these authors wrote the unattributed documents.

4 A Brief History The advent of non-traditional authorship attribution techniques can be traced back to 1887, when Mendenhall first proposed counting features such as word length. His work was followed by that of Yule (1938) and Morton (1965), who used sentence lengths to judge authorship.

5 Where to use it? Authorship attribution can be used in a broad range of applications. Analysis of anonymous or disputed documents/books, such as the plays of Shakespeare (shakespeareauthorship.com). Plagiarism detection: it can be used to establish whether claimed authorship is valid.

6 Where to use it? (Cont’d) Criminal investigation: Ted Kaczynski was targeted as a primary suspect in the Unabomber case because authorship attribution methods determined that he could have written the Unabomber’s manifesto. Forensic investigations: verifying the authorship of e-mails and newsgroup messages, or identifying the source of a piece of intelligence.

7 Motivation Many publications exist, but no detailed work has been done for Turkish literature. The idea originated from “Kayıp Yazarın İzi, Elias’ın Gizi” by S. Oğuzertem. Is our work going to support his idea?

8 How to do it? When authors write, they use certain words unconsciously. Find some underlying ‘fingerprint’ for an author's style. The fundamental assumption of authorship attribution is that each author has habits in wording that make their writing unique.

9 How to do it? (Cont’d) It is well known that certain writers can be quickly identified by their writing style. Extract features from text that distinguish one author from another. Apply some statistical or machine learning technique to training data that shows examples and counterexamples of an author's work.

10 How to do it – Problems? Highly interdisciplinary area: expertise in linguistics, statistics, text authentication, literature? Too many style measures to apply? Statistical method: complicated or simple? Too many of these exist in the literature as well.

11 How to do it? (Cont’d) Determine the style markers. Parse all of the documents and extract the features. Combine the results to obtain the characteristics of each author. Apply each of the statistical/machine learning approaches to assign a given document to the most likely author.

12 Stylometry The science of measuring literary style. What are the distinguishing styles? Study the rarest, most striking features of the writer? Or study how writers use bread-and-butter words (e.g. "to", "with" etc. in English)?

13 Stylometry "People's unconscious use of everyday words comes out with a certain stamp" (David Holmes, stylometrist at the College of New Jersey). "Rare words are noticeable words, which someone else might pick up or echo unconsciously. It's much harder for someone to imitate my frequency pattern of 'but' and 'in'." (John Burrows, emeritus English professor of the University of Newcastle in Australia).

14 Style Markers in Our Study Frequency of the most frequent words. Token and type lengths (token: all words; type: unique words). For the sentence “I cannot bear to see a bear”: 7 tokens, 6 (context-free) types. Sentence lengths. Syllable count in tokens. Syllable count in types.
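The token/type distinction can be sketched in a few lines of Python. This is only an illustration, assuming a simple regex tokenizer and lowercasing; the function names `tokenize` and `token_type_counts` are hypothetical and not from the slides, and Turkish text would need locale-aware case folding.

```python
import re

def tokenize(text):
    """Return lowercase word tokens (runs of letters); a simplifying assumption."""
    return re.findall(r"[^\W\d_]+", text.lower())

def token_type_counts(text):
    """Count tokens (all words) and context-free types (unique words)."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))

n_tokens, n_types = token_type_counts("I cannot bear to see a bear")
print(n_tokens, n_types)  # 7 6, matching the slide's example
```

The two senses of "bear" collapse into one type here, which is what "context-free" means on the slide.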

15 Style Markers in General Some commonly used style markers: average sentence length; average syllables per word; average word length; distribution of parts of speech; function word usage; the type-token ratio; word frequencies; vocabulary distributions.
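A few of these markers can be computed directly from raw text. The sketch below is an assumption-laden illustration, not the presenters' feature extractor: sentence splitting on `.`, `!`, `?` is crude, the function-word list is a placeholder, and `style_markers` is a hypothetical name.

```python
import re

def style_markers(text, function_words=("the", "and", "to", "with")):
    """Compute a small dictionary of style markers for one text."""
    # Crude sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    markers = {
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }
    # Relative frequency of each chosen function word.
    for fw in function_words:
        markers["freq_" + fw] = words.count(fw) / len(words)
    return markers

print(style_markers("The cat sat. The dog ran to the cat."))
```

Each text then becomes a fixed-length feature vector that a classifier can consume.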

16-19 Test Set [slide content not preserved in the transcript]

20 Classification Methods How are the style markers used? Several methods exist, such as: k-NN (k-Nearest Neighbor); Bayesian analysis; SVM (Support Vector Machines); PCA (Principal Components Analysis); Markovian models; neural networks; decision trees. We are planning to use: Naïve Bayes; SVM; k-NN.

21 Naïve Bayes Approach In general, each style marker is considered to be a feature or a feature set. Existing text whose author is known is used for training. Several choices are possible for estimating the distributions of the feature values in a text with a known author, such as: maximum likelihood estimation; Bayes density estimation; Expectation-Maximization; etc.

22 Naïve Bayes Approach The values of the features (x) for the unattributed text are found. Since the probability densities are known for each author, Bayes' formula is used to find the author of the “anonymous” text: $A^* = \arg\max_{A_i} P(A_i \mid x) = \arg\max_{A_i} p(x \mid A_i)\, P(A_i)$
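For reference, the decision rule follows from Bayes' theorem in the standard way. The factorization over individual features $x_j$ in the last line is the usual "naïve" independence assumption; the slides do not spell it out, so treat it as a textbook reading rather than the presenters' exact formulation.

```latex
\begin{align*}
P(A_i \mid x) &= \frac{p(x \mid A_i)\, P(A_i)}{p(x)} \\
A^* &= \arg\max_{A_i} P(A_i \mid x)
     = \arg\max_{A_i} p(x \mid A_i)\, P(A_i)
     && \text{($p(x)$ does not depend on $A_i$)} \\
p(x \mid A_i) &= \prod_{j} p(x_j \mid A_i)
     && \text{(na\"ive independence assumption)}
\end{align*}
```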

23 An Oversimplified Sample Scenario Assume that: there are texts from two authors (two classes); only the number of words with 3 characters is used as the style marker (one feature). The classifier is trained with the texts, and a pdf of the feature is obtained for each author.

24 An Oversimplified Sample Scenario Assume that the unattributed text has 10 words with 3 characters. Check whether author 1 or author 2 has the higher probability of having 10 words with 3 characters. The unattributed text is assigned to the author with the higher probability for 10 words with 3 characters.
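The scenario can be played through with made-up numbers. In this sketch the training counts are invented for illustration, each author's pdf is assumed to be a Gaussian fitted by maximum likelihood, and the priors are assumed equal; none of these choices come from the slides.

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of a list of numbers."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, max(var, 1e-9)  # guard against zero variance

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def attribute(x, training, priors=None):
    """Return (best author, score per author), score = p(x|A_i) * P(A_i)."""
    authors = list(training)
    if priors is None:  # assumption: equal priors
        priors = {a: 1 / len(authors) for a in authors}
    scores = {a: gaussian_pdf(x, *fit_gaussian(training[a])) * priors[a]
              for a in authors}
    return max(scores, key=scores.get), scores

# Hypothetical counts of 3-character words in training texts of each author.
training = {"author 1": [4, 6, 5, 7, 5], "author 2": [11, 9, 12, 10, 13]}
best, scores = attribute(10, training)
print(best)
```

With these invented counts, a text containing 10 three-character words lies much closer to author 2's distribution, so it is assigned to author 2.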

25 Support Vector Machines (SVMs) Supervised learning method for classification and regression. Quite popular and successful in text categorization (Joachims et al.). Seeks a hyperplane separating two classes by: maximizing the margin; minimizing the classification error. The solution is obtained using quadratic optimization techniques.
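The two goals (wide margin, few errors) are commonly combined in the soft-margin formulation below. This is the standard textbook form, given here as an assumption about what the slide summarizes; the symbols $w$, $b$, $\xi_i$ and $C$ do not appear in the slides.

```latex
\begin{align*}
\min_{w,\, b,\, \xi} \quad & \tfrac{1}{2}\,\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{align*}
```

Minimizing $\lVert w \rVert^2$ maximizes the margin $2 / \lVert w \rVert$, the slack variables $\xi_i$ measure classification error, and $C$ trades one off against the other. The objective is quadratic with linear constraints, which is why quadratic optimization applies.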

26-32 Support Vector Machines (SVMs) [Figure sequence not preserved; legend: one marker denotes +1, the other denotes -1; the last slide labels the margin.] Sample adapted from Andrew Moore’s SVM slides

33 Support Vector Machines (SVMs) Support vectors define the hyperplane. Maximum-margin linear classifier: the simplest SVM. Support vectors lie on the margin and carry all the relevant information.

34 Support Vector Machines (SVMs) [figure not preserved]

35 Support Vector Machines (SVMs) How to find the hyperplane? [Figure not preserved: points labeled +1 and -1 around x=0.]

36 Support Vector Machines (SVMs) Move the training data into a higher dimension with kernel functions.

37 Support Vector Machines (SVMs) The hyperplane may not be linear in the original space.

38 Support Vector Machines (SVMs) Basis functions are of the form: [equation not preserved in the transcript]. Common kernel functions: polynomial; sigmoidal; radial basis.
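Since the slide's formulas did not survive, the following shows commonly used textbook forms of the three kernels named above. The parameter names (`degree`, `gamma`, `coef0`) and their default values are assumptions for illustration and may differ from what the slide showed.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """K(u, v) = (u . v + coef0) ** degree"""
    return (dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    """K(u, v) = tanh(gamma * u . v + coef0)"""
    return math.tanh(gamma * dot(u, v) + coef0)

def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

print(polynomial_kernel([1, 2], [3, 4]))  # (11 + 1) ** 2 = 144.0
```

A kernel returns the inner product of two points in the higher-dimensional space without constructing that space explicitly.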

39 Multi-class SVM A basic SVM handles only binary classification; how can multi-class (N classes) cases be handled? Create N SVMs: SVM 1 learns “Output==1” vs “Output != 1”; SVM 2 learns “Output==2” vs “Output != 2”; …; SVM N learns “Output==N” vs “Output != N”. When predicting the output, assign the label of the SVM that puts the input point furthest into the positive region.
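The prediction step of this one-vs-rest scheme can be sketched as follows. The three linear scorers are hand-picked, hypothetical stand-ins for trained SVMs, and the sketch assumes their decision values are on comparable scales so that they can be compared directly.

```python
def linear_scorer(w, b):
    """Return a decision function f(x) = w . x + b (positive means 'this class')."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

def one_vs_rest_predict(x, scorers):
    """Pick the label whose binary classifier gives the largest decision value."""
    return max(scorers, key=lambda label: scorers[label](x))

# Hypothetical stand-ins for N = 3 trained binary SVMs.
scorers = {
    1: linear_scorer([1.0, 0.0], -1.0),
    2: linear_scorer([0.0, 1.0], -1.0),
    3: linear_scorer([-1.0, -1.0], 0.5),
}
label = one_vs_rest_predict([3.0, 0.5], scorers)
print(label)  # 1: its scorer gives 2.0, the others give -0.5 and -3.0
```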

40 SVM Issues Choice of kernel functions. Computational complexity of the optimization problem.

41 k-Nearest Neighbour Classification Method Key idea: keep all the training instances. Given a query example, take a vote amongst its k nearest neighbours. Neighbours are determined by using a distance function.
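A minimal k-NN classifier along these lines might look like the sketch below. Euclidean distance and a plain majority vote are assumptions (the slides do not fix a distance function), and the example points are invented.

```python
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(query, examples, k, distance=euclidean):
    """examples: list of (feature_vector, label) pairs kept from training."""
    # Sort stored examples by distance to the query and keep the k nearest.
    neighbours = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    # Majority vote among the neighbours' labels.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

examples = [([1, 1], "A"), ([1, 2], "A"), ([2, 1], "A"),
            ([6, 6], "B"), ([6, 7], "B"), ([7, 6], "B")]
print(knn_predict([2, 2], examples, k=3))  # A
```

Training amounts to storing `examples`; all the work happens at query time, which is why the sort over every stored example dominates the cost.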

42 k-Nearest Neighbour Classification Method [Figures not preserved: examples for k=1 and k=4.] Probability interpretation: estimate p(y|x) as [equation not preserved in the transcript]. Sample adapted from Rong Jin’s slides
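The slide's own formula is lost; a common way to write the k-NN probability estimate is the fraction of the k nearest neighbours that carry label $y$. The notation $N_k(x)$ for the set of the k nearest training points is an assumption introduced here.

```latex
\[
\hat{p}(y \mid x) \;=\; \frac{1}{k} \sum_{i \in N_k(x)} \mathbf{1}[\, y_i = y \,]
\]
```

For example, if 3 of the k = 4 nearest neighbours have label $y$, the estimate is $3/4$.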

43 k-Nearest Neighbour Classification Method Advantages: training is really fast; can learn complex target functions. Disadvantages: slow at query time; efficient data structures are needed to speed up the query.

44 How to choose k? Use validation with the leave-one-out method.
For k = 1, 2, …, K:
Err(k) = 0
1. Randomly select a training data point that has not been tested yet and hide its class label.
2. Use the remaining data and the given k to predict the class label of the held-out data point.
3. Err(k) = Err(k) + 1 if the predicted label is different from the true label.
Repeat the procedure until all training examples have been tested.
Choose the k whose Err(k) is minimal.
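The leave-one-out procedure can be sketched as below. The k-NN helper, the Euclidean distance and the toy data set (two clusters plus one deliberately mislabeled point) are assumptions for illustration; visiting points in list order rather than randomly gives the same error counts, since every point is held out exactly once.

```python
import math
from collections import Counter

def knn_predict(query, examples, k):
    """Majority vote among the k nearest (Euclidean) stored examples."""
    def dist(ex):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, ex[0])))
    neighbours = sorted(examples, key=dist)[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def choose_k(examples, max_k):
    """Return (best k, {k: Err(k)}) using leave-one-out validation."""
    errors = {}
    for k in range(1, max_k + 1):
        err = 0
        for i, (x, y) in enumerate(examples):
            rest = examples[:i] + examples[i + 1:]  # hide point i
            if knn_predict(x, rest, k) != y:
                err += 1
        errors[k] = err
    best_k = min(errors, key=errors.get)  # smallest k among ties
    return best_k, errors

# Toy data: two clusters and one mislabeled point inside cluster "A".
examples = [([0, 0], "A"), ([0, 1], "A"), ([1, 0], "A"), ([1, 1], "A"),
            ([5, 5], "B"), ([5, 6], "B"), ([6, 5], "B"), ([6, 6], "B"),
            ([0.5, 0.5], "B")]
best_k, errors = choose_k(examples, max_k=3)
print(best_k, errors)
```

On this toy set a single nearest neighbour is misled by the mislabeled point, so a larger k yields fewer leave-one-out errors.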

49 How to choose k? (Cont’d) Example outcome of the leave-one-out procedure: Err(1) = 3, Err(2) = 2, Err(3) = 6, so k = 2 is chosen.

50 Future Work & Conclusion Preliminary feature distributions seem discriminative. Will apply the classification methods to the feature set. Will rank the features’ success rates. May come up with new style markers.

