Using Machine Learning Techniques in Stylometry Ramyaa, Congzhou He, Dr. Khaled Rasheed.

1 Using Machine Learning Techniques in Stylometry Ramyaa, Congzhou He, Dr. Khaled Rasheed

2 Introduction
- Stylometry
- Major problems facing stylometry
- Decision trees
- Artificial Neural Networks

3 Stylometry
- The measure of style
- Fundamental assumption: there is an unconscious aspect to an author’s style that cannot be consciously manipulated but that possesses quantifiable and distinctive features.
- Major applications today: clinical tools in disease detection, forensic tools in court trials, text categorization, and author attribution.

4 Major problems facing stylometry
- No consensus as to which characteristic features to use
- Which indicators to use: word length, sentence length, tests of position, the distribution of once-occurring words (hapax legomena), the frequencies of marker words, letter sequences, syllable length, or syntactic measures?

5 Major problems facing stylometry
- No consensus as to which methodology or techniques to apply in standard research
- Which techniques to use: statistical methods or automated pattern recognition methods?
- Statistical methods: e.g. Bayesian analysis, cluster analysis, and the widely used Principal Components Analysis (PCA).
- Automated pattern recognition methods: e.g. Artificial Neural Networks (ANN), Genetic Programming (GP).

6 Significant features of our paper
- Recognizing the works of five authors
- Use of unconventional indicators such as punctuation marks, as well as standard indicators such as function words
- Only 21 indicators, which suggests that, contrary to common belief, high-performance classification does not require many features

7 Data Extraction
78 samples, one chapter each, from five popular nineteenth-century British novelists
- Jane Austen: Pride and Prejudice, Chapters 1-5; Mansfield Park, Chapters 1-5; Emma, Chapters 1-5; Sense and Sensibility, Chapters 1-5

8 Data Extraction (continued)
- Charles Dickens: David Copperfield, Chapters 1-5; Great Expectations, Chapters 1-5; Hard Times, Chapters 1-6; A Tale of Two Cities, Chapters 1-6
- William Thackeray: Vanity Fair, Chapters 1-6; Men’s Wives, Chapters 1-6
- Emily Bronte: Wuthering Heights, Chapters 1-12
- Charlotte Bronte: Jane Eyre, Chapters 1-12

9 21 attributes as input
- type-token ratio
- mean word length
- mean sentence length
- standard deviation of sentence length
- mean paragraph length
- chapter length
- number of commas per thousand tokens
- number of semicolons per thousand tokens
- number of quotation marks per thousand tokens

10 21 attributes as input (continued)
- number of exclamation marks per thousand tokens
- number of hyphens per thousand tokens
- number of and’s per thousand tokens
- number of but’s per thousand tokens
- number of however’s per thousand tokens
- number of if’s per thousand tokens
- number of that’s per thousand tokens
- number of more’s per thousand tokens
- number of must’s per thousand tokens
- number of might’s per thousand tokens
- number of this’s per thousand tokens
- number of very’s per thousand tokens
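A rough sketch of how the 21 attributes listed above might be computed for one chapter. The tokenization rules, the sentence and paragraph splitting, the units for chapter and paragraph length (tokens), and every name in the code are assumptions made for illustration; the slides do not describe the preprocessing that was actually used.

```python
import re
import statistics

# Hypothetical names; the marker words and punctuation marks come from the slides.
FUNCTION_WORDS = ["and", "but", "however", "if", "that", "more",
                  "must", "might", "this", "very"]
PUNCTUATION = {"comma": ",", "semicolon": ";", "quotation": '"',
               "exclamation": "!", "hyphen": "-"}
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def words(text):
    """Lower-case word tokens (assumed tokenization)."""
    return WORD.findall(text.lower())


def extract_features(text):
    tokens = words(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    # Assumption: sentences end at ., ! or ?; paragraphs are separated by blank lines.
    sentences = [s for s in re.split(r"[.!?]+", text) if words(s)]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if words(p)]
    sentence_lengths = [len(words(s)) for s in sentences]
    per_thousand = 1000.0 / len(tokens)

    features = {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "mean_sentence_length": statistics.mean(sentence_lengths),
        "sd_sentence_length": statistics.pstdev(sentence_lengths),
        "mean_paragraph_length": len(tokens) / len(paragraphs),
        "chapter_length": len(tokens),
    }
    for name, mark in PUNCTUATION.items():
        features[name + "_per_1000"] = text.count(mark) * per_thousand
    for word in FUNCTION_WORDS:
        features[word + "_per_1000"] = tokens.count(word) * per_thousand
    return features


sample = "It was good, and it was very bad; but that is all. Was it? Yes!"
features = extract_features(sample)
```

Calling `extract_features` on each of the 78 chapters would give one 21-value row per sample, which is the shape of input the classifiers on the following slides expect.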

11 Decision Tree Learning
- See5 package by Quinlan (C5.0, a descendant of the ID3 algorithm)
- Features of decision trees: results are easy to understand; focus on individual attributes
- Fuzzy thresholds are used for continuous values
- Winnowing and boosting each give the best result: 82.4% accuracy on the test set, well above random guessing (20%).
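To illustrate how a decision tree focuses on one attribute at a time, here is a minimal sketch of choosing a split threshold for a single continuous attribute by information gain, the criterion used in ID3-style learners. This is not See5's implementation: it uses a crisp threshold rather than a fuzzy one, it omits winnowing and boosting, and the attribute values in the example are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate equal values
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Made-up "commas per thousand tokens" values for six hypothetical chapters.
commas = [60, 62, 65, 80, 85, 90]
authors = ["jane", "jane", "jane", "charles", "charles", "charles"]
threshold, gain = best_threshold(commas, authors)
```

A tree learner would repeat this search over all 21 attributes, split on the best one, and recurse on each branch.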

12 Result from winnowing
Evaluation on test data (17 cases):
Decision Tree
Size  Errors
5  3 (17.6%) <<
(a) (b) (c) (d) (e) <- classified as
4 1 (a): class jane
5 1 (b): class charles
2 (c): class william
1 1 (d): class emily
2 (e): class charlotte

13 Results from boosting
Evaluation on test data (17 cases):
boost  3 (17.6%) <<
(a) (b) (c) (d) (e) <- classified as
4 1 (a): class jane
5 1 (b): class charles
2 (c): class william
1 1 (d): class emily
2 (e): class charlotte
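The reported accuracy follows directly from the error counts in the two evaluations above. The short check below assumes that the chance level is a uniform guess among the five authors; the function name is illustrative.

```python
def accuracy(total_cases, errors):
    """Fraction of test cases classified correctly."""
    return (total_cases - errors) / total_cases


tree_accuracy = accuracy(17, 3)   # 17 test cases, 3 errors
chance_level = 1 / 5              # assumed: uniform guess among five authors
print(f"{tree_accuracy:.1%} vs chance {chance_level:.0%}")  # prints "82.4% vs chance 20%"
```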

14 Artificial Neural Network (ANN) Learning
- A practical and powerful method of pattern recognition
- Can invent new features that are not explicit in the input
- Takes all attributes into consideration
- Inductive rules are not accessible to humans

15 Many architectures were tried
- Kohonen SOMs, probabilistic nets, and nets based on statistical models were among them
- Feed-forward nets trained with back-propagation gave the best results
- The best network had 21 inputs and 10 outputs
- The best architecture had 15 hidden nodes in the first hidden layer and 11 in the second
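For a sense of scale, the layer sizes reported above (21 inputs, 15 and 11 hidden nodes, 10 outputs) can be sketched as a plain feed-forward pass. The sigmoid activation, the random weight initialization, and all function names are assumptions; the slides do not give these details, and back-propagation training is not shown.

```python
import math
import random

LAYER_SIZES = [21, 15, 11, 10]  # inputs, hidden 1, hidden 2, outputs (from the slide)


def init_network(sizes, seed=0):
    """Create (weights, biases) per layer with small random weights (assumed scheme)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, inputs):
    """Propagate one input vector through every layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


def count_parameters(layers):
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


network = init_network(LAYER_SIZES)
outputs = forward(network, [0.0] * 21)
n_parameters = count_parameters(network)  # 330 + 176 + 120
```

With these layer sizes the network has 626 trainable parameters, which is large relative to 78 samples and makes the held-out validation figures the relevant ones to read.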

16 Predictor analysis

17 Results from ANN
(a) (b) (c) (d) (e) <- classified as
2 (a): class jane
2 (b): class charles
2 (c): class william
2 4 (d): class emily
5 (e): class charlotte

18 Misclassifications
- No. 4: Pride and Prejudice Chapter 3 is misclassified as written by Charlotte Bronte.
- Nos. 67 & 71: Tale of Two Cities Chapter 1 and Chapter 5 are misclassified as written by William Thackeray.
- All the other authors are correctly classified (88.2% accuracy on the validation set).

19 Conclusion
- Very good results were obtained in both experiments
- Artificial Intelligence provides stylometry with excellent classifiers that require fewer input variables than traditional statistics
- Future research:
  - GA/GP
  - A general classifier applicable to all authors
  - Different sets of features

20 Thank you. Questions?

