Political Party, Gender, and Age Classification Based on Political Blogs Michelle Hewlett and Elizabeth Lingg
Introduction: Can individuals be classified by their writing style? Do people under 25 use different punctuation than those over 25? Do they use different words and phrases? Can you figure out someone’s political ideology by analyzing their writing with probabilistic methods?
Classifier: Hold-out cross-validation, with 80% of the data in the training set and 20% in the test set. Bloggers are classified using a feature vector, with features generated from the training data.
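The 80/20 hold-out split described on this slide could look roughly like the following. This is a minimal sketch, not the authors' code: the function name `hold_out_split`, the fixed seed, and the blogger IDs are all hypothetical.

```python
import random

def hold_out_split(items, train_fraction=0.8, seed=0):
    """Shuffle items and split them into (train, test) lists."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical blogger IDs standing in for the real data set.
bloggers = [f"blogger_{i}" for i in range(10)]
train, test = hold_out_split(bloggers)
```

Features would then be generated from `train` only, so that `test` stays unseen until evaluation.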
Features: Most frequent unigrams, bigrams, and trigrams (e.g., “Bush”, “troops in Iraq”, “McCain”); sentence length and word length; punctuation; pronoun usage.
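As an illustration of the n-gram features, the sketch below counts the most frequent n-grams across a set of posts. The tokenizer (lowercased runs of letters and apostrophes) and the sample posts are assumptions made for this example; the slides do not say how the text was tokenized.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_frequent_ngrams(texts, n, top_k):
    """Count n-grams over all texts and return the top_k (ngram, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(top_k)

# Hypothetical posts, for illustration only.
posts = ["Bring the troops in Iraq home", "The troops in Iraq need support"]
top_trigrams = most_frequent_ngrams(posts, n=3, top_k=2)
```

Calling the same function with `n=1` or `n=2` gives unigram and bigram counts.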
Features: Compute feature probabilities based on frequency in the training data. For example, if women use the word “myself” three times as often as men do, then P(female|myself) = 75%. Pick features that are not 50/50 male/female or 50/50 Republican/Democrat.
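The “myself” example can be worked through in code. The sketch below assumes the probability is the ratio of per-class usage rates (count divided by total words in that class), which reproduces the 75% figure when one class uses a word three times as often as the other. The function names, the counts, and the `margin` cutoff for discarding near-50/50 features are hypothetical.

```python
def class_probability(count_a, total_a, count_b, total_b):
    """Estimate P(class A | feature) from per-class usage rates."""
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    if rate_a + rate_b == 0:
        return 0.5  # feature never seen: no evidence either way
    return rate_a / (rate_a + rate_b)

def is_informative(p, margin=0.1):
    """Keep only features whose probability is clearly away from 50/50."""
    return abs(p - 0.5) >= margin

# Women use "myself" 30 times per 1000 words, men 10 times per 1000 words.
p_female_given_myself = class_probability(30, 1000, 10, 1000)
```

Here `p_female_given_myself` comes out at approximately 0.75, so the feature would be kept.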
Classification: Using the feature vector, bloggers with a low probability of being Republican were classified as Democrat, bloggers with a high probability of being Republican were classified as Republican, and bloggers with a moderate probability were left unclassified (“Unknown”).
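The three-way decision rule might be sketched as follows. The slides do not give the cutoffs or say how per-feature probabilities were combined into one score, so the thresholds `low=0.4` and `high=0.6` and the single input `p_republican` are assumptions for illustration.

```python
def classify(p_republican, low=0.4, high=0.6):
    """Map a probability of being Republican to a party label."""
    if p_republican <= low:
        return "Democrat"
    if p_republican >= high:
        return "Republican"
    return "Unknown"  # moderate probability: leave unclassified

labels = [classify(p) for p in (0.2, 0.5, 0.9)]
```

Widening the gap between `low` and `high` trades coverage for confidence: more bloggers end up as “Unknown”, but those who are labeled have stronger evidence behind them.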
Clustering: The k-means clustering algorithm was run on the entire data set. The sum of absolute differences was used instead of Euclidean distance because the differences were so small. Centroids were initialized to a reasonable guess.
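A minimal sketch of k-means with the sum of absolute differences as the distance is shown below. The slides do not say how centroids were updated, so this version assumes the usual per-coordinate mean. The sample points, the initial centroids, and the iteration count are hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans_l1(points, centroids, iterations=10):
    """Assign points to the nearest centroid (L1), then recompute centroids."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for each point.
        assignments = [
            min(range(len(centroids)), key=lambda k: l1_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: per-coordinate mean of each cluster's members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical 2-D feature vectors and hand-picked starting centroids.
points = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.85, 0.95)]
assignments, centroids = kmeans_l1(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

Passing the starting centroids explicitly mirrors the “reasonable guess” initialization mentioned on the slide, as opposed to picking them at random.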
Clustering Results (political party): [Figure legend: o = Cluster 1, * = Cluster 2, shown separately for Democrat, Republican, and Unknown bloggers]
Clustering Results (gender): [Figure legend: o = Cluster 1, * = Cluster 2, shown separately for Male, Female, and Unknown bloggers]
Conclusion: It is possible to identify characteristics of a writer based on writing style, words, and phrases. Political party gave the best results, followed by gender, then age.
Future Work: Generalize the results with a larger data set and a greater number of features. Generalize the results to a different domain. Possibly implement linear regression, logistic regression, or SVMs.