Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer
Twist: User Timeline Tweets Classifier
Auto-classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto-classified tweets
Twitter allows users to create custom Friend Lists based on the user handles.
Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.
Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until correct classification
▪ Remove special characters
▪ Tokenize
▪ Remove redundant letters in words
▪ Spell check
▪ Stemming
▪ Language identification
▪ Remove stop words
▪ Generate bigrams and change to lower case
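A few of the steps above can be sketched in plain Python. This is a minimal illustration, not the team's code: the stop-word list, the function names and the rule "collapse three or more repeated letters to one" are all assumptions, and spell check, stemming and language identification are left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"go", "such", "an", "a", "the", "is", "me"}


def collapse_repeats(word):
    """Collapse runs of 3+ identical characters to one: 'amaazzzing' -> 'amaazing'."""
    return re.sub(r"(.)\1{2,}", r"\1", word)


def preprocess(tweet):
    """Strip special characters, lower-case, tokenize, and drop stop words."""
    # Remove special characters: keep only ASCII letters and whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", tweet)
    tokens = text.lower().split()
    tokens = [collapse_repeats(t) for t in tokens]
    # Drop stop words and single-letter leftovers such as the 'm' from '\m/'.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def bigrams(tokens):
    """Return adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))
```

Note that collapsing repeats alone leaves "amaazing" misspelled, which is why a spell-check step still has to follow it.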
Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
Stopwords: SF Giants! amaazzzing feelin’!!!! \/ :D
Special chars: SF Giants amaazzzing feelin
Spell check: SF Giants amazing feeling
Stemming: SF Giants amazing feel me
Stopwords: SF Giants amazing feel
Logistic Regression Classifier
Reasons:
▪ Most popular linear classification technique for text classification
▪ Ability to handle multiple categories with ease
▪ Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python
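To show how logistic regression can handle several categories, here is a plain-Python stand-in that trains one binary model per category (one-vs-rest) on boolean features. It does not call LIBLINEAR; the function names, learning rate and epoch count are assumptions chosen for a toy example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, active):
    """Probability for one binary model; weights[0] is the bias term."""
    z = weights[0] + sum(weights[i] for i in active if i < len(weights))
    return sigmoid(z)


def train_binary(rows, targets, n_features, epochs=200, lr=0.5):
    """Fit one binary logistic regression by stochastic gradient descent.

    rows are sets of active 1-based feature indices; targets are 0 or 1.
    """
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for active, y in zip(rows, targets):
            gradient = score(weights, active) - y
            weights[0] -= lr * gradient
            for i in active:
                weights[i] -= lr * gradient
    return weights


def train_one_vs_rest(rows, labels, n_features):
    """Train one 'this category vs. all others' model per category."""
    return {
        c: train_binary(rows, [1 if l == c else 0 for l in labels], n_features)
        for c in sorted(set(labels))
    }


def predict(models, active):
    """Pick the category whose model gives the highest probability."""
    return max(models, key=lambda c: score(models[c], active))
```

With rows such as `[{1, 2}, {1, 3}, {4, 5}, {4, 6}]` and labels `[1, 1, 2, 2]`, a tweet containing only term 1 is assigned category 1 and a tweet containing only term 4 is assigned category 2.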
Indexing
SF Giants amazing feel
SF – 1, Giants – 2, amazing – 3, feel – 4
SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
1 1:1 2:1 3:1 4:1
Boolean training input for the SVM
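The indexing example can be reproduced with a short sketch. It assumes that the leading number on the training line is the category label (sports = 1) and that indices are handed out in order of first appearance; both function names are hypothetical.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for term in tokens:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_sparse_line(label, tokens, vocab):
    """Format one tweet as '<label> <index>:1 ...' with boolean features.

    Indices are emitted in ascending order; terms not in the vocabulary
    are skipped.
    """
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    return " ".join([str(label)] + [f"{i}:1" for i in indices])
```

For the tokens `["SF", "Giants", "amazing", "feel"]` with label 1, this yields `1 1:1 2:1 3:1 4:1`, matching the line shown in the slide. Because the features are boolean, a term that occurs twice in a tweet still contributes a single `index:1` entry.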
Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
▪ Tweets were not purely “Sports” or “Business” related
▪ Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
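The slides do not say how the weights were computed. One possible reading, shown purely as an assumption, is to weight each tweet by the fraction of its tokens found in a category term list and to keep only tweets above a threshold. The term list, the threshold and the function names below are all hypothetical.

```python
# Hypothetical term list; the actual corpus is not given in the slides.
SPORTS_TERMS = {"giants", "game", "score", "team"}


def category_weight(tokens, lexicon):
    """Fraction of a tweet's tokens found in the category lexicon (0.0 to 1.0)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in lexicon)
    return hits / len(tokens)


def keep_for_training(tokens, lexicon, threshold=0.25):
    """Keep a tweet only if enough of it looks category-related."""
    return category_weight(tokens, lexicon) >= threshold
```

Under this reading a personal message such as "happy birthday mom" gets weight 0 against the sports list and is dropped from the training data.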
Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^xoxo
Solution: Regular expressions and NLP toolkit
Different words, same root:
▪ Playing, plays, playful - play
Solution: Stemming
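The idea behind stemming can be illustrated with a deliberately naive suffix stripper. This is not Porter's algorithm, which applies a much larger set of ordered rules; the suffix list and the minimum stem length here are assumptions that happen to cover the example words.

```python
# Suffixes are tried in this order; the list is illustrative, not exhaustive.
SUFFIXES = ("ing", "ful", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w
```

Applied to "Playing", "plays" and "playful", the function returns "play" each time, so all three map to the same feature.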
Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Comma-separated values of the categories that each tweet belongs to
Accuracy here is 94%. Precision: 0.89. Recall: 0.89
Experiment with different kernels for better accuracy
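For reference, accuracy, precision and recall can be computed from true and predicted category codes as sketched below. The slides do not state how the per-category scores were averaged into a single 0.89, so this sketch reports precision and recall for one category at a time; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of tweets whose predicted category equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, category):
    """Precision and recall for a single category code."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == category and p == category)
    predicted = sum(1 for p in y_pred if p == category)
    actual = sum(1 for t in y_true if t == category)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

Precision answers "of the tweets labelled sports, how many really were sports", while recall answers "of the real sports tweets, how many were found".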
Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database: sqlite3
ML tool: LIBSVM
Stemming: Porter’s Stemming, NLP toolkit