Presentation on theme: "Democrats, Republicans and Starbucks Afficionados: User Classification in Twitter, KDD '11. Utku Şirin 1560838." Presentation transcript:
User Classification in Twitter, Authors
- Marco Pennacchiotti
  - Research Scientist at Yahoo! Labs
  - PhD from the University of Rome
  - Studied at Saarland University
  - Large-scale text mining, information extraction, and natural language processing
- Ana-Maria Popescu
  - Research Scientist at Yahoo! Labs
  - Graduated from the University of Washington
  - Social media research and analytics, user modelling, and sentiment analysis
Social Media
- A rapidly growing phenomenon
- Everyone is there, the conservatives and the revolutionaries!
- As data miners, we are interested in the very large amount of data available about social media users
- A basic and important task: classification of the users
  - Authoritative user extraction
  - Post reranking in web search (KDD Cup '12, Track #2)
  - User recommendation
- How to do the classification?
Classification Task
- The starting point is to fill in missing user attributes by classifying the user with respect to the missing attribute
  - Most users do not explicitly state their political view, for example
- There are various methods for solving the user classification problem
- What do we have in the social media domain?
  - User attributes: users have many attributes, such as age, gender, etc.
    - Based on these attributes, a classifier may be trained
  - Social network: users have friends that they follow
- How to define the classification task so that we can combine these two types of information 'structure', user attributes and social network?
Machine Learning Model
- A novel architecture combining user-centric information and social network information
  - User-centric information consists of the attributes of the users, which we call features hereafter
  - Social network information is the information about the friends of the users
  - Main contribution of the paper
- Use the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm
  - Train the GBDT on the given labeled input data
  - Label the users with the trained classifier
  - Then apply the same classifier to the friends of the users and label the friends as well
  - Lastly, update each user's label with respect to her friends' labels using an update formula
User-Centric Information
- User-centric information is represented as features. There is a very large feature set, mainly comprised of four parts
- Profile features (PROF)
  - User name, use of avatar picture, date of account creation, etc.
- Tweeting behavior features (BEHAV)
  - Average number of tweets per day, number of replies, etc.
- Linguistic content features
  - Richest feature set, comprised of five sub-feature sets
  - Uses Latent Dirichlet Allocation (LDA) as the language model
  - Prototypical words (LING-WORD): proto words are words that are emblematic of a class of users, found probabilistically from the data
    - First partition the users into n classes, then find the most frequent words for each class and take the k most used words for each class
  - Prototypical hashtags (LING-HASH): hashtags (#) denote topics
    - Same technique as for proto words
  - Generic LDA (LING-GLDA): topics are extracted with the LDA model, and each user is represented as a distribution over topics
    - LDA is trained on the full set of users
  - Domain-specific LDA (LING-DLDA): same as generic LDA, but trained on a task-specific training set, such as users that are only Democrats or Republicans
  - Sentiment words (LING-SENT): a small, manually collected set of terms. Ronald Reagan, good or bad?
    - The OpinionFinder tool gives the sentiment as positive, negative, or neutral
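The proto-word step above (partition users by class, count words, keep the top k per class) can be sketched roughly as follows. This is a minimal illustration using raw frequencies only; the paper's actual scoring may differ, and the function names `proto_words` and `proto_features` as well as the toy tweets are assumptions made for this example.

```python
from collections import Counter

def proto_words(users, k):
    """Return the k most frequent words per class.

    `users` is a list of (label, tweets) pairs; `tweets` is a list of strings.
    Raw frequency is an assumption here, not necessarily the paper's scoring.
    """
    counts = {}
    for label, tweets in users:
        bag = counts.setdefault(label, Counter())
        for tweet in tweets:
            bag.update(tweet.lower().split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    return {
        label: [w for w, _ in sorted(bag.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, bag in counts.items()
    }

def proto_features(tweets, protos):
    """Count how often a user's tweets use each class's proto words."""
    words = Counter(w for t in tweets for w in t.lower().split())
    return {label: sum(words[w] for w in ws) for label, ws in protos.items()}

# Toy data, invented for illustration only.
users = [
    ("democrat", ["healthcare reform now", "support healthcare"]),
    ("republican", ["lower taxes now", "cut taxes"]),
]
protos = proto_words(users, k=1)
features = proto_features(["taxes taxes healthcare"], protos)
```

Each user then gets one numeric feature per class, which could be fed to the classifier alongside the other feature sets.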
User-Centric Information
- Social network features
  - Combination of two different kinds of features
  - Friend accounts (SOC-FRIE): tells whether users with a given label, such as Democrats or Republicans, share the same friends
  - Prototypical replied (SOC-REP) and retweeted (SOC-RET) users: find the most frequently mentioned and retweeted (RT) users for each class of labeled users
- That's all for user-centric information. Too much, indeed…
Label Update Using Social Network
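The update formula itself is not reproduced in this transcript. As a rough sketch of what a graph-based label update can look like, the snippet below blends each user's classifier score with the mean score of her friends. The convex combination and the weight `alpha` are assumptions made for illustration, not the formula from the paper, and `update_scores` is a hypothetical name.

```python
def update_scores(scores, friends, alpha=0.5):
    """Blend each user's classifier score with the mean score of her friends.

    ASSUMPTION: a convex combination with weight `alpha`; this is an
    illustrative rule, not the update formula from the paper.
    """
    updated = {}
    for user, own in scores.items():
        known = [scores[f] for f in friends.get(user, []) if f in scores]
        if known:
            friend_mean = sum(known) / len(known)
            updated[user] = alpha * own + (1 - alpha) * friend_mean
        else:
            updated[user] = own  # no scored friends: keep the classifier score
    return updated

# Toy scores in [0, 1], e.g. the classifier's confidence for one class.
scores = {"u1": 0.6, "f1": 0.9, "f2": 0.7, "u2": 0.2}
friends = {"u1": ["f1", "f2"]}
updated = update_scores(scores, friends, alpha=0.5)
```

With these toy numbers, `u1` is pulled toward her friends' higher scores, while users without scored friends keep their original score.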
Experimental Evaluation
- Three binary classification tasks:
- Detecting political affiliation
  - Democrat or Republican
  - 5169 Democrats and 5169 Republicans
  - 1.2 million friends
- Ethnicity
  - African American or not
  - 3000 African Americans and 3000 non-African Americans
  - 508K friends
- Following a business
  - Following Starbucks or not
  - 5000 Starbucks followers and 5000 non-followers
  - 981K friends
Experimental Results, Political Affiliation Task
- The combined HYBRID model achieves its best result among the three tasks here; however, the increase over the ML model alone is not significant
- Social network features are very successful. This is because users with a particular political view are friends with users holding similar views
- Supporting this, the graph-based label update is also very successful on its own
Experimental Results, Starbucks Fans Task
- The social graph update is not as successful as in the political affiliation task, since following Starbucks does not build friendships
- Profile features are very successful on their own
- Linguistic features are also successful
- The HYBRID method still does not improve significantly on the ML system alone
Experimental Results, Ethnicity Task
- The HYBRID method fails: it performs worse than the ML model alone
- Social network features are so bad! As in the Starbucks task, ethnicity does not form a community. Hence, social network features and the graph-based update give very poor results
- The best single-feature results come from linguistic features. Linguistic features always have a point!
Overall Comments
- #1 The ML method is mostly good enough, and the update part of the architecture does not bring a significant improvement.
  - If the task allows users to form a community, the update function works; otherwise, it may even hurt the ML system alone, as in the ethnicity case
- #2 Linguistic features are always reliable
Review #1
- The novelty of combining the two types of information is attractive; however, there are serious points that should be criticized
- First of all, the classifier does only binary classification, and nothing is said about multi-dimensional classification. Doing multi-dimensional classification with binary classifiers is time-consuming and weakens the claim about scalability.
- As said, the idea of the novel architecture is attractive; however, the results show that the label update does not work well. Why? They did not give any appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied.
- There are too many features, but the reasons why these features were selected are not given.
- Moreover, applying the same ML model to the users and their friends replicates the information. Obviously, connected users will have some common and some different attributes; what is the point?
- The social graph should be used more effectively. I think it should not be used to update the labels but as a heavily weighted feature in the ML model. This is because we should superpose different information types instead of using one to compensate for the other. You can see the difference by thinking in terms of a vector space: update means spanning the same vector again, superposing means using both vectors concurrently.
  - For example, proto words could have been extracted using the network, somehow.
Review #2
- They mention Gradient Boosted Decision Trees (GBDT) but give no details about this classification algorithm; an explanation of GBDT is expected, at least in principle.
- The same holds for the Latent Dirichlet Allocation (LDA) language model. It is the first time I have heard of this language model, and they say nothing about LDA. It is only said that LDA is used as the language model and is associated with topics. But what is LDA, and how is it associated with topics?
- There is no data analysis, a very crucial omission in the paper: everything is data! They only give the number of users used in training, but what about the test set? Development set? Any other statistics about the data?
- Moreover, they used a different number of samples for each task. The success of the label update is much lower for the ethnicity task than for the political affiliation task; however, there are 1.2M friends for the political affiliation task but fewer than half as many, 508K, for the ethnicity task. Hence the cross-task comparisons are not reliable.
- The system they built has a strong constraint, indeed: it is language dependent (English). For example, the features based on frequencies of proto words will not work for Turkish due to its agglutinative nature, with many inflected forms of the same word: masayı, masada, masanın, masalardakilerin, etc. (A stemmer will most probably be needed.)
- The experiments are not done in a structured way. They have just run the experiments and shown the results. There is no useful commentary. Besides, they did not explain why they chose these experiments. For example, I would want to see the performance of feature subsets: since single feature sets alone mostly give very good results, some subset may improve the overall HYBRID result.
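To make the Turkish point concrete, here is a small sketch of how surface-form counts fragment across inflections and how even a crude stemmer merges them. `TOY_SUFFIXES` and `toy_stem` are hypothetical and deliberately tiny; they are not a real Turkish stemmer and only cover the example forms mentioned above.

```python
from collections import Counter

# Hypothetical, deliberately tiny suffix list; not a real Turkish stemmer.
TOY_SUFFIXES = ("nın", "yı", "da")

def toy_stem(word):
    """Strip the first matching toy suffix, keeping at least a 3-letter stem."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = ["masayı", "masada", "masanın", "masa"]

# Without stemming, every inflected form is its own low-frequency "word".
surface_counts = Counter(tokens)

# With stemming, the four forms collapse into one frequent stem.
stem_counts = Counter(toy_stem(t) for t in tokens)
```

A frequency-based proto-word feature would see four rare words in the first case and one frequent word in the second, which is why some form of stemming or morphological analysis would likely be needed for such languages.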