M. Özgür Cingiz Assoc. Prof. Dr. Banu DİRİ Yıldız Technical University Computer Engineering Department.

1 M. Özgür Cingiz Assoc. Prof. Dr. Banu DİRİ Yıldız Technical University Computer Engineering Department

2 Content: Aim & Scope; Social Networks and Microbloggers; Related Works; Proposed System; Training Phase; Testing Phase; Experimental Results and Discussion

3 Aim & Scope: Microbloggers' content is evaluated with respect to how well it reflects their categories. The category of each microblogger is known in advance. Users' content is also checked for up-to-dateness by using RSS news feeds. Two types of user content are used as test data: content of normal users and content of bots.

4 Social Networks: After the emergence of the Web 2.0 concept, people can no longer be regarded as simple content readers, since they can also contribute content as writers. Web 2.0 introduced concepts such as social networks, blogs and microblogs. Users share their opinions, feelings, images, favorite videos and other contributions as microblog content.

5 Social Networks: Users keep in touch with one another in social networks. Their content and fields of interest connect users to each other. Microblogs are among the most popular kinds of social network (Twitter, Tumblr, Jaiku). Microblogs limit the number of characters in each post.

6 Microblogs: In this work we use Twitter data. A user's post on Twitter is known as a tweet. A tweet is limited to 140 characters. According to 2012 statistics, Twitter has 465 million registered users, and users generate 175 million tweets each day. This enormous amount of raw data is very attractive for researchers.

7 Related Works: Ece extracted word-hashtag, user-hashtag and word-user relations from tweets to discover users' common areas of interest. Emre used the content of normal users and the content of bots in his work, and found that the content of bots is more categorical than the content of normal users. Duygu extracted categorical features from the tweets of 150 users and used these features to build a social network.

8 Related Works: Okay examined patterns in 1250 news items and found that 80% of them contain the N-V-N pattern. He looked for this pattern in tweets in order to discover news in tweets. Baris used microbloggers' content and text classification techniques to measure how well users fit their categories.

9 What do users want on Twitter? Users follow other users according to their fields of interest. A follower expects the users they follow to post content about their category, and expects that content to be recent. This work aims to determine how well users reflect their category. Content is also evaluated for up-to-dateness.

10 Proposed System Structure

11 Why do we choose RSS news feeds as training data? News providers such as BBC, CNN and others supply the category of their RSS news feeds. We want to investigate up-to-date tweets, so we look for current tweets by using RSS feeds. RSS news feeds are summaries of the news, so we can get the few important terms and eliminate the less distinctive terms. RSS news feeds have more reliable content than tweets; tweets may not be as informative as RSS news feeds.

12 RSS News Feeds: RSS (Rich Site Summary) is a web feed format. An RSS document is an XML file that contains a number of discrete news items.
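
As a minimal sketch of how the discrete news items could be read from such an XML file, the following uses Python's standard `xml.etree.ElementTree`. The sample feed, the function name `parse_items` and the choice of the `title` and `description` elements are assumptions for illustration; the slides do not say how the feeds were parsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; the feed content is invented for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Sports Feed</title>
    <item>
      <title>Team wins final</title>
      <description>The home team won the final match.</description>
    </item>
    <item>
      <title>Coach resigns</title>
      <description>The coach resigned after the season.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text, category):
    """Return one dict per <item>, tagged with the feed's category."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "category": category,
            "title": item.findtext("title", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


items = parse_items(SAMPLE_RSS, "Sports")
for entry in items:
    print(entry["category"], "|", entry["title"], "|", entry["summary"])
```

Each item keeps the category supplied by the news provider, which is what makes the feed usable as labeled training data.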

13 RSS News Feeds

14 Training Phase: We used 2105 RSS news feeds in the training phase. Four categories are used to form the training model: Sports, Technology, Economy and Entertainment. The training data contains 543 RSS feeds for Sports, 470 for Technology, 548 for Economy and 544 for Entertainment.

15 Training Phase, Preprocessing: First, we remove punctuation from the RSS news feeds. The second step is tokenization; in this study, words are used as tokens (terms). In previous text classification work, features were evaluated separately according to their linguistic labels. Nouns and verbs were found to be more distinctive, so we used only nouns and verbs to decrease the size of the feature space. The same term can appear in different forms, so lemmatization is applied to all terms (drink, drank, drunk -> drink). Finally, stop words, which have no distinctiveness for classification, are eliminated.
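
The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stop word list and the lemma lookup table are toy stand-ins, and the noun/verb selection step is omitted because it needs a part-of-speech tagger that the slides do not name.

```python
import string

# Toy resources, assumed for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}
LEMMAS = {"drank": "drink", "drunk": "drink", "drinks": "drink"}


def preprocess(text):
    """Punctuation removal, tokenization, lemmatization, stop word removal."""
    # 1. Remove punctuation.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenize on whitespace; words are the tokens (terms).
    tokens = no_punct.lower().split()
    # (Noun/verb selection would go here; omitted, it needs a POS tagger.)
    # 3. Lemmatize with a lookup table (a stand-in for a real lemmatizer).
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # 4. Drop stop words.
    return [token for token in lemmas if token not in STOP_WORDS]


terms = preprocess("The players drank water, and the coach drinks tea.")
print(terms)  # -> ['players', 'drink', 'water', 'coach', 'drink', 'tea']
```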

16 Training Phase: The TF-IDF weighting method is applied to all terms. After all preprocessing steps, the training model contains 7212 features, so feature reduction is necessary. We used two feature selection methods, Information Gain and Chi-Square Statistics, with the Weka tool to find the best feature subset. We tried different threshold values in the feature selection phase.
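
To make the chi-square selection step concrete, here is a small sketch of one common text-classification formulation: a 2x2 term/category contingency table, with each term scored by its maximum over categories. The study used Weka, whose exact computation may differ; the function names, the toy documents and the threshold value are assumptions.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic of a term for one category (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_category:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_features(docs, labels, threshold):
    """Keep terms whose best per-category chi-square score reaches the threshold."""
    vocabulary = set().union(*docs)
    categories = set(labels)
    return {
        term for term in vocabulary
        if max(chi_square(docs, labels, term, cat) for cat in categories) >= threshold
    }


# Toy documents, each represented as a set of terms.
docs = [{"goal", "match"}, {"goal", "team"}, {"phone", "chip"}, {"phone", "match"}]
labels = ["Sports", "Sports", "Technology", "Technology"]
selected = select_features(docs, labels, threshold=2.0)
print(sorted(selected))  # -> ['goal', 'phone']
```

In this toy data "match" occurs equally in both categories, so its score is 0 and it is dropped, which is the behavior a selection threshold is meant to produce.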

17 Training Phase: The highest F-measure value, 95.2%, is obtained with Chi-Square Statistics as the feature selection method and Multinomial Naive Bayes as the classifier. Feature selection reduces the 7212 features to 1277. Only these 1277 features, gathered from the training data, are used as the feature set in the study. After all these steps, Multinomial Naive Bayes and Support Vector Machines are used as classifiers.
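
For readers unfamiliar with Multinomial Naive Bayes, the sketch below shows the basic idea with raw term counts and Laplace smoothing. It is not the Weka implementation used in the study, which worked on TF-IDF-weighted features; the function names, the smoothing parameter `alpha` and the toy training documents are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    class_doc_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    model = {"log_prior": {}, "log_likelihood": {}, "vocabulary": vocabulary}
    for label, n_docs in class_doc_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(docs))
        total = sum(term_counts[label].values()) + alpha * len(vocabulary)
        model["log_likelihood"][label] = {
            term: math.log((term_counts[label][term] + alpha) / total)
            for term in vocabulary
        }
    return model


def predict_mnb(model, tokens):
    """Return the category with the highest posterior log score."""
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        # Terms outside the training vocabulary are ignored.
        score = log_prior + sum(
            likelihood[t] for t in tokens if t in model["vocabulary"]
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
train_docs = [
    ["goal", "match", "team"],
    ["chip", "phone", "device"],
    ["market", "bank", "price"],
    ["film", "music", "star"],
]
train_labels = ["Sports", "Technology", "Economy", "Entertainment"]
model = train_mnb(train_docs, train_labels)
print(predict_mnb(model, ["goal", "team"]))  # -> Sports
```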

18 Training Phase Feature Set of Proposed System

19 Testing Phase: After forming the training model, the tweets of 26 normal users and 27 bots are used as test data (6671 tweets in the testing phase). The category of each Twitter user is obtained from an application (we use the same categories for classification: sports, entertainment, technology and economy). How can we know whether a user is a bot or a normal user? After examining a user's tweets, if the user contacts other users, we categorize the user as a normal user; otherwise we categorize the user as a bot. Number of users per category (normal users / bots): Sports 7 / 11; Entertainment 7 / 5; Technology 6 / 5.
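
The bot/normal rule was applied by examining tweets, and the slides do not define "contact" precisely. One possible automated approximation, assuming that contact means an @mention of another account somewhere in the user's tweets, is sketched below; the regular expression and the function name are hypothetical.

```python
import re

# Assumption: "contacting another user" is approximated by an @mention.
# The lookbehind keeps e-mail addresses such as info@example.com from matching.
MENTION = re.compile(r"(?<!\w)@\w+")


def classify_user(tweets):
    """Label a user 'normal' if any tweet mentions another account, else 'bot'."""
    if any(MENTION.search(tweet) for tweet in tweets):
        return "normal"
    return "bot"


print(classify_user(["@alice great match!", "New phone released"]))  # -> normal
print(classify_user(["Markets close higher", "Mail info@example.com"]))  # -> bot
```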

20 Testing Phase

21 Removal of punctuation, tokenization, selection of features by their linguistic information, lemmatization and elimination of stop words are the preprocessing steps that we also apply to tweets. Hyperlinks to images and videos are also removed from tweets.
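
Hyperlink removal can be sketched with a regular expression. The pattern below is an assumption that covers links starting with `http://`, `https://` or `www.`; the slides do not say how links were detected.

```python
import re

# Assumed link pattern: a scheme or "www." followed by non-space characters.
URL = re.compile(r"(?:https?://|www\.)\S+")


def strip_links(tweet):
    """Remove hyperlinks and collapse the leftover whitespace."""
    return " ".join(URL.sub(" ", tweet).split())


cleaned = strip_links("Great goal! http://t.co/abc123 watch www.example.com/video now")
print(cleaned)  # -> Great goal! watch now
```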

22 Testing Phase: We want to check the up-to-dateness of tweets about their category. After preprocessing, therefore, not every word in a tweet is considered a feature: we look only for features from the training feature set, and a word that is not in the training feature set is eliminated. In this way we can eliminate abbreviations and meaningless words in tweets, and we can check the up-to-dateness of tweets (according to current news).
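
The filtering rule described above amounts to a set membership test. A minimal sketch, with an invented three-term feature set standing in for the 1277 training features:

```python
def keep_training_features(tokens, feature_set):
    """Drop every tweet token that is not in the training feature set."""
    return [token for token in tokens if token in feature_set]


# Hypothetical feature set and tweet tokens, for illustration only.
feature_set = {"goal", "match", "team"}
tweet_tokens = ["lol", "goal", "gr8", "match"]
kept = keep_training_features(tweet_tokens, feature_set)
print(kept)  # -> ['goal', 'match']
```

Because the feature set comes from current news feeds, a tweet that keeps many terms after this step shares vocabulary with recent news in its category.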

23 Testing Phase: After the feature selection step, TF-IDF weighting is applied to all terms. A tweet is limited to 140 characters, so it does not contain many words. After all preprocessing steps and feature selection criteria, some tweets are left with few or no features, so we specified three term count threshold values.
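
As a reminder of what TF-IDF weighting computes, here is a sketch using the textbook form, raw term frequency multiplied by log(N / document frequency). The exact variant applied in the study is not stated on the slides, so this form and the function name are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term of each document by tf * log(N / df)."""
    n_docs = len(docs)
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in docs:
        term_frequency = Counter(tokens)
        weighted.append({
            term: count * math.log(n_docs / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return weighted


# Toy documents, invented for illustration.
weights = tf_idf([["goal", "goal", "match"], ["match", "phone"]])
print(weights[0])
```

A term that appears in every document, such as "match" here, gets weight 0, while a term concentrated in one document, such as "goal", gets a high weight.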

24 Testing Phase: The three term count threshold values are: >2 (greater than two): tweets must have at least 2 terms; >3 (greater than three): tweets must have at least 3 terms; >4 (greater than four): tweets must have at least 4 terms. The three resulting test data sets are used separately in the testing phase.
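
The three test sets can be built with a simple length filter. The slide labels the thresholds ">2", ">3" and ">4" but describes them as "at least" 2, 3 and 4 terms; the sketch below assumes the "at least" reading, and the comparison would change to `>` under the other reading.

```python
def filter_by_term_count(tweets, min_terms):
    """Keep tweets with at least `min_terms` remaining terms (assumed reading)."""
    return [terms for terms in tweets if len(terms) >= min_terms]


# Toy tweets after preprocessing and feature filtering, invented for illustration.
tweets = [
    ["goal"],
    ["goal", "match"],
    ["goal", "match", "team"],
    ["goal", "match", "team", "coach"],
]
test_sets = {m: filter_by_term_count(tweets, m) for m in (2, 3, 4)}
for min_terms, subset in test_sets.items():
    print(min_terms, len(subset))
```

With this toy data the subsets contain 3, 2 and 1 tweets, mirroring the shrinking test sets that a stricter threshold produces.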

25 Testing Phase: Table: number of user tweets at each term count threshold value (>2, >3, >4), for normal users and for bots. For both user types, the number of tweets decreases as the term count threshold value increases.

26 Testing Phase: The testing phase uses two classifiers, SVMs and MNNB (Multinomial Naive Bayes); three term count threshold values, >2, >3 and >4; and two types of user tweets, tweets of bots and tweets of normal users. F-measure is used to evaluate classification performance.
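
F-measure is the harmonic mean of precision and recall. The sketch below computes it for one category at a time; how the study averaged across categories is not stated on the slides, so no averaging is shown, and the labels are invented.

```python
def f_measure(true_labels, predicted_labels, category):
    """F-measure (harmonic mean of precision and recall) for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration.
true_labels = ["Sports", "Sports", "Technology", "Economy"]
predicted = ["Sports", "Technology", "Technology", "Economy"]
for category in ("Sports", "Technology", "Economy"):
    print(category, f_measure(true_labels, predicted, category))
```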

27 Experimental Results & Discussion: Table: F-measure of MNNB and SVMs at each term count threshold value (>2, >3, >4), for tweets of bots and tweets of normal users. Classification performance is higher for bots' tweets than for normal users' tweets, so tweets of bots are more categorical than tweets of normal users. MNNB outperforms SVMs at each threshold value. For both user types, classification performance increases as the term count threshold value increases, which indicates that tweets with more terms give better results.

28 Conclusion: In this study we evaluated how well normal users and bots reflect their categories. We used RSS news feeds to check whether users' content is up to date. The classification results show that the content of bots reflects its categories better than the content of normal users, and that tweets of bots are more up to date than tweets of normal users.

29 References
Aslan, O., Revealing An Analysis of News On Microblogging Systems, Master's Thesis, Boğaziçi Üniversitesi, 2010.
Shamma, D. A., L. Kennedy, and E. F. Churchill, Tweet the debates: understanding community annotation of uncollected sources, WSM09: Proceedings of the first SIGMM workshop on Social media, pp. 3-10, ACM, New York, NY, USA, 2009.
Akman, D. S., Revealing Microblogger Interests By Analyzing Contributions, Master's Thesis, Boğaziçi Üniversitesi, 2010.
Yurtsever, E., Sweettweet: A Semantic Analysis For Microblogging Environments, Master's Thesis, Boğaziçi Üniversitesi, 2010.
Vieweg, S., A. L. Hughes, K. Starbird, and L. Palen, Microblogging during two natural hazards events: what twitter may contribute to situational awareness, in Mynatt, E. D., D. Schoner, G. Fitzpatrick, S. E. Hudson, K. Edwards, and T. Rodden (editors), CHI, pp , ACM,
Güç, B., Information Filtering on Micro-blogging Services, Master's Thesis, Swiss Federal Institute of Technology Zürich, 2010.
Leopold, E., and J. Kindermann, Text categorization with support vector machines. How to represent texts in input space?, Machine Learning 46, pp , 2002.
Yang, Y., and X. Liu, A Re-Examination of Text Categorization Methods, In Proc. 22nd Ann. Int. ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49,

30 Thank You
