M. Özgür Cingiz Assoc. Prof. Dr. Banu DİRİ Yıldız Technical University Computer Engineering Department.


Content
Aim & Scope
Social Networks and Microbloggers
Related Works
Proposed System
Training Phase
Testing Phase
Experimental Results and Discussion

Aim & Scope
The content of microbloggers is evaluated with respect to how well it reflects their categories. The category of each microblogger is known in advance. User content is also checked for up-to-dateness by using RSS news feeds. Two types of user content are used as test data: content of normal users and content of bots.

Social Networks
After the emergence of the Web 2.0 concept, people can no longer be regarded as simple content readers, since they can also contribute content as writers. Web 2.0 introduced concepts such as social networks, blogs and microblogs. Users share their opinions, feelings, images, favorite videos and other contributions as microblog content.

Social Networks
Users keep in touch with one another in social networks. Their content and fields of interest connect users to each other. Microblogs are among the most popular kinds of social network (Twitter, Tumblr, identi.ca, Jaiku). Microblogs limit the number of characters in each piece of content.

Microblogs
In this work we use Twitter data. A piece of user content on Twitter is called a tweet. A tweet is limited to 140 characters. According to 2012 statistics, Twitter has 465 million registered users, who generate 175 million tweets each day. This enormous amount of raw data is very attractive for researchers.

Related Works
Ece extracted word-hashtag, user-hashtag and word-user relations from tweets to discover users' common areas of interest. Emre used the content of normal users and the content of bots in his work, and found that the content of bots is more categorical than the content of normal users. Duygu extracted categorical features from the tweets of 150 users and used these features to build a social network.

Related Works
Okay examined patterns in 1250 news items and found that 80% of them contain the N-V-N pattern. He looked for this pattern in tweets in order to discover news in tweets. Baris used microbloggers' content and text classification techniques to measure how well users fit their categories.

What does a user want on Twitter?
Users follow other users according to their fields of interest. A follower expects the users they follow to post content about their category, and expects that content to be recent. This work aims to determine how well users reflect their category. Content is also evaluated for up-to-dateness.

Proposed System Structure

Why do we choose RSS news feeds as training data?
News providers such as BBC and CNN supply the category of each RSS news feed. We want to investigate up-to-date tweets, so we look for current tweets by using RSS feeds. An RSS news feed is a summary of the news, so we get a few important terms and eliminate the less distinctive terms of the full story. RSS news feeds have more reliable content than tweets, and tweets may not be as informative as RSS news feeds.

RSS News Feeds
RSS (Rich Site Summary) is a web feed format. An RSS document is an XML file that contains a number of discrete news items.
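Because an RSS document is plain XML, its news items can be read with a standard XML parser. The following is a minimal sketch, not the authors' implementation: the sample feed, the `parse_rss_items` helper and the choice to join title and description into one text are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; titles and descriptions are invented for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Sports Feed</title>
    <item>
      <title>Team wins final</title>
      <description>The home team won the cup final on Sunday.</description>
    </item>
    <item>
      <title>Coach resigns</title>
      <description>The coach stepped down after the season.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text, category):
    """Return one labeled training text per <item> element of the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        # Assumption: title and summary together form the training text.
        items.append({"category": category, "text": f"{title} {summary}".strip()})
    return items
```

Calling `parse_rss_items(SAMPLE_RSS, "sports")` yields two labeled texts, one per news item, with the category taken from the feed the item came from.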

RSS News Feeds

Training Phase
We used 2105 RSS news feeds in the training phase. Four categories were used to form the training model: Sports, Technology, Economy and Entertainment.
543 RSS feeds for Sports
470 RSS feeds for Technology
548 RSS feeds for Economy
544 RSS feeds for Entertainment

Training Phase: Preprocessing
First, we remove punctuation from the RSS news feeds. The second step is tokenization; in this study words are used as tokens (terms). In previous text classification work, features were evaluated separately according to their linguistic labels. Nouns and verbs were found to be the most distinctive, so we used only nouns and verbs to decrease the size of the feature space. The same term can appear in different forms, so lemmatization is applied to all terms; for example, drink, drank and drunk all map to drink. Finally, stop words, which have no distinctiveness for classification, are eliminated.
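The preprocessing steps can be sketched as a small pipeline. This is an illustrative sketch under stated assumptions: the stop word list and the lemma table are tiny hypothetical stand-ins for real linguistic resources, and the selection of nouns and verbs is left out because it needs a part-of-speech tagger.

```python
import string

# Hypothetical, deliberately tiny resources; a real system would use full lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is", "after"}
LEMMAS = {"drank": "drink", "drunk": "drink", "drinks": "drink"}


def preprocess(text):
    """Remove punctuation, tokenize into words, lemmatize, drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    # Lemmatization by table lookup; unknown words are kept unchanged.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [token for token in lemmas if token not in STOP_WORDS]
```

For example, `preprocess("He drank the water, and she drinks tea.")` maps both "drank" and "drinks" to "drink" and drops "the" and "and".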

Training Phase
The TF-IDF weighting method is applied to all terms. After all preprocessing steps the training model contains 7212 features, so feature reduction is necessary. We used two feature selection methods, Information Gain and Chi-Square statistics, with the Weka tool to find the best feature subset. We tried different threshold values in the feature selection phase.
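The slides rely on Weka for weighting and feature selection. To make the two ideas concrete, here is a minimal pure-Python sketch of one common TF-IDF variant (raw term frequency times log(N/df)) and of the chi-square statistic for a term and a category computed from a 2x2 contingency table. The function names are hypothetical, and Weka's exact formulas may differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term of each tokenized document by tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weights


def chi_square(docs, labels, term, category):
    """Chi-square score of a term for a category from document counts."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denominator == 0 else n * (a * d - c * b) ** 2 / denominator
```

Terms would then be ranked by their score, and only those above a chosen threshold kept as features.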

Training Phase
The highest F-measure, 95.2%, is obtained by using Chi-Square statistics as the feature selection method and Multinomial Naive Bayes as the classifier. Feature selection reduces the 7212 features to 1277 features. Only these 1277 features, which are gathered from the training data, are used as the feature set in the study. After all these steps, Multinomial Naive Bayes and Support Vector Machines are used as classifiers.
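As an illustration of how a Multinomial Naive Bayes classifier assigns a category, the sketch below trains on raw term counts with add-one (Laplace) smoothing. It is a simplified stand-in, not the Weka classifier used in the study: the function names are hypothetical, and using counts rather than TF-IDF weights is an assumption made to keep the example short.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = {term for doc in docs for term in doc}
    docs_per_class = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    model = {"vocab": vocab, "prior": {}, "likelihood": {}}
    for label, n_docs in docs_per_class.items():
        model["prior"][label] = math.log(n_docs / len(docs))
        total = sum(term_counts[label].values())
        # Add-one smoothing so unseen terms do not zero out a class.
        model["likelihood"][label] = {
            t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
    return model


def predict_mnb(model, doc):
    """Return the class with the highest posterior log score."""
    scores = {}
    for label, prior in model["prior"].items():
        likelihood = model["likelihood"][label]
        # Terms outside the training vocabulary are ignored.
        scores[label] = prior + sum(
            likelihood[t] for t in doc if t in model["vocab"]
        )
    return max(scores, key=scores.get)
```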

Training Phase Feature Set of Proposed System

Testing Phase
After forming the training model, tweets of 26 normal users and tweets of 27 bots are used as test data (6671 tweets for the testing phase). The category of each Twitter user is obtained from the wefollow.com application; we use the same categories for classification: sports, entertainment, technology and economy.
How can we know whether a user is a bot or a normal user? After examining a user's tweets, if the user contacts other users we categorize the user as a normal user; otherwise we categorize the user as a bot.

Category         # of Normal Users    # of Bots
Sports                   7               11
Entertainment            7                5
Technology               6                5
Entertainment            7                5
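The rule for separating normal users from bots can be sketched in a few lines. The slides say only that a user who contacts other users is treated as a normal user; reading "contact" as an @mention in at least one tweet is an assumption of this sketch, and `is_normal_user` is a hypothetical helper name.

```python
import re

# An @mention: "@" followed by word characters, not preceded by a word
# character (so e-mail addresses such as a@b.com are not counted).
MENTION = re.compile(r"(?<!\w)@\w+")


def is_normal_user(tweets):
    """Treat a user as normal if any tweet addresses another user."""
    return any(MENTION.search(tweet) for tweet in tweets)
```

A user for whom this returns `False` would be categorized as a bot under the same assumption.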

Testing Phase

The preprocessing steps are applied to tweets too: removal of punctuation, tokenization, selection of features according to their linguistic labels, stemming and elimination of stop words. Hyperlinks to images and videos are also removed from tweets.
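Removing hyperlinks is most easily done before punctuation is stripped, while the links are still intact. A minimal sketch follows; the URL pattern and the `strip_links` name are assumptions, and the pattern is intentionally loose.

```python
import re

# Anything starting with http://, https:// or www. up to the next whitespace.
URL = re.compile(r"(?:https?://|www\.)\S+")


def strip_links(tweet):
    """Remove hyperlinks from a tweet and normalize the remaining whitespace."""
    return " ".join(URL.sub(" ", tweet).split())
```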

Testing Phase
We want to check the up-to-dateness of tweets with respect to their category, so after preprocessing not every word in a tweet is considered a feature. If a word in a tweet is not in the training feature set, the word is eliminated; we look only for features obtained from the training feature set. In this way we can eliminate abbreviations and meaningless words in tweets, and we can check the up-to-dateness of tweets according to current news.
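The filtering step amounts to a set membership test on each token. A minimal sketch, assuming tweets are already preprocessed into token lists; `keep_training_features` is a hypothetical name.

```python
def keep_training_features(tokens, feature_set):
    """Keep only tokens that occur in the training feature set, in order."""
    return [token for token in tokens if token in feature_set]
```

With a feature set of `{"goal", "match"}`, the tokens `["lol", "goal", "gr8", "match"]` reduce to `["goal", "match"]`, so abbreviations and words absent from current news drop out.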

Testing Phase
After feature selection, TF-IDF weighting is applied to all terms. A tweet is limited to 140 characters, so it does not contain many words. After all preprocessing steps and the feature selection criteria, some tweets are left with few or no features, so we specified three term count threshold values.

Testing Phase
The three term count threshold values are:
>2 (greater than two): tweets must have at least 2 terms.
>3 (greater than three): tweets must have at least 3 terms.
>4 (greater than four): tweets must have at least 4 terms.
The three resulting test data sets are used separately in the testing phase.
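Building the three test sets can be sketched as a filter on the number of remaining terms. This sketch follows the "at least N terms" wording of the slide; if the ">N" labels are meant strictly, the comparison would need to change. Both function names are hypothetical.

```python
def filter_by_term_count(tweets, min_terms):
    """Keep tokenized tweets that still have at least min_terms terms."""
    return [tweet for tweet in tweets if len(tweet) >= min_terms]


def build_test_sets(tweets, thresholds=(2, 3, 4)):
    """Return one filtered test set per term count threshold."""
    return {m: filter_by_term_count(tweets, m) for m in thresholds}
```

Each higher threshold keeps a subset of the tweets kept by the lower one, so the test sets shrink as the threshold grows.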

Testing Phase
[Table: number of tweets of normal users and number of tweets of bots at term count threshold values >2, >3 and >4]
For both user types, the number of tweets decreases when the term count threshold value increases.

Testing Phase
The testing phase uses:
2 classifiers: SVMs and MNNB
3 term count threshold values: >2, >3 and >4
2 types of user tweets: tweets of bots and tweets of normal users
F-measure is used to evaluate classification performance.
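F-measure is the harmonic mean of precision and recall. The sketch below computes it for one category from true and predicted labels; how the study averages over the four categories is not stated in the slides, so only the per-category value is shown, and `f_measure` is a hypothetical helper name.

```python
def f_measure(true_labels, predicted_labels, category):
    """Harmonic mean of precision and recall for a single category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```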

Experimental Results & Discussion
[Table: F-measure of MNNB and SVMs at term count threshold values >2, >3 and >4, for tweets of bots and tweets of normal users]
Classification performance on bots' tweets is higher than on normal users' tweets, so tweets of bots are more categorical than tweets of normal users. MNNB outperforms SVMs at each threshold value. For both user types, classification performance increases when the term count threshold value increases, which indicates that tweets with more terms give better results.

Conclusion
In this study we evaluated how well normal users and bots reflect their categories. We used RSS news feeds to check whether users' content is up to date. The classification results show that the content of bots reflects its category more than the content of normal users does, and that tweets of bots are more up to date than tweets of normal users.

References
Aslan, O., Revealing An Analysis of News On Microblogging Systems, Master's Thesis, Boğaziçi Üniversitesi, 2010
Shamma, D. A., L. Kennedy, and E. F. Churchill, Tweet the debates: understanding community annotation of uncollected sources, WSM09: Proceedings of the first SIGMM workshop on Social media, pp. 3-10, ACM, New York, NY, USA, 2009
Akman, D. S., Revealing Microblogger Interests By Analyzing Contributions, Master's Thesis, Boğaziçi Üniversitesi, 2010
Yurtsever, E., Sweettweet: A Semantic Analysis For Microblogging Environments, Master's Thesis, Boğaziçi Üniversitesi, 2010
Vieweg, S., A. L. Hughes, K. Starbird, and L. Palen, Microblogging during two natural hazards events: what twitter may contribute to situational awareness, Mynatt, E. D., D. Schoner, G. Fitzpatrick, S. E. Hudson, K. Edwards, and T. Rodden (editors), CHI, pp , ACM,
Güç, B., Information Filtering on Micro-blogging Services, Master's Thesis, Swiss Federal Institute of Technology Zürich, 2010
Leopold, E., and J. Kindermann, Text categorization with support vector machines. How to represent texts in input space?, Machine Learning 46, pp , 2002
Yang, Y., and X. Liu, A Re-Examination of Text Categorization Methods, In Proc. 22nd Ann. Int. ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49,

Thank You