Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning.

Slides:



Advertisements
Similar presentations
Mining User Similarity Based on Location History Yu Zheng, Quannan Li, Xing Xie Microsoft Research Asia.
Advertisements

The 20th International Conference on Software Engineering and Knowledge Engineering (SEKE2008) Department of Electrical and Computer Engineering
Data Mining and Machine Learning Lab eTrust: Understanding Trust Evolution in an Online World Jiliang Tang, Huiji Gao and Huan Liu Computer Science and.
Understanding Cancer-based Networks in Twitter using Social Network Analysis Dhiraj Murthy Daniela Oliveira Alexander Gross Social Network Innovation Lab.
Social Media Mining Chapter 5 1 Chapter 5, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.
Sarcasm Detection on Twitter A Behavioral Modeling Approach
Ao-Jan Su † Y. Charlie Hu ‡ Aleksandar Kuzmanovic † Cheng-Kok Koh ‡ † Northwestern University ‡ Purdue University How to Improve Your Google Ranking: Myths.
Connecting Users across Social Media Sites: A Behavioral-Modeling Approach Jingchi Zhang.
Commentary-based Video Categorization and Concept Discovery By Janice Leung.
The Social Web: A laboratory for studying s ocial networks, tagging and beyond Kristina Lerman USC Information Sciences Institute.
COMMUNITIES IN MULTI-MODE NETWORKS 1. Heterogeneous Network Heterogeneous kinds of objects in social media – YouTube Users, tags, videos, ads – Del.icio.us.
Research Meeting Seungseok Kang Center for E-Business Technology Seoul National University Seoul, Korea.
«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,
Right Buddy Makes the Difference: an Early Exploration of Social Relation Analysis in Multimedia Applications Jitao Sang, Changsheng Xu*. 1 Institute of.
REZA ZAFARANI AND HUAN LIU DATA MINING AND MACHINE LEARNING LABORATORY (DMML) ARIZONA STATE UNIVERSITY KDD 2013 – CHICAGO, ILLINOIS.
Tennessee Technological University1 The Scientific Importance of Big Data Xia Li Tennessee Technological University.
Authors: Xu Cheng, Haitao Li, Jiangchuan Liu School of Computing Science, Simon Fraser University, British Columbia, Canada. Speaker : 童耀民 MA1G0222.
Modeling Relationship Strength in Online Social Networks Rongjing Xiang: Purdue University Jennifer Neville: Purdue University Monica Rogati: LinkedIn.
Data Mining and Machine Learning Lab Exploring Temporal Effects for Location Recommendation on Location-Based Social Networks Huiji Gao, Jiliang Tang,
Using Transactional Information to Predict Link Strength in Online Social Networks Indika Kahanda and Jennifer Neville Purdue University.
Chapter 8 Introduction to Hypothesis Testing
Advanced Software Engineering PROJECT. 1. MapReduce Join (2 students)  Focused on performance analysis on different implementation of join processors.
Data Mining and Machine Learning Lab Network Denoising in Social Media Huiji Gao, Xufei Wang, Jiliang Tang, and Huan Liu Data Mining and Machine Learning.
Group Recommendations with Rank Aggregation and Collaborative Filtering Linas Baltrunas, Tadas Makcinskas, Francesco Ricci Free University of Bozen-Bolzano.
Streaming Predictions of User Behavior in Real- Time Ethan DereszynskiEthan Dereszynski (Webtrends) Eric ButlerEric Butler (Cedexis) OSCON 2014.
Intelligent DataBase System Lab, NCKU, Taiwan Josh Jia-Ching Ying 1, Eric Hsueh-Chan Lu 2 and Vincent S. Tseng 1 1 Institute of Computer Science and Information.
Tie Strength, Embeddedness & Social Influence: Evidence from a Large Scale Networked Experiment Sinan Aral, Dylan Walker Presented by: Mengqi Qiu(Mendy)
Internet Skills The World Wide Web (Web) consists of billions of interconnected pages of information from a wide variety of sources. In this section: Web.
Chengjie Sun,Lei Lin, Yuan Chen, Bingquan Liu Harbin Institute of Technology School of Computer Science and Technology 1 19/11/ :09 PM.
Data Mining and Machine Learning Lab Unsupervised Feature Selection for Linked Social Media Data Jiliang Tang and Huan Liu Computer Science and Engineering.
Predicting Positive and Negative Links in Online Social Networks
1 CS 391L: Machine Learning: Experimental Evaluation Raymond J. Mooney University of Texas at Austin.
Aemen Lodhi (Georgia Tech) Amogh Dhamdhere (CAIDA)
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 4 Designing Studies 4.2Experiments.
Chapter 12: Web Usage Mining - An introduction Chapter written by Bamshad Mobasher Many slides are from a tutorial given by B. Berendt, B. Mobasher, M.
Online Kinect Handwritten Digit Recognition Based on Dynamic Time Warping and Support Vector Machine Journal of Information & Computational Science, 2015.
The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign User Profiling in Ego-network: Co-profiling Attributes and Relationships.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.
Prediction of Influencers from Word Use Chan Shing Hei.
Facebook: An online tool for offline marketing NCTU — Hsinchu, Taiwan Timothy MacKay.
Fig.1. Flowchart Functional network identification via task-based fMRI To identify the working memory network, each participant performed a modified version.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
KNR 445 Statistics t-tests Slide 1 Introduction to Hypothesis Testing The z-test.
Mining Social Media Data Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab Nov.
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 4 Designing Studies 4.2Experiments.
Date: 2015/11/19 Author: Reza Zafarani, Huan Liu Source: CIKM '15
Chapter 13 Repeated-Measures and Two-Factor Analysis of Variance
Network Community Behavior to Infer Human Activities.
KDD 2011 Doctoral Session Modeling Trustworthiness of Online Content V. G. Vinod Vydiswaran Advisors: Prof.ChengXiang Zhai, Prof.Dan Roth University of.
Click to Add Title A Systematic Framework for Sentiment Identification by Modeling User Social Effects Kunpeng Zhang Assistant Professor Department of.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 4 Designing Studies 4.2Experiments.
Mining information from social media
RESEARCH METHODS IN INDUSTRIAL PSYCHOLOGY & ORGANIZATION Pertemuan Matakuliah: D Sosiologi dan Psikologi Industri Tahun: Sep-2009.
11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Unsupervised Streaming Feature Selection in Social Media
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
A Connectivity-Based Popularity Prediction Approach for Social Networks Huangmao Quan, Ana Milicic, Slobodan Vucetic, and Jie Wu Department of Computer.
Connecting Users across Social Media Sites: A Behavioral- Modeling Approach Reza Zafarani and Huan Liu KDD’13 Presenter: Changqing Luo, Zhihao Cao, and.
Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon -Smit Shilu.
Personal Branding. Objectives How do you see yourself? How do others see you? What is your personal brand?
More than words: Social network’s text mining for consumer brand sentiments Expert Systems with Applications 40 (2013) 4241–4251 Mohamed M. Mostafa Reporter.
Evaluation Anisio Lacerda.
Big Data + Deep Learn = A Universal Solution?
Weakly Learning to Match Experts in Online Community
iSRD Spam Review Detection with Imbalanced Data Distributions
Statistical Data Analysis
An Introduction to Correlational Research
Presentation transcript:

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 1 Mining Social Media: Looking Ahead Huan Liu Joint work with DMML Members and Collaborators

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST Social Media Mining by Cambridge University Press

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 3 Social Media Data Is Big Social media is a key player in the Big Data era – We’re overwhelmed by the data and start appreciating data value – Data is ubiquitous, and – Social media data is a new species Big data will only become bigger – Rapid growth of linked data Big data is a good problem to have – And, many a time, big data may not be big enough

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 4 Traditional Media and Data Broadcast Media One-to-Many Communication Media One-to-One Traditional Data

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 5 Social Media: Many-to-Many Everyone can be a media outlet or producer Disappearing communication barrier Distinct characteristics – User generated content: Massive, dynamic, extensive, instant, and noisy – Rich user interactions: Linked data – Collaborative environment: Wisdom of the crowd – Many small groups: The long tail phenomenon; and – Attention is hard to get

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 6 Big Data, Big and New Challenges Big-Data Paradox – Lack of data with big social media data Trust vs. Distrust in Social Media – Is distrust information important? – Where can we find distrust information with “one- way” relations? Data Sample Sufficiency – Often we get a small sample of (still big) data. Would that sample suffice to obtain credible findings? – How could we verify it with limited sample data?

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST Collectively, social media data is indeed big Individually, however, the data is little – How much activity data do we generate daily? – How many posts did we post this week? – How many friends do we have? Our data also appears at various social media sites as we use them for different purposes – Facebook, Twitter, Instagram, YouTube, … When “big” social media data isn’t big enough, – Searching for more data with little data A Big-Data Paradox

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST Little data about an individual + Many social media sites - Partial Information + Complementary Information > Better User Profiles An Example Twitter LinkedIn Age Location Education Reza Zafarani N/A Tempe, AZ ASU Can we connect individuals across sites? Connectivity is not available Consistency in Information Availability Reza Zafarani and Huan Liu. ``Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", the Nineteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'2013), August , Chicago, Illinois.KDD'2013

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST Searching for More Data with Little Data

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 10 Behavior 1 Behavior 2 Behavior n Information Redundancy Feature Set 1 Feature Set 2 Feature Set n GeneratesGenerates Captured Via Learning Framework Data Identification Function A Behavioral Modeling Approach with Learning

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 11 Behaviors Human Limitation Time & Memory Limitation Knowledge Limitation Exogenous Factors Typing Patterns Language Patterns Endogenous Factors Personal Attributes & Traits Habits

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 12 59% of individuals use the same username Time and Memory Limitation

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 13 Identifying individuals by their vocabulary size Alphabet Size is correlated to language: Alphabet Size is correlated to language: शमंत कुमार -> Shamanth Kumar Knowledge Limitation

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 14 Adding Prefixes/Suffixes, Abbreviating, Swapping or Adding/Removing Characters Nametag and Gateman Usernames come from a language model Habits - old habits die hard

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 15 For each username: 414 Features Similar Previous Methods: 1)Zafarani and Liu, )Perito et al., 2011 Baselines: 1)Exact Username Match 2)Substring Match 3)Patterns in Letters Obtaining Features from Usernames

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 16 Many a time, big data may not be sufficiently big for a data mining task Gathering more data is often necessary for effective data mining Social media data provides unique opportunities to do so by using numerous sites and abundant user-generated content Traditionally available data can also be tapped to make “thin” data “thicker” Summary Reza Zafarani and Huan Liu. ``Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", SIGKDD, 2013.

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 17 New Challenges in Mining Social Media A Big-Data Paradox Trust vs. Distrust in Social Media Data Sample Sufficiency

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 18 Studying Distrust in Social Media Trust in Social Computing Incorporating Distrust Summary Introduction Applying Trust Representing Trust Measuring Trust WWW2014 Tutorial on Trust in Social Computing Seoul, South Korea. 4/7/14

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 19 Distrust in Social Sciences Distrust could be as important as trust Both trust and distrust may help a decision maker reduce the uncertainty and vulnerability associated with decisions Distrust may play an equally important, if not more, critical role as trust in consumer decisions

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 20 Understanding of Distrust from Social Sciences Distrust is the negation of trust ─ Low trust is equivalent to high distrust ─ The absence of distrust means high trust ─ Lack of the studying of distrust matters little Distrust is a new dimension of trust ─ Trust and distrust are two separate concepts ─ Trust and distrust can co-exist ─ A study ignoring distrust would yield an incomplete estimate of the effect of trust Jiliang Tang, Xia Hu, and Huan Liu. ``Is Distrust the Negation of Trust? The Value of Distrust in Social Media", 25th ACM Conference on Hypertext and Social Media (HT2014), Sept. 1-4, 2014, Santiago, Chile.HT2014

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 21 Distrust in Social Media Distrust is rarely studied in social media Challenge 1: Lack of computational understanding of distrust with social media data – Social media data is based on passive observations – Lack of some information social sciences use to study distrust Challenge 2: Distrust information is usually not publicly available – Trust is a desired property while distrust is an unwanted one for an online social community

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 22 Computational Understanding of Distrust Design computational tasks to help understand distrust with passively observed social media data  Task 1: Is distrust the negation of trust? – If distrust is the negation of trust, we can just use trust information alone  Task 2: Is distrust information valuable? – If distrust is a new dimension of trust, does distrust have added value The first step to understand distrust is to make distrust computable in trust models

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 23 Distrust in Trust Representations There are three major ways to incorporate distrust in trust representation – Considering low trust as distrust – Adding signs to trust values – Adding a dimension in trust representations 0 Trust Distrust 1 0 Trust Distrust 1 0 Trust 1 Distrust

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 24 An Illustration of Distrust in Trust Representations Considering low trust as distrust ─Weighted unsigned network Extending negative values ─Weighted signed network Adding another dimension ─Two-dimensional unsigned network (0.8,0) (1,0) (0,1) (1,0)

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 25 A Computational Understanding of Distrust Social media data is a new type of social data – Passively observed – Large scale Task 1: Is distrust the negation of trust? – Predicting distrust from only trust Task 2: Does distrust have added value on trust? – Predicting trust with distrust

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 26 Task 1: Is Distrust the Negation of Trust? If distrust is the negation of trust, low trust is equivalent to distrust and distrust should be predictable from trust Given the transitivity of trust, we resort to trust prediction algorithms to compute trust scores for pairs of users in the same trust network Distrust Low Trust Predicting Distrust Predicting Low Trust IF THEN ≡ ≡

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 27 Evaluation of Task 1  The performance of using low trust to predict distrust is consistently worse than randomly guessing  It fails to predict distrust with only trust; and distrust is not the negation of trust dTP: It uses trust propagation to calculate trust scores for pairs of users dMF: It uses the matrix factorization based predictor to compute trust scores for pairs of users dTP-MF: It is the combination of dTP and dMF using OR

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 28 Task 2: Can we predict Trust better with Distrust  If distrust is not the negation of trust, distrust may provide additional information about users, and could have added value beyond trust  We further check whether using both trust and distrust information can help achieve better performance than using only trust information  We thus add distrust propagation in trust propagation to incorporate distrust

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 29 Evaluation of Trust and Distrust Propagation  Incorporating distrust propagation into trust propagation can improve the performance of trust measurement  One step distrust propagation usually outperforms multiple step distrust propagation

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 30 Experimental Settings for Task 2 x% of pairs of users with trust relations are chosen as old trust relations and the remaining as new trust relations Task 2 predicts pairs of users P from as new trust relations The performance is computed as

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 31 Findings from the Computational Understanding Task 1 shows that distrust is not the negation of trust – Low trust is not equivalent to distrust Task 2 shows that trust can be better measured by incorporating distrust – Distrust has added value in addition to trust This computational understanding suggests that it is necessary to compute distrust in social media What is the next step of distrust research?

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 32 New Challenges in Mining Social Media A Big-Data Paradox Trust vs. Distrust in Social Media Data Sample Sufficiency

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 33 Twitter provides two main outlets for researchers to access tweets in real time: – Streaming API (~1% of all public tweets, free) – Firehose (100% of all public tweets, costly) Streaming API data is often used by researchers to validate hypotheses. How well does the sampled Streaming API data measure the true activity on Twitter? Sampling Bias in Social Media Data F. Morstatter, J. Pfeffer, H. Liu, and K. Carley. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API and Data from Twitter’s Firehose. ICWSM, 2013.

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 34 Facets of Twitter Data Compare the data along different facets Selected facets commonly used in social media mining: – Top Hashtags – Topic Extraction – Network Measures – Geographic Distributions

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 35 Preliminary Results Top HashtagsTopic Extraction No clear correlation between Streaming and Firehose data. Topics are close to those found in the Firehose. Network MeasuresGeographic Distributions Found ~50% of the top tweeters by different centrality measures. Graph-level measures give similar results between the two datasets. Streaming data gets >90% of the geotagged tweets. Consequently, the distribution of tweets by continent is very similar.

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 36 How are These Results? Accuracy of streaming API can vary with analysis performed These results are about single cases of streaming API Are these findings significant, or just an artifact of random sampling? How can we verify that our results indicate sampling bias or not?

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 37 Histogram of JS Distances in Topic Comparison This is just one streaming dataset against Firehose Are we confident about this set of results? Can we leverage another streaming dataset? Unfortunately, we cannot rewind after our dataset was collected using the streaming API

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 38 Verification Created 100 of our own “Streaming API” results by sampling the Firehose data.

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 39 Comparison with Random Samples

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 40 Comparative Results Top HashtagsTopic Extraction No correlation between Streaming and Firehose data. Not so in random samples Topics are close to those found in the Firehose. Topics extracted with the random data are significantly better

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 41 Summary Streaming API data could be biased in some facets Our results were obtained with the help of Firehose Without Firehose data, it’s challenging to figure out which facets might have bias, and how to compensate them in search of credible mining results F. Morstatter, J. Pfeffer, H. Liu, and K. Carley. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API and Data from Twitter’s Firehose. ICWSM, Fred Morstatter, Jürgen Pfeffer, Huan Liu. When is it Biased? Assessing the Representativeness of Twitter's Streaming API, WWW Web Science 2014.

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 42 For this great opportunity for sharing Acknowledgments – Grants from NSF, ONR, and ARO – DMML members and project leaders – Interdisciplinary collaborators Summary – A Big-Data Paradox – Studying Distrust in Social Media – Sampling Bias in Social Media Data THANK YOU …

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab December 12, 2014 HKUST 43 Big Data is a good problem to have There are many new problems and opportunties Together, we can harness it for science & engineering Concluding Remarks