Towards Detecting Influenza Epidemics by Analyzing Twitter Messages Aron Culotta Jedsada Chartree.

Presentation transcript:

Towards Detecting Influenza Epidemics by Analyzing Twitter Messages Aron Culotta Jedsada Chartree

Introduction Growing interest in monitoring disease outbreaks. Growing number of Twitter users - February, million tweets/day - June, million tweets/day (750 tweets/s) million users Source:

Introduction Twitter is a website that offers a social networking and microblogging service. - Users send and read messages called “tweets” (up to 140 characters)

Introduction Advantages of Twitter for this research - Full messages provide more information than search queries. - Twitter profiles contain more detail to analyze (city, state, gender, age). - Diversity of Twitter users.

Methodology Data - Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010). - The US Centers for Disease Control and Prevention (CDC) publishes ILI statistics through the US Outpatient Influenza-like Illness Surveillance Network (ILINet).

Methodology Ground-truth ILI rates obtained from the CDC statistics

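Methodology Regression Models 1. Simple linear regression: $\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \epsilon$, where $P$ = the proportion of the population exhibiting ILI symptoms; $\beta_1, \beta_2$ = the coefficients; $\epsilon$ = the error term; $Q(W, D) = |D_W| / |D|$ = the fraction of documents in $D$ that match $W$; $D$ = a document collection; $|D_W|$ = the document frequency for keyword set $W$; $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$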
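
The simple model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`fit_simple`, `predict`) are hypothetical, the fit uses ordinary least squares on the logit-transformed values, and the weekly numbers are made up rather than taken from the study.

```python
import math

def logit(x):
    """Log-odds transform; x must lie strictly between 0 and 1."""
    return math.log(x / (1.0 - x))

def inv_logit(z):
    """Inverse of logit, mapping a real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_simple(q_values, p_values):
    """Least-squares fit of logit(P) = b1 * logit(Q) + b2.

    q_values: weekly fraction of messages matching the keyword set.
    p_values: weekly ILI proportion (ground truth).
    """
    xs = [logit(q) for q in q_values]
    ys = [logit(p) for p in p_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b2 = mean_y - b1 * mean_x
    return b1, b2

def predict(q, b1, b2):
    """Predicted ILI proportion for a keyword fraction q."""
    return inv_logit(b1 * logit(q) + b2)

# Made-up weekly values for illustration only (not data from the study).
q_weekly = [0.010, 0.012, 0.015, 0.020, 0.018]
p_weekly = [0.020, 0.023, 0.027, 0.034, 0.031]
b1, b2 = fit_simple(q_weekly, p_weekly)
estimate = predict(0.016, b1, b2)
```

Fitting in logit space keeps every prediction inside (0, 1) once it is mapped back with `inv_logit`.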

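Methodology Regression Models 2. Multiple linear regression: $\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(\{w_1\}, D)) + \dots + \beta_k \operatorname{logit}(Q(\{w_k\}, D)) + \beta_{k+1} + \epsilon$, where $P$ = the proportion of the population exhibiting ILI symptoms; $\beta_1, \dots, \beta_{k+1}$ = the coefficients; $\epsilon$ = the error term; $Q(\{w_i\}, D) = |D_{w_i}| / |D|$ = the fraction of documents in $D$ that match $w_i$; $D$ = a document collection; $|D_{w_i}|$ = the document frequency for word $w_i$; $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$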
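
A rough sketch of the multiple-regression variant follows, assuming one logit-transformed feature per keyword plus an intercept, fitted by solving the least-squares normal equations. The helper names (`solve`, `fit_multiple`) are hypothetical, and the slide does not say which solver was actually used.

```python
import math

def logit(x):
    """Log-odds transform; x must lie strictly between 0 and 1."""
    return math.log(x / (1.0 - x))

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_multiple(q_rows, p_values):
    """Least-squares fit of logit(P) on one logit(Q) feature per keyword.

    q_rows[t][i] is the fraction of week-t messages matching keyword i.
    Returns [b_1, ..., b_k, b_{k+1}], with the intercept last.
    """
    design = [[logit(q) for q in row] + [1.0] for row in q_rows]
    target = [logit(p) for p in p_values]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve(xtx, xty)
```

With only ten weeks of data, the number of keywords has to stay well below the number of weeks, otherwise the normal equations become singular or the model overfits.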

Methodology Keyword Selection 1. Correlation coefficient - Used to evaluate a simple linear regression model. 2. Residual sum of squares (RSS) - Measures the discrepancy between the data and the model's estimates.
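
Both selection criteria are standard statistics. The sketch below shows how each could be computed; the function names are hypothetical, and the correlation is assumed to be the Pearson coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rss(observed, predicted):
    """Residual sum of squares: total squared gap between data and model."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))
```

A keyword is more attractive when its fitted model has a correlation closer to 1 or a smaller RSS.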

Methodology Keyword Generation 1. Hand-chosen keywords (flu, cough, sore throat, headache) 2. Most frequent keywords - Search for all documents containing any of the hand-chosen keywords. - Find the top 5,000 most frequently occurring words.
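
The "most frequent keywords" step could look roughly like the following. It is a simplified sketch: the function name is hypothetical, matching is a plain case-insensitive substring test, and tokenization is whitespace splitting, none of which is specified on the slide.

```python
from collections import Counter

def frequent_words(messages, seed_keywords, top_n=5000):
    """Return the top_n most frequent words among messages that mention a seed keyword."""
    seeds = [k.lower() for k in seed_keywords]
    counts = Counter()
    for text in messages:
        lowered = text.lower()
        # Simplification: substring matching, so "flu" also matches "influenza".
        if any(seed in lowered for seed in seeds):
            counts.update(lowered.split())
    return [word for word, _ in counts.most_common(top_n)]

# Toy messages for illustration only.
messages = [
    "I have the flu",
    "Flu season again",
    "nice weather today",
    "sore throat and cough",
]
seeds = ["flu", "cough", "sore throat", "headache"]
candidates = frequent_words(messages, seeds, top_n=10)
```

In practice, stop words and punctuation would likely need handling before the counts are meaningful.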

Methodology Document Filtering - Apply logistic regression to predict whether a Twitter message is reporting an ILI symptom. $y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise); $x_i = \{x_{ij}\}$, where $x_{ij}$ = the number of times word $j$ appears in document $i$
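
A minimal bag-of-words logistic regression classifier along these lines might look as follows. The training loop (plain stochastic gradient ascent), the bias term, the learning rate, and the toy vocabulary are all assumptions made for illustration; the slide does not describe how the model was trained.

```python
import math
from collections import Counter

def featurize(text, vocab):
    """x_i: number of times each vocabulary word appears in the message."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, weights, bias):
    """Estimated probability that the message reports an ILI symptom."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

def train(xs, ys, epochs=500, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = y - score(x, weights, bias)
            weights = [w + lr * err * v for w, v in zip(weights, x)]
            bias += lr * err
    return weights, bias

def predict(x, weights, bias):
    """Label 1 (positive) when the estimated probability is at least 0.5."""
    return 1 if score(x, weights, bias) >= 0.5 else 0

# Toy labelled messages for illustration only.
vocab = ["flu", "fever", "bieber", "concert"]
docs = [
    ("i have the flu and a fever", 1),
    ("flu fever all week", 1),
    ("bieber concert tonight", 0),
    ("concert was great", 0),
]
xs = [featurize(text, vocab) for text, _ in docs]
ys = [label for _, label in docs]
weights, bias = train(xs, ys)
```

Messages the classifier labels as negative would then be dropped before the keyword fractions are computed.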

Classification evaluation - Accuracy - Precision - Recall - F-measure
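
For reference, the four measures can be computed from the confusion counts as sketched below. The function name is hypothetical, and the F-measure is assumed to be the balanced F1 score (harmonic mean of precision and recall).

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Precision penalizes unrelated messages that slip through the filter, while recall penalizes genuine ILI reports that are discarded.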

Results Document Filtering Evaluation of message classification, with standard error in parentheses

Results Regression The 10 different systems evaluated

Results Regression The correlation coefficient (r), residual sum of squares (RSS), and standard error of each system

Results Results for multi-hand-rss(2) Results for classification-hand

Results Results for multi-freq-rss(3) Results for simple-hand-rss(1)

Results Correlation results for simple-hand-rss and multi-hand-rss Correlation results for simple-hand-corr and multi-hand-corr

Results Correlation results for simple-freq-rss and multi-freq-rss Correlation results for simple-freq-corr and multi-freq-corr

Conclusion Several methods are presented to identify influenza-related messages. A number of regression models are compared to correlate the messages with CDC statistics. The best model achieves a correlation of 0.78.