Towards Detecting Influenza Epidemics by Analyzing Twitter Messages. Aron Culotta. Jedsada Chartree.


1 Towards Detecting Influenza Epidemics by Analyzing Twitter Messages Aron Culotta Jedsada Chartree

2 Introduction Growing interest in monitoring disease outbreaks. Growing number of Twitter users - February 2010: 50 million tweets/day - June 2010: 65 million tweets/day (750 tweets/s) - 190 million users Source: http://en.wikipedia.org/wiki/Twitter

3 Introduction Twitter is a website that offers a social networking and micro-blogging service. - Users send and read messages called “tweets” (up to 140 characters)

4 Introduction Advantages of Twitter for this research - Full messages provide more information than search queries. - Twitter profiles contain more detail to analyze (city, state, gender, age). - Twitter users are diverse.

5 Methodology Data - Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010) - Ground truth: statistics published by the US Centers for Disease Control and Prevention (CDC) from the US Outpatient Influenza-like Illness Surveillance Network (ILINet)

6 Methodology The ground-truth ILI rates obtained from the CDC statistics

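7 Methodology Regression Models 1. Simple linear regression
$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \epsilon$$
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \beta_2$ = the coefficients
- $\epsilon$ = the error term
- $Q(W, D) = |D_W| / |D|$ = the fraction of documents in $D$ that match $W$
- $D$ = a document collection
- $|D_W|$ = the document frequency for $W$ (the number of documents in $D$ that match $W$)
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$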
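A minimal sketch of the simple regression, assuming the model has the form logit(P) = b1 * logit(Q(W, D)) + b2 + error and is fitted by ordinary least squares. The function names and the weekly numbers are illustrative assumptions, not data or code from the study, and the substring matching is only a stand-in for whatever matching the authors used.

```python
import math

def logit(x):
    """Log-odds of a proportion x in the open interval (0, 1)."""
    return math.log(x / (1.0 - x))

def inv_logit(z):
    """Inverse of logit: maps a real number back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))

def keyword_fraction(documents, keywords):
    """Q(W, D): fraction of documents containing at least one keyword.

    Uses naive substring matching, so "flu" also matches "influence".
    """
    matches = sum(1 for doc in documents
                  if any(k in doc.lower() for k in keywords))
    return matches / len(documents)

def fit_simple(xs, ys):
    """Ordinary least squares for y = b1 * x + b2; returns (b1, b2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b2 = mean_y - b1 * mean_x
    return b1, b2

# Made-up weekly values: keyword fractions Q and ILI proportions P.
q_weekly = [0.010, 0.014, 0.020, 0.017, 0.012]
p_weekly = [0.015, 0.019, 0.026, 0.022, 0.017]

b1, b2 = fit_simple([logit(q) for q in q_weekly],
                    [logit(p) for p in p_weekly])

# Map fitted values back from log-odds to proportions.
predicted = [inv_logit(b1 * logit(q) + b2) for q in q_weekly]
```

Fitting in logit space keeps every prediction strictly between 0 and 1 once it is mapped back with `inv_logit`.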

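8 Methodology Regression Models 2. Multiple linear regression
$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(w_1, D)) + \dots + \beta_k \operatorname{logit}(Q(w_k, D)) + \beta_{k+1} + \epsilon$$
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \dots, \beta_{k+1}$ = the coefficients
- $\epsilon$ = the error term
- $Q(w_i, D) = |D_{w_i}| / |D|$ = the fraction of documents in $D$ that match $w_i$
- $D$ = a document collection
- $|D_{w_i}|$ = the document frequency for word $w_i$
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$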
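A sketch of the multiple regression, assuming one logit-transformed fraction per keyword plus an intercept, fitted by least squares through the normal equations. The solver, the function names, and the weekly numbers are illustrative assumptions, not taken from the study.

```python
import math

def logit(x):
    """Log-odds of a proportion x in the open interval (0, 1)."""
    return math.log(x / (1.0 - x))

def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_multiple(rows, ys):
    """Least squares for y = b_1*x_1 + ... + b_k*x_k + b_{k+1}.

    Returns [b_1, ..., b_k, b_{k+1}], with the intercept last.
    """
    design = [list(r) + [1.0] for r in rows]  # trailing 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up weekly fractions for two keywords, and ILI proportions.
q_weekly = [(0.010, 0.004), (0.014, 0.005), (0.020, 0.009),
            (0.017, 0.006), (0.012, 0.006)]
p_weekly = [0.015, 0.019, 0.026, 0.022, 0.017]

coefficients = fit_multiple([[logit(a), logit(b)] for a, b in q_weekly],
                            [logit(p) for p in p_weekly])
```

The normal equations are enough for a handful of keywords; with many correlated keywords a numerically more stable solver (for example a QR-based least-squares routine) would be a safer choice.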

9 Methodology Keyword Selection 1. Correlation coefficient - Evaluates each keyword with a simple linear regression model 2. Residual sum of squares (RSS) - Measures the discrepancy between the data and the estimated model
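The two selection criteria can be sketched as follows, assuming each candidate keyword has a weekly series that is scored against the ground-truth series, and that the top k keywords are kept. The function names and the numbers are hypothetical; the series are taken as already transformed (for example with logit) by the caller.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rss_simple_fit(xs, ys):
    """Residual sum of squares of the least-squares line y = b1*x + b2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def select_keywords(series, target, k, criterion="corr"):
    """Return the k best keywords: highest correlation or lowest RSS."""
    if criterion == "corr":
        ranked = sorted(series, key=lambda w: pearson_r(series[w], target),
                        reverse=True)
    elif criterion == "rss":
        ranked = sorted(series, key=lambda w: rss_simple_fit(series[w], target))
    else:
        raise ValueError("criterion must be 'corr' or 'rss'")
    return ranked[:k]

# Made-up weekly series per keyword and a made-up ground-truth series.
target = [1.0, 2.0, 3.0, 2.5, 1.5]
series = {
    "flu": [2.0, 4.0, 6.0, 5.0, 3.0],
    "cough": [1.0, 2.5, 2.0, 3.0, 1.0],
    "today": [3.0, 1.0, 2.0, 1.0, 3.0],
}
best_by_corr = select_keywords(series, target, k=2, criterion="corr")
best_by_rss = select_keywords(series, target, k=2, criterion="rss")
```

A constant series makes both scores undefined (division by zero), so such keywords would need to be dropped before ranking.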

10 Methodology Keyword Generation 1. Hand-chosen keywords (flu, cough, sore throat, headache) 2. Most frequent keywords - Search for all documents containing any of the hand-chosen keywords. - Find the top 5,000 most frequently occurring words in those documents.
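The second generation step could look roughly like this, assuming plain substring matching for the seed keywords and a simple regular-expression tokenizer. Both choices, and the function name, are assumptions for illustration; the slide does not specify them.

```python
import re
from collections import Counter

# The four hand-chosen keywords listed on the slide.
HAND_CHOSEN = ("flu", "cough", "sore throat", "headache")

def candidate_keywords(documents, seeds=HAND_CHOSEN, top_n=5000):
    """Return the top_n most frequent words among documents that
    contain at least one seed keyword."""
    counts = Counter()
    for doc in documents:
        text = doc.lower()
        if any(seed in text for seed in seeds):
            # Tokenize into lowercase alphabetic words (apostrophes kept).
            counts.update(re.findall(r"[a-z']+", text))
    return [word for word, _ in counts.most_common(top_n)]

# Made-up messages for illustration.
docs = [
    "Flu flu fever",
    "Sunny day",
    "Cough and fever",
]
top_words = candidate_keywords(docs, top_n=2)
```

Very common function words will dominate such a list, so a stop-word filter is a natural extension if the candidates are meant to be symptom-related.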

11 Methodology Document Filtering - Apply logistic regression to predict whether a Twitter message is reporting an ILI symptom. $y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise) $x_i = \{x_{ij}\}$, where $x_{ij}$ = the number of times word $j$ appears in document $i$
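A small sketch of such a filter: a bag-of-words logistic regression trained with stochastic gradient descent. The training procedure, hyperparameters, whitespace tokenizer, and example messages are all assumptions made for illustration; the slide only states that logistic regression over word counts is used.

```python
import math
from collections import Counter

def features(text):
    """x_i: word counts for one message (whitespace tokenization)."""
    return Counter(text.lower().split())

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(x, weights, bias):
    """Linear score: bias plus the weighted sum of word counts."""
    return bias + sum(weights.get(w, 0.0) * c for w, c in x.items())

def train(docs, labels, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            x = features(doc)
            error = sigmoid(score(x, weights, bias)) - y
            bias -= lr * error
            for w, c in x.items():
                weights[w] = weights.get(w, 0.0) - lr * error * c
    return weights, bias

def predict(text, weights, bias):
    """Estimated probability that the message reports an ILI symptom."""
    return sigmoid(score(features(text), weights, bias))

# Made-up labeled messages: 1 = reports a symptom, 0 = does not.
train_docs = [
    "i have the flu", "bad cough and fever", "home sick with flu",
    "lovely sunny day", "watching the game tonight", "new phone arrived today",
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(train_docs, train_labels)
p_sick = predict("i have a bad flu", weights, bias)
p_other = predict("sunny game day", weights, bias)
```

Messages whose predicted probability exceeds a chosen threshold (0.5 is a common default) would be kept as symptom reports before the regression step.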

12 Methodology

13 Classification evaluation - Accuracy - Precision - Recall - F-measure
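For reference, the four measures can be computed from the confusion counts as below, taking F-measure to be the balanced F1 score (an assumption, since the slide does not state the weighting). The function name and the labels are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up gold labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Precision and recall are reported as 0.0 when their denominators are zero, which is one common convention rather than the only one.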

14 Results Document Filtering Evaluation of message classification, with standard error in parentheses

15 Results Regression The 10 different systems evaluated

16 Results Regression The correlation coefficient (r), residual sum of squares (RSS), and standard error of each system

17 Results Results for multi-hand-rss(2) Results for classification-hand

18 Results Results for multi-freq-rss(3) Results for simple-hand-rss(1)

19 Results Correlation results for simple-hand-rss and multi-hand-rss Correlation results for simple-hand-corr and multi-hand-corr

20 Results Correlation results for simple-freq-rss and multi-freq-rss Correlation results for simple-freq-corr and multi-freq-corr

21 Conclusion Several methods were explored to identify influenza-related messages. A number of regression models were compared for correlating the messages with CDC statistics. The best model achieves a correlation of 0.78.

