Data Preparation – Project 3: Part II. Steve Qian He, Prof. Carolina Ruiz, CS 548 – Data Mining.


1 Data Preparation – Project 3: Part II
Steve Qian He
Prof. Carolina Ruiz
CS 548 – Data Mining

2 Overview
Project Description, Data Collection, Data Preprocessing, Data Transformation, Results

3 Project Description
Find the set of words used in the diabetes domain.
Find the associations between the words in this set.
Example words: diabetes, sugar, chocolate, insulin, ice cream, wound

4 Overview
Project Description, Data Collection, Data Preprocessing, Data Transformation, Results

5 Data Collection
Source: Twitter Public Timeline
Format: JSON, XML, RSS, Atom
Tool: The Archivist Desktop Version
(You can do this with any programming language, but please don’t waste your time reinventing the wheel…)

6 Data Collection
Just in case you want to REINVENT it… Libraries for different programming languages:
C++: Twitcurl
Java: Twitter4J
.NET: Twitterizer
PHP: TmhOAuth
Python: Tweepy
Ruby: Twitter
Perl: Net::Twitter
Objective-C: MGTwitterEngine

7 Data Collection
Just in case you want to REINVENT the tools for REINVENTING the wheel…
1. Parse XML (RSS, Atom) or JSON in your language of choice.
2. Follow the Twitter API resource documentation: https://dev.twitter.com/docs

8 Kept “The Archivist” running on my lab computer for about 7 days (Mar. 20 – Mar. 27).

9 Data Collection
Collected 40,545 tweets from Twitter with the keyword “diabetes”.

10 Overview
Project Description, Data Collection, Data Preprocessing, Data Transformation, Results

11 Data Preprocessing
Why do we need to preprocess the data before importing it into Weka?
Weka doesn’t understand the file format.
We only care about the “instance” (tweet) and the “attribute” (word).
We need to manually pick the words (attributes) that make sense in the diabetes domain.

12 Data Preprocessing
1. Choose the high-frequency words:
“t co http rt a to in the de of and for i you with type is have la s y that it terlalu new has el may en weight juan my diet what on que loss can risk me surgery study at he manis some now this or un does d be para - your blood kicks”
2. Remove the obvious noise tweets:
“Juan has 40 chocolate bars. He eats 35. What does Juan have now? Diabetes. Juan has diabetes.” (cold joke…)
“Weight-loss surgery may stem diabetes in some – Two new clinical trials show that patients…” (pop news)
A “word” here means a substring of a tweet delimited by any of the characters “!\"#$%&'()*+,-./ :; r\f€‚‚„„”. My bad, we have better ways to do this: java.util.StringTokenizer does not do a good job here. Please use java.util.regex instead.
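The regex-based tokenization recommended on this slide could be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the class name TweetWords, the rule that a word is a run of letters (optionally with internal apostrophes), and the choice to count tweets per word rather than total occurrences are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and the word rule are assumptions, not from the slides.
final class TweetWords {
    // Assumption: a word is a run of Unicode letters, optionally with
    // internal apostrophes (e.g. "don't"). Digits and punctuation separate words.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)*");

    /** Splits one tweet into lower-cased words using java.util.regex. */
    static List<String> tokenize(String tweet) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(tweet.toLowerCase(Locale.ROOT));
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    /** For each word, counts the number of tweets that contain it at least once. */
    static Map<String, Integer> tweetCounts(List<String> tweets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tweet : tweets) {
            Set<String> distinct = new HashSet<>(tokenize(tweet));
            for (String word : distinct) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With this word rule a shortened link such as http://t.co/abc splits into “http”, “t”, “co” and “abc”, which would explain why fragments like “t co http rt” head the high-frequency list above. Sorting the counts and reading off the top entries gives the candidate list to pick from by hand.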

13 Data Preprocessing
31,621 tweets remained after filtering out obvious noise.
31 meaningful words were selected from 150 words (with minimum frequency 0.1):
“surgery weight loss alert risk health treatment bariatric high help medicine disease remission record test study heart combat diet sugar chocolate obesity con pro blood leading learn support diabetic stroke patient”
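The slides do not say how the noise tweets were filtered out. One simple possibility, shown here only as a sketch, is to drop every tweet that contains a known marker phrase. The class NoiseFilter and the marker list passed to it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: substring matching is an assumed filtering strategy.
final class NoiseFilter {
    /** Keeps only tweets that contain none of the marker phrases (case-insensitive). */
    static List<String> removeNoise(List<String> tweets, List<String> noiseMarkers) {
        List<String> kept = new ArrayList<>();
        for (String tweet : tweets) {
            String lower = tweet.toLowerCase(Locale.ROOT);
            boolean noisy = false;
            for (String marker : noiseMarkers) {
                if (lower.contains(marker.toLowerCase(Locale.ROOT))) {
                    noisy = true;
                    break;
                }
            }
            if (!noisy) {
                kept.add(tweet);
            }
        }
        return kept;
    }
}
```

For example, a marker such as “juan has” would remove every copy of the chocolate-bar joke while leaving unrelated tweets untouched. A marker list like this has to be maintained by hand, which is the limitation the “detection tool” wish on the Future Work slide points at.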

14 Overview
Project Description, Data Collection, Data Preprocessing, Data Transformation, Results

15 Data Transformation
Generate a Weka file (.arff).
Instance: a tweet
Attribute: a selected word in the tweet
e.g. “Experimental study suggests lack of sleep may pose risk for development of Diabetes.”

16 Weka File
@relation diet weight risk surgery
0,0,1,0
0,0,0,0
0,1,0,0
0,0,0,0
…
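The transformation into an .arff file could look roughly like the sketch below. An ARFF file has a @relation line, one @attribute line per column, and a @data section with one comma-separated row per instance. The class name ArffSketch, the relation name passed in, and the choice of nominal {0,1} attributes are assumptions made for illustration; the slides do not show the full file header.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; attribute names are expected to be lower-case words.
final class ArffSketch {
    /** Builds ARFF text with one 0/1 column per selected word and one row per tweet. */
    static String toArff(String relation, List<String> attributes, List<String> tweets) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String attribute : attributes) {
            // Assumption: nominal {0,1} values, so association-rule learners can use them.
            sb.append("@attribute ").append(attribute).append(" {0,1}\n");
        }
        sb.append("\n@data\n");
        for (String tweet : tweets) {
            // Split on anything that is not a letter; a leading empty token is harmless.
            Set<String> words = new HashSet<>(Arrays.asList(
                    tweet.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")));
            for (int i = 0; i < attributes.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(words.contains(attributes.get(i)) ? '1' : '0');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

With the attributes diet, weight, risk, surgery, the example tweet from the previous slide (“…may pose risk for development of Diabetes.”) becomes the row 0,0,1,0, which matches the first data row shown above.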

17 Overview
Project Description, Data Collection, Data Preprocessing, Data Transformation, Results

18 Results
Results I got from Weka:
1. surgery loss ==> weight conf:(1) lift:(10.41) lev:(0.04) [1141]
2. bariatric ==> surgery conf:(0.95) lift:(7.68) lev:(0.03) [901]
3. loss ==> weight conf:(0.94) lift:(9.77) lev:(0.08) [2451]
4. surgery weight ==> loss conf:(0.92) lift:(10.03) lev:(0.04) [1137]
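For readers unfamiliar with the conf, lift and lev columns, the sketch below computes them from raw counts using their usual definitions for a rule A ==> B. The class RuleMetrics is hypothetical, and the counts in the usage note are made-up toy numbers, not values from this data set.

```java
// Hypothetical helper illustrating the usual definitions of the rule metrics.
final class RuleMetrics {
    /** confidence(A ==> B) = count(A and B) / count(A) */
    static double confidence(int countA, int countAB) {
        return (double) countAB / countA;
    }

    /** lift(A ==> B) = confidence(A ==> B) / P(B), where P(B) = count(B) / n */
    static double lift(int n, int countA, int countB, int countAB) {
        return confidence(countA, countAB) / ((double) countB / n);
    }

    /** leverage(A ==> B) = P(A and B) - P(A) * P(B) */
    static double leverage(int n, int countA, int countB, int countAB) {
        return (double) countAB / n - ((double) countA / n) * ((double) countB / n);
    }
}
```

As a toy example, with 1,000 tweets of which 100 contain A, 200 contain B and 80 contain both, the confidence is 0.8, the lift is 4 and the leverage is 0.06. A lift near 10, as in the rules above, means the right-hand word appears about ten times more often in tweets matching the left-hand side than in tweets overall.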

19 Future Work
Word segmentation: “I like ice cream.” or “I like ice cream.”
Polysemy: “I'm sick of sugar!” and “For people with diabetes, being sick can also affect blood sugar levels.”
I need a “cold joke & news” detection tool!!

20 Thanks Q & A

