Learning About Medicine by Applying Machine Learning to User Generated Content: The Case of Anorexia
Elad Yom-Tov, Microsoft Research Israel
Why medicine?
People use the Internet extensively:
– More than 77% of the USA population uses the Internet
– Every day, 55% of Americans use the Internet, for an average of two hours
– More than 80% of Internet users search for medical information online, and significant medically related activity happens on the Internet
Large-scale medical trials are expensive and time-consuming.
Making sense of Internet data requires processing large amounts of data to produce meaningful insights.
*Pew survey, 2010
A lifestyle choice?
– Thin is perfection, I'll die trying to achieve it
– Anorexia is a lifestyle, not a diet
– I only feel beautiful when I'm hungry
Data: Users
All users who posted at least two photographs with a relevant tag (thinspo, thinspiration, pro-ana)
– 162 users
All users who posted to eating disorder groups on Flickr
– 71 users
Users who commented on or favorited at least two of the above-mentioned photos
– 683 users
Data: Photos and links
Raw data:
– 543,891 photographs
– 2,229,489 comments
– 642,317 favorite markings
– 237,165 contact links
Labeling:
– Users were labeled on a 5-point scale. Kappa = 0.51 (p < 10^-5)
Tag similarity
Modeled users with a TF-IDF weighted bag-of-tags
Average cosine similarity:
– Pro-anorexia:
– Pro-recovery:
– Pro-recovery to pro-anorexia:
– ROC: 0.52
– Tag usage:
  thinspiration: 37% pro-anorexia, 7% pro-recovery
  pro-anorexia: 1.7% pro-anorexia, 2.4% pro-recovery
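The tag-similarity step can be illustrated with a minimal sketch. It assumes each user is represented by the list of tags on their photos, uses a plain log(N/df) IDF weighting, and runs on invented toy data; the slides do not say which TF-IDF variant was used, so the function names and weighting here are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(user_tags):
    """Map each user to a sparse TF-IDF vector over their tags.

    Assumption: IDF = log(N / df), where df is the number of users
    who used the tag at least once. The slides do not specify this.
    """
    n_users = len(user_tags)
    df = Counter(tag for tags in user_tags.values() for tag in set(tags))
    vectors = {}
    for user, tags in user_tags.items():
        tf = Counter(tags)
        vectors[user] = {
            tag: count * math.log(n_users / df[tag])
            for tag, count in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tag, 0.0) for tag, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy data, not taken from the study.
user_tags = {
    "user_a": ["thinspo", "thinspo", "skinny", "portrait"],
    "user_b": ["thinspo", "skinny", "mirror"],
    "user_c": ["recovery", "hope", "portrait"],
}
vectors = tfidf_vectors(user_tags)
sim_ab = cosine(vectors["user_a"], vectors["user_b"])
sim_ac = cosine(vectors["user_a"], vectors["user_c"])
```

In this toy example the two users who share "thinspo" and "skinny" come out more similar to each other than either is to the third user. Averaging such pairwise similarities within and across classes would give figures of the kind listed on the slide.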
Is exposing pro-anorexia users to pro-recovery comments an effective intervention?
Posted by... | Comments by PA | Comments by PR
PA | 61% | 46%
PR | 61% | 71%
Hazard model
Class: Pro-anorexia | Pro-recovery
All previous times:
– Number of photos
– Number of highly relevant photos
– Number of views
– Number of views of highly relevant photos
– Number of comments from same-class users
– Number of comments from other-class users
– Fraction of comments from same-class users
Recent:
– Number of photos
– Number of highly relevant photos
– Number of views
– Number of views of highly relevant photos
– Number of comments from same-class users
– Number of comments from other-class users
– Fraction of comments from same-class users
How do they get there?
Data
Toolbar data over a period of 5 months, in which we identified two types of behavior:
Celebrity queries
– Queries for one of 3,640 known celebrities
– Each celebrity was scored for the probability of appearing in conjunction with the word "anorexia"
– We refer to this probability as the Perceived Anorexia Score (PAS)
Anorexia queries
– We define anorexic activity searching (AAS) as one of the following:
1. Tips for proana or anorexia
2. How to … and proana or anorexia
3. Proana buddy
A total of 5,800,270 users searched for at least one celebrity in the top 2.5% of PAS, of which 3,615 also made AASs.
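A rough sketch of how the two query types might be computed is given below. The slide does not say how the PAS probability was estimated or how the AAS patterns were matched, so both are assumptions. Here PAS is taken as the fraction of queries naming a celebrity that also contain the word "anorexia", and the three AAS patterns are matched with simple substring and regex checks. The celebrity names and queries are placeholders.

```python
import re
from collections import defaultdict

# Assumption: the AAS patterns refer to the literal terms "proana" or "anorexia".
ANOREXIA_TERMS = re.compile(r"\b(proana|anorexia)\b")

def is_aas(query):
    """Return True if a query matches one of the three AAS patterns (as interpreted here)."""
    q = query.lower()
    if "proana buddy" in q:                      # pattern 3
        return True
    has_term = ANOREXIA_TERMS.search(q) is not None
    has_tips = re.search(r"\btips\b", q) is not None   # pattern 1
    has_how_to = "how to" in q                          # pattern 2
    return has_term and (has_tips or has_how_to)

def perceived_anorexia_scores(queries, celebrities):
    """Assumed PAS estimator: P(query contains 'anorexia' | query names the celebrity)."""
    mentions = defaultdict(int)
    with_anorexia = defaultdict(int)
    for query in queries:
        q = query.lower()
        for name in celebrities:
            if name in q:
                mentions[name] += 1
                if "anorexia" in q:
                    with_anorexia[name] += 1
    return {name: with_anorexia[name] / mentions[name] for name in mentions}

# Hypothetical placeholder data.
celebrities = ["celebrity one", "celebrity two"]
queries = [
    "celebrity one anorexia",
    "celebrity one new film",
    "celebrity two interview",
    "celebrity two tour dates",
]
pas = perceived_anorexia_scores(queries, celebrities)
```

With these placeholder queries, "celebrity one" gets a PAS of 0.5 and "celebrity two" a PAS of 0.0. Ranking celebrities by this score and keeping the top 2.5% would give the high-PAS set referred to on the slide.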
Clustering
Start with a matrix of users by celebrities
– 9,188,983 users by 3,640 celebrities
Cluster using k-means with cosine similarity
Clusters are statistically significant by PAS, but not by occupation.
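One common way to run k-means with cosine similarity is spherical k-means: normalize every row to unit length, assign each row to the centroid with the largest dot product, and renormalize the centroid means. The sketch below takes that approach on a tiny invented users-by-celebrities count matrix. Whether the authors used this exact variant is not stated, and the initialization by fixed row indices is purely for reproducibility.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(rows, init_indices, n_iter=20):
    """Cluster rows by cosine similarity; init_indices picks the starting centroids."""
    points = [normalize(r) for r in rows]
    centroids = [points[i] for i in init_indices]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment: on unit vectors, the dot product equals cosine similarity.
        labels = [
            max(range(len(centroids)), key=lambda c: dot(p, centroids[c]))
            for p in points
        ]
        # Update: mean of the members, projected back onto the unit sphere.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
                continue
            mean = [sum(col) / len(members) for col in zip(*members)]
            new_centroids.append(normalize(mean))
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical matrix: 4 users (rows) by 4 celebrities (columns), entries are search counts.
rows = [
    [5, 1, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 4, 1],
    [0, 1, 2, 2],
]
labels, centroids = spherical_kmeans(rows, init_indices=[0, 2])
```

On this toy matrix the first two users, who mostly search for the first celebrity, land in one cluster and the last two in the other. At the scale on the slide, a sparse matrix representation and a library implementation would be needed instead of plain lists.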
Hazard models
Attributes | Model 1: Weight (s.e.), Exp(weight) | Model 2: Weight (s.e.), Exp(weight)
Number of all searches | 1.4*10^-3 (5*10^-5), […] | […]*10^-3 (5*10^-5), 1.00
Number of celebrity searches | 1.5*10^-4 (0.011), […] | […]*10^-3 (0.011), N.S.
Number of searches for top PAS celebrities | […] (0.008), […] | […]*10^-2 (0.012), 1.07
Number of (unique) top PAS celebrities searched | […] (0.061), 1.65
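To read the Exp(weight) column, it may help to assume the models have a proportional-hazards form, in which each attribute multiplies the hazard by exp(weight) per unit increase. The slides do not state the model family, so this is an assumption. The short sketch below applies it to the 1.4*10^-3 weight reported for "Number of all searches"; the `hazard_ratio` helper is hypothetical.

```python
import math

def hazard_ratio(weight, delta=1.0):
    """Multiplicative change in hazard when an attribute increases by `delta`.

    Assumes a proportional-hazards form, hazard ~ exp(weight * x),
    which is an assumption and not stated on the slides.
    """
    return math.exp(weight * delta)

# Weight for "Number of all searches" as shown on the slide.
per_search = hazard_ratio(1.4e-3)             # one additional search
per_100_searches = hazard_ratio(1.4e-3, 100)  # one hundred additional searches
```

Under this assumption a single extra search changes the hazard by a factor that rounds to 1.00, while a hundred extra searches raise it by roughly 15%. Small per-unit weights can therefore still matter for attributes with a wide range.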
Adding the media effect
The Spearman correlation between the number of queries for a celebrity and the number of tweets was 0.63, so the bigger the peak (the media buzz), the more searches occur.
When focusing on queries and tweets which mentioned anorexia, this correlation is […].
AAS searchers were 1.9 times more likely to query for a high PAS celebrity in the days following a media peak compared to all other people, and 2.4 times more likely when the peak was associated with anorexia.
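For reference, the Spearman correlation is the Pearson correlation of the ranks of the two series. The sketch below computes it in plain Python with average ranks for ties. The query and tweet counts are invented for illustration and are not the study's data.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-celebrity counts, one entry per celebrity.
query_counts = [120, 45, 300, 80, 10]
tweet_counts = [200, 60, 150, 90, 5]
rho = spearman(query_counts, tweet_counts)
```

Because only ranks matter, the measure is insensitive to the heavy-tailed scale of query and tweet volumes, which is one plausible reason to prefer it over Pearson correlation for this kind of data.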
Hazard models revisited
Attributes | N = 1: Weight (s.e.), Exp(weight) | N = 7: Weight (s.e.), Exp(weight)
Number of all searches | 1.35*10^-3 (5.31*10^-5), […] | […]*10^-3 (5.31*10^-5), 1.00
Number of celebrity searches | -2.06*10^-3 (1.10*10^-2), N.S. | […]*10^-3 (1.11*10^-2), N.S.
Number of searches for top PAS celebrities | 3.24*10^-3 (1.10*10^-2), […] | […]*10^-3 (1.11*10^-2), 1.03
Number of (unique) top PAS celebrities searched | 0.61 (5.70*10^-2), […] | […] (0.06), 1.83
Peak in all Twitter activity | 0.29 (0.11), […] | […] (0.07), 1.33
Peak in Twitter activity related to anorexia | […] (0.13), N.S. | […] (0.10), 0.77
Why is this interesting?
Summary
As people spend ever more time on the Internet, they generate content which we can use to understand (and later, hopefully, improve) health and healthcare.
This content is especially useful when:
– People have less of an incentive to lie, compared to the real world
– Collecting data in the real world is hard
– Activity is largely web-driven
BUT: Making sense of so much data requires integrating machine learning research with medical practice.