1
B. Aditya Prakash Department of Computer Science
Leveraging Propagation for Data Mining: Models, Algorithms, Applications. B. Aditya Prakash, Department of Computer Science. Social Computing Workshop, ARL, Sept 28, 2016
2
Dynamical Processes over networks are also everywhere!
Prakash 2016
3
Why do we care? ........ Social collaboration Information Diffusion
Viral Marketing Epidemiology and Public Health Cyber Security Human mobility Games and Virtual Worlds Ecology Prakash 2016
4
Why do we care? (1: Epidemiology)
Dynamical Processes over networks [AJPH 2007] CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts Diseases over contact networks Prakash 2016
5
Why do we care? (1: Epidemiology)
Dynamical Processes over networks. Each circle is a hospital: ~3000 hospitals, more than 30,000 patients transferred [US-MEDICARE NETWORK 2005]. Problem: Given k units of disinfectant, whom to immunize?
6
Why do we care? (1: Epidemiology)
~6x fewer! [US-MEDICARE NETWORK 2005] CURRENT PRACTICE OUR METHOD Hospital-acquired inf. took 99K+ lives, cost $5B+ (all per year) Prakash 2016
7
Why do we care? (2: Online Diffusion)
> 800m users, ~$1B revenue [WSJ 2010] ~100m active users > 50m users Prakash 2016
8
Why do we care? (2: Online Diffusion)
Dynamical Processes over networks Buy Versace™! Celebrity Followers Social Media Marketing Prakash 2016
9
Why do we care? (3: To change the world?)
Dynamical Processes over networks Social networks and Collaborative Action Prakash 2016
10
High Impact – Multiple Settings
epidemic out-breaks Q. How to squash rumors faster? Q. How do opinions spread? Q. How to market better? products/viruses transmit s/w patches Prakash 2016
11
Large real-world networks & processes
Research Theme ANALYSIS Understanding POLICY/ ACTION Managing DATA Large real-world networks & processes Prakash 2016
12
Research Theme – Social Media
ANALYSIS # cascades in future? POLICY/ ACTION How to market better? DATA Modeling Tweets spreading Prakash 2016
13
Research Theme – Public Health
ANALYSIS Will an epidemic happen? POLICY/ ACTION How to control out-breaks? DATA Modeling # patient transfers Prakash 2016
14
Large real-world networks & processes
In this talk Using propagation for _________ Q1: Syndromic Surveillance Q2: Memes, Tweets, Blogs Q3: Summarization & Communities. Applications Large real-world networks & processes Prakash 2016
15
Applications Using propagation for _________
Q1: Syndromic Surveillance Q2: Memes, Tweets, Blogs Q3: General Graph Mining Prakash 2016
16
Surveillance How to estimate and predict flu trends?
[Chen et al. ICDM 2014] Population survey, hospital record, lab survey, surveillance report
17
GFT & Twitter Estimate flu trends using online electronic sources
18
Flu forecasting Twitter – a surrogate for flu forecasting?
Google Flu Trends: using keywords to track the flu season Can we get more specific? Consider: Prakash 2016
19
“Propagation” ideas: Can we develop better disease surveillance tools by leveraging (1) how flu-related information propagates on Twitter and (2) epidemiological models?
20
Observation 1: States. There are different states in an infection cycle. SEIR model: 1. Susceptible 2. Exposed 3. Infected 4. Recovered
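To make the four SEIR states concrete, here is a minimal sketch of a standard deterministic SEIR simulation with Euler steps. It is a textbook formulation, not code from the talk; the function name and the rates beta, sigma, gamma are assumptions chosen for illustration.

```python
def simulate_seir(S, E, I, R, beta, sigma, gamma, steps, dt=0.1):
    """Euler-step a textbook SEIR model (all parameter values are hypothetical).

    beta: transmission rate, sigma: rate of E -> I, gamma: rate of I -> R.
    Returns a list of (S, E, I, R) tuples, one per step, including the start.
    """
    N = S + E + I + R
    history = [(S, E, I, R)]
    for _ in range(steps):
        new_exposed = beta * S * I / N * dt    # S -> E
        new_infected = sigma * E * dt          # E -> I
        new_recovered = gamma * I * dt         # I -> R
        S -= new_exposed
        E += new_exposed - new_infected
        I += new_infected - new_recovered
        R += new_recovered
        history.append((S, E, I, R))
    return history


# Example run with made-up numbers: 990 susceptible, 10 infected.
trajectory = simulate_seir(990.0, 0.0, 10.0, 0.0,
                           beta=0.5, sigma=0.2, gamma=0.1, steps=1000)
```

The total population stays constant because every transition only moves people from one state to the next.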
21
Observation 2: Ep. & So. Gap
Infection cases drop exponentially in epidemiology (Hethcote 2000) Keyword mentions drop in a power-law pattern in social media (Matsubara 2012) Prakash 2016
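The gap between the two decay patterns can be written out as follows. This is a sketch: the symbols I(t) (infection cases), m(t) (keyword mentions), λ and α are assumptions introduced here, not notation from the talk.

```latex
% Epidemiology: exponential decay, a straight line on a log-linear plot
I(t) \propto e^{-\lambda t}
\quad\Longrightarrow\quad
\log I(t) = c_1 - \lambda t

% Social media: power-law decay, a straight line on a log-log plot
m(t) \propto t^{-\alpha}
\quad\Longrightarrow\quad
\log m(t) = c_2 - \alpha \log t
```

For large t the power law decays far more slowly than the exponential, so raw keyword counts keep a heavy tail after actual cases have fallen off.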
22
Flu Forecasting Using combination of propagation patterns, develop a hidden flu-state topic model Learn “flu” vocabulary and transition probabilities Prakash 2016
23
HFSTM Model Details Hidden Flu-State from Tweet Model (HFSTM)
Each word (w) in a tweet (Oi) can be generated by: A background topic Non-flu related topics State related topics Latent state Initial prob. Transit. switch Binary non-flu related switch Transit. prob. Binary background switch Word distribution Prakash 2016
24
HFSTM Model Details Generating tweets Generate the state for a tweet
Generate the topic for a word State: [S,E,I] Topic: [Background, Non-flu, State] S: This restaurant is really good E: The movie was but it freezing I: I think have flu
25
Inference Details
EM-based algorithm: HFSTM-FIT. E-step: $A_t(i) = P(O_1, O_2, \dots, O_t, S_t = i)$, $B_t(i) = P(O_{t+1}, \dots, O_{T_u} \mid S_t = i)$, $\gamma_t(i) = P(S_t = i \mid O_u)$. M-step: other parameters such as state transition probabilities, topic distributions, etc. Parameters learned:
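The E-step quantities have the same form as the forward, backward and posterior variables of a hidden Markov model. The sketch below is a generic, unscaled forward-backward pass for a plain discrete HMM. It is not HFSTM-FIT, which also handles topic switches. In the code, alpha, beta and gamma correspond to A_t(i), B_t(i) and γ_t(i), and the names init, trans, emit are assumptions.

```python
import numpy as np


def forward_backward(init, trans, emit, obs):
    """Generic HMM forward-backward (unscaled; fine for short sequences).

    init:  (K,)   initial state probabilities
    trans: (K, K) trans[i, j] = P(S_{t+1} = j | S_t = i)
    emit:  (K, V) emit[i, o]  = P(O_t = o | S_t = i)
    obs:   sequence of observed symbol ids
    """
    T, K = len(obs), len(init)
    alpha = np.zeros((T, K))   # alpha[t, i] = P(O_1..O_t, S_t = i)
    beta = np.ones((T, K))     # beta[t, i]  = P(O_{t+1}..O_T | S_t = i)

    alpha[0] = init * emit[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]

    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])

    gamma = alpha * beta       # proportional to P(S_t = i | O_1..O_T)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return alpha, beta, gamma
```

An M-step would then re-estimate the transition and emission parameters from these posteriors. For long sequences the recursions should be rescaled or done in log space to avoid underflow.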
26
A possible issue with HFSTM
Suffers from large, noisy vocabulary. Semi-supervision for improvement Introduce weak supervision into HFSTM. Prakash 2016
27
HFSTM-A Details HFSTM-A(spect)
[Chen et al. DAMI 2015] Introduce an aspect variable y, expressing our belief on whether a word is flu-related or not. The value of y biases the switch variables s.t. flu-related words are more likely to be explained by state topics. When the aspect value (y) is introduced, the switching probabilities are updated accordingly.
28
Vocabulary & Dataset Vocabulary (230 words): Dataset (34,000 tweets):
Flu-related keyword list by Chakraborty SDM 2014. Extra state-related keyword list. Dataset (34,000 tweets): Identify infected users and collect their tweets. Train on data from Jun 20, 2013-Aug 06, 2013. Test on two time periods: Dec 01, July 08, 2013 Nov 10, 2013-Jan 26, 2014
29
Learned word distributions
The most probable words learned in each state. Probably healthy: S. Having symptoms: E. Definitely sick: I.
30
Learned state transition
Transition probabilities Transition in real tweets Learned by HFSTM: Not directly flu-related, yet correctly identified Prakash 2016
31
Flu trend fitting Ground-truth: Algorithms:
Ground-truth: The Pan American Health Organization (PAHO). Algorithms: Baseline: Count the number of keywords weekly as features, and regress to the ground-truth curve. Google Flu Trends: Take the Google Flu Trends data as input, and regress to the PAHO curve. HFSTM: Distinguish the different states of keywords, and only use the number of keywords in the I state. Again regress to PAHO.
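All three methods share the same last step: a linear regression from weekly features to the ground-truth case counts. A minimal sketch of that step with ordinary least squares follows. The function names and the numbers in the example are made up for illustration and are not data from the talk.

```python
import numpy as np


def fit_trend(features, target):
    """Least-squares fit of target ~ intercept + features (hypothetical helper)."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef  # coef[0] is the intercept, coef[1:] the feature weights


def predict_trend(coef, features):
    """Apply fitted coefficients to new weekly features."""
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef


# Synthetic example: weekly counts of keywords assigned to the I state,
# and a made-up case-count curve that is exactly linear in those counts.
weekly_i_counts = np.array([3.0, 5.0, 9.0, 14.0, 10.0, 6.0])
case_counts = 2.0 * weekly_i_counts + 1.0
coef = fit_trend(weekly_i_counts, case_counts)
```

The only thing that changes between the methods is the feature column: raw keyword counts for the baseline, Google Flu Trends values, or I-state keyword counts for HFSTM.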
32
Flu trend fitting Linear regression to the case count reported by PAHO (the ground-truth) Prakash 2016
33
HFSTM-A Results are qualitatively similar to HFSTM, when the vocabulary is 10 times larger. Prakash 2016
34
Applications Using propagation for _________
Q1: Syndromic Surveillance Q2: Memes, Tweets, Blogs Q3: General Graph Mining Prakash 2016
35
Memetracking Memes – a virally transmitted cultural symbol or social idea (first coined by Richard Dawkins in 1976) Usually text (a phrase) and/or an image A viral meme from 2012 Olympics All the way to the White House Prakash 2016
36
Patterns Anomaly Imputation Compression Extrapolation Prakash 2016
37
Google Search Volume
e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date. [Figure: search volume over time, marked (1) first spike, (2) release date, (3) two weeks before release; later values unknown (?).]
38
Rise and fall patterns in social media
Meme (# of mentions in blogs) short phrases Sourced from U.S. politics in 2008 “you can put lipstick on a pig” “yes we can” Prakash 2016
39
Rise and fall patterns in social media
Can we find a unifying model, which includes these patterns? four classes on YouTube [Crane et al. ’08] six classes on Meme [Yang et al. ’11] Prakash 2016
40
Rise and fall patterns in social media
Answer: YES! We can represent all patterns by single model In Matsubara+ SIGKDD 2012 Prakash 2016
41
Main idea - SpikeM
1. Un-informed bloggers (uninformed about rumor). 2. External shock at time n_b (e.g., breaking news). 3. Infection (word-of-mouth). Timeline: n=0, n=n_b, n=n_b+1. Infectiveness of a blog post at age n: β = strength of infection (quality of news), multiplied by a decay function (how infective a blog posting is), which follows a power law.
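The three ingredients can be combined into a small simulation. This is a simplified sketch in the spirit of SpikeM, without the periodicity term. The decay f(n) = β · n^(-1.5) uses the -1.5 power-law slope mentioned in this deck, and the function name, the parameter names N, S_b, eps and all values are assumptions.

```python
import numpy as np


def spikem_base(N, beta, n_b, S_b, eps, T):
    """Simplified SpikeM-style rise-and-fall simulation (no periodicity).

    N: total bloggers, beta: strength of infection, n_b: time of the
    external shock, S_b: size of the shock, eps: background noise,
    T: number of time steps. Returns dB, where dB[n] is the number of
    newly informed bloggers at step n.
    """
    U = np.zeros(T + 1)    # un-informed bloggers
    dB = np.zeros(T + 1)   # newly informed bloggers per step
    S = np.zeros(T + 1)    # external shock, non-zero only at n_b
    U[0] = N
    if 0 <= n_b <= T:
        S[n_b] = S_b
    for n in range(T):
        pressure = 0.0
        for t in range(n_b, n + 1):
            age = n + 1 - t                        # age of posts from time t
            pressure += (dB[t] + S[t]) * beta * age ** -1.5
        new = min(U[n] * pressure + eps, U[n])     # cannot exceed those left
        dB[n + 1] = new
        U[n + 1] = U[n] - new
    return dB


# Hypothetical parameters: 1000 bloggers, shock of size 10 at step 5.
activity = spikem_base(N=1000, beta=0.001, n_b=5, S_b=10, eps=0.0, T=50)
```

Activity is zero until the shock, rises through word-of-mouth, and falls once few un-informed bloggers remain and older posts lose infectiveness.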
42
-1.5 slope. J. G. Oliveira et al. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005). (also in Leskovec, McGlohon+, SDM 2007)
43
SpikeM - with periodicity
Details: full equation of SpikeM, with periodicity. Bloggers change their activity over time (e.g., daily, weekly, yearly). [Figure: activity vs. time n; peak activity at 12pm, low activity at 3am.]
44
Tail-part forecasts SpikeM can capture tail part Prakash 2016
45
“What-if” forecasting
e.g., given (1) first spike, (2) release date of two sequel movies (3) access volume before the release date (1) First spike (2) Release date (3) Two weeks before release ? ? Prakash 2016
46
“What-if” forecasting
SpikeM can forecast not only tail-part, but also rise-part! SpikeM can forecast upcoming spikes (1) First spike (2) Release date (3) Two weeks before release Prakash 2016
47
Bonus: Protest Predictions
[Sundereisan et al. ASONAM 2014] [Jin et al. SIGKDD 2014] Can Twitter provide a lead time? South American Twitter dataset. Language: Spanish/Portuguese. Idea: Look for trending keywords. Predict event type for protest using SpikeM parameters! Event types: Violent Protest (VP), Non Violent Protest (P). [Figure: a political tweet; events labeled VP or P.]
48
Propagation and Cyber-Security: Temporal Patterns
[Papalexakakis et al. ASONAM 2013] Looks familiar?
49
Propagation and Cyber-Security: Ensemble Models
[Chan et al. WSDM 2016]
50
Applications Using propagation for _________
Q1: Syndromic Surveillance Q2: Memes, Tweets, Blogs Q3: General Graph Mining Prakash 2016
51
Example 1: Missing data correction
52
Real data is noisy! ? ? We don’t know who exactly are infected
Epidemiology: public-health surveillance (CDC, lab, hospital). Only a proportion of the truly infected people are found to be infected (false negatives); this is typical of CDC surveillance. Surveillance Pyramid [Nishiura+, PLoS ONE 2011]: each level has a certain probability to miss some truly infected people. [Figure: CNN headlines; surveillance pyramid with uncertain cases marked "?".]
53
Real data is noisy! Correcting missing data is by itself very important Social Media Twitter: due to the uniform samples [Morstatter+, ICWSM 2013], the relevant ‘infected’ tweets may be missed Tweets Missing ? Sampled Tweets ? Missing Sampling Prakash 2016
54
The Problem
[Sudareisan, Vreeken, Prakash SDM 2015] [Rozenshtein et al. SIGKDD 2016] GIVEN: Graph G(V, E) from historical data; infected set D ⊆ V, sampled (p%) and incomplete; infectivity β of the virus. FIND: Seed set, i.e. patient zeros/culprits; set C- (the missing infected nodes); ripple R (the order of infections).
55
Visualizing Performance (Grid connected)
Methods compared: NetSleuth, Simulation, Frontier, NetFill (each panel shows seeds and missing nodes). Setup: two seeds, then sampling. Legend: Correct, FP, FN, Seeds, Infected
56
Meme-Tracker– case study
96,000 node graph for the meme “State of the economy” Found missing websites like “ “chicagotribune.com” and some blog posts. 2008 crash. Powerful. Recovered nodes not in dataset. Prakash 2016
57
Example 2: “Zoom-out” of the network
“Zoom-out” of the cascade graph to get a quick picture (= summarization): coarsen the big graph into a smaller representation of the network. [Figure: big graph with nodes A-F and its coarsened version.] [Purohit, Prakash, et al. SIGKDD 2014]
58
CoarseNet: algorithm
Step 1: Compute scores for all edge pairs. Step 2: Merge the nodes with the smallest score. Step 3: Go to step 1 until αn nodes are left. [Figure: assigning scores and merging edges; original network (weight=0.5) and coarsened network.]
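The score-merge-repeat loop can be sketched as below. This is an illustration, not the CoarseNet implementation: the score used here (change in the largest adjacency eigenvalue when an edge is contracted) and the weight-averaging merge rule are stand-in assumptions, and the brute-force recomputation only suits tiny graphs.

```python
import numpy as np


def top_eigenvalue(A):
    """Largest eigenvalue of a symmetric weighted adjacency matrix."""
    return float(np.max(np.linalg.eigvalsh(A)))


def merge_pair(A, a, b):
    """Contract nodes a < b into a; weights are averaged (assumed rule)."""
    M = A.copy()
    M[a, :] = (A[a, :] + A[b, :]) / 2.0
    M[:, a] = M[a, :]
    M[a, a] = 0.0
    keep = [i for i in range(A.shape[0]) if i != b]
    return M[np.ix_(keep, keep)]


def coarsen(A, alpha):
    """Greedily merge the lowest-score edge until alpha * n nodes remain."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    target = max(1, int(np.ceil(alpha * n)))
    groups = [[i] for i in range(n)]       # original nodes per supernode
    while A.shape[0] > target:
        lam = top_eigenvalue(A)
        best = None
        # Step 1: score every existing edge by the eigenvalue change.
        for a in range(A.shape[0]):
            for b in range(a + 1, A.shape[0]):
                if A[a, b] > 0:
                    score = abs(top_eigenvalue(merge_pair(A, a, b)) - lam)
                    if best is None or score < best[0]:
                        best = (score, a, b)
        if best is None:                   # no edges left to merge
            break
        # Step 2: merge the pair with the smallest score, then repeat.
        _, a, b = best
        A = merge_pair(A, a, b)
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return A, groups
```

With α = 0.5, a 6-node graph is reduced to 3 supernodes, and `groups` records which original nodes each supernode contains.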
59
Application 1: Influence Maximization
Methodology: Step 1: Coarsen the large social network using CoarseNet. Step 2: Solve influence maximization on the coarsened network. Step 3: Randomly select one node from each selected “supernode”. We call it CSPIN.
60
Application 2: Diffusion Characterization
Goal: use graph coarsening to understand information cascades. Dataset: Flixster, a friendship network with movie ratings. Cascade: the same movie rating from friends. Methodology: coarsen the network using CoarseNet with the reduction factor α=0.5; study the formed groups (supernodes). Can get non-network surrogates. Purohit, Prakash, Kang, Zhang, Subrahmanian 2014
61
Diffusion observation
Stats: 1891 groups mean group size: 16.6 the largest group: nodes (roughly 40% of nodes) Observation 1: a very large fraction of movies propagate in a small number of groups Observation 2: a multi-modal distribution Prakash 2016
62
Things I won’t talk about Theory
Fundamental Models Understanding Prakash 2016
63
Main questions 1. When will a virus take-off on a network? [ICDM 2012] 2. What happens if the networks vary with time? [PKDD 2010] 3. What happens if multiple viruses compete? (‘winner-takes-all’) [WWW 2012] Prakash 2016
64
More… 3. Interacting viruses Phase Transition for co-existence vs extinction [SIGKDD 2012] 4. Composite Networks (e.g. communication vs power-grid networks) depends on the networks [IEEE J. on Selected Areas in Comm. (JSAC) 2013] vs Prakash 2016
65
Managing/Manipulating
Algorithms Policy/Action Managing/Manipulating Prakash 2016
66
Alg 1: Immunization (= Interventions)
Different Flavors: Pre-emptive Data-aware Prakash 2016
67
Immunizations as Network manipulation
Node based [Tong, P., + ICDM 2010] Edge-based [Tong, P., + CIKM 2012, Best Paper Award] Edge-Manipulation [P., Adamic+ SDM 2013] Prakash 2016
68
Latest results: First (provable) approximation algorithms for the edge-based problem [Saha, Adiga, P., Vullikanti SDM 2015]. O(log^2 n)-factor (can be improved to O(log n)), based on the idea of removing closed walks. Semi-definite programming, rounding-based: O(1) factor.
69
Data-aware Immunization
[Zhang and Prakash, SDM 2014; Zhang and Prakash, TKDD 2015] Given: Graph and infected nodes. Find: ‘best’ nodes for immunization. Complexity: NP-hard; hard to approximate within an absolute error. DAVA-tree: optimal solution on the tree. DAVA and DAVA-fast: merge infected nodes, build a “dominator tree”, and run DAVA-tree. Running time: subquadratic. DAVA: O(k(|E| + |V| log |V|)). DAVA-fast: O(|E| + |V| log |V|). [Figure: graph with infected nodes and its dominator tree.]
70
Extensions Can be extended to Uncertain and noisy initial data as well! [Zhang and Prakash, CIKM 2014] Twitter Firehose API 1% sample Prakash 2016
71
Group-based Immunization
[Zhang, Adiga, Vullikanti, Prakash, 2015] How to select groups to minimize the epidemic? Epidemiology: people are grouped by ages, demographics, occupations … Social Media: friends are grouped by the same interests, e.g., Facebook pages. Results: First approximation algorithms for the problem.
72
Large real-world networks & processes
Conclusion: Theme ANALYSIS Understanding POLICY/ ACTION Managing DATA Large real-world networks & processes Prakash 2016
73
Scalability – Big Data Need scalable algorithms for
Datasets of unprecedented scale High dimensionality and sample size! Need scalable algorithms for Learning Models Developing Policy Leverage parallel systems Map-Reduce clusters (like Hadoop) for data-intensive jobs (more than 6000 machines) Parallelized compute-intensive simulations (like Condor) Prakash 2016
74
Effect on Community Structure
Example: Twitter network, where tweets diffuse over the followee-follower network. Roles: users who boost the diffusion ("bridges/media nodes") and influential users ("kernels"). [Figure: original network; communities detected by Newman's algorithm; ideal communities and roles of nodes: kernel, media, ordinary nodes.] Cannot capture different roles in diffusion!
75
Summarization and Segmentation
Automatic segmentation? Segment cascades? ……. Prakash 2016
76
Extensions Temporal graphs Noisy data
Incorporating Richer Attributed graphs Heterogeneous graphs …. Prakash 2016
77
Propagation on Networks
Biology Theory & Algo. Physics Comp. Systems Propagation on Networks Social Science ML & Stats. Econ. Prakash 2016
78
Acknowledgements Collaborators Christos Faloutsos
Roni Rosenfeld, Michalis Faloutsos, Lada Adamic, Theodore Iwashyna (M.D.), Dave Andersen, Tina Eliassi-Rad, Iulian Neamtiu, Varun Gupta, Jilles Vreeken, V. S. Subrahmanian John Brownstein (M.D.) Deepayan Chakrabarti, Hanghang Tong, Kunal Punera, Ashwin Sridharan, Sridhar Machiraju, Mukund Seshadri, Alice Zheng, Lei Li, Polo Chau, Nicholas Valler, Alex Beutel, Xuetao Wei Prakash 2016
79
Acknowledgements Students Liangzhe Chen Shashidhar Sundereisan
Benjamin Wang Yao Zhang Sorour Amiri Bijaya Adhikari Prakash 2016
80
Acknowledgements Funding Prakash 2016
81
Propagation for Data Mining
B. Aditya Prakash Analysis Policy/Action Data Prakash 2016