# Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model

## Presentation transcript

School of Computer Science, Carnegie Mellon University; Athens University of Economics & Business. Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model. Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos. PAKDD, 15-17 April 2013, Gold Coast, Australia.

CMU AUEB Questions we answer. Patterns: if Bob executes task x for n_x times, how many times does he execute task y? Modeling: which 2-d distribution fits 2-d clouds of points? [Figure: scatter plot of task counts, with example user 'Smith' (100 calls, 700 sms).] © Danai Koutra (CMU) - PAKDD'13

Let's peek at our contributions. Patterns: power laws between competing tasks; log-logistic distributions for many tasks. Modeling: Almond-DG distribution for 2-d real datasets. Practical use: spot outliers; what-if scenarios. [Figure: axes ln(comments), ln(tweets).]

Roadmap: Data; Observed Patterns; Related Work; Proposed Distribution; Goodness of Fit; Conclusions.

Data 1: Tencent Weibo. Micro-blogging website in China; 2.2 million users. Tasks extracted: tweets, retweets, comments, mentions, followees.

Data 2: Phonecall dataset. Phone-call records; 3.1 million users. Tasks extracted: calls, messages, voice friends, SMS friends, total minutes of phonecalls.

Pattern 1: SuRF, Super Linear Relative Frequency (1). [Figure: axes ln(tweets), ln(retweets), with example user 'Smith' (1100 retweets, 7 tweets); annotated value 0.23.] Logarithmic binning fit [Akoglu'10]: 15 log buckets; E[Y|X=x] per bucket; linear regression on the conditional means.
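The logarithmic binning fit listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `least_squares` and `log_binning_fit`, the equal-width buckets in ln(x), and the use of the bucket midpoint as the x-coordinate are all assumptions made here; only the three steps (15 log buckets, conditional mean per bucket, linear regression on those means) come from the slide.

```python
import math


def least_squares(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two non-empty buckets")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def log_binning_fit(xs, ys, n_buckets=15):
    """Fit ln E[Y | X in bucket] against the bucket midpoint in ln(x).

    Buckets have equal width in ln(x) (an assumption of this sketch).
    Returns (slope, intercept) of the fitted line in log-log space.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0]
    log_lo = math.log(min(x for x, _ in pairs))
    log_hi = math.log(max(x for x, _ in pairs))
    width = (log_hi - log_lo) / n_buckets
    if width == 0:
        raise ValueError("all x values are identical")
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for x, y in pairs:
        k = min(int((math.log(x) - log_lo) / width), n_buckets - 1)
        sums[k] += y
        counts[k] += 1
    # One point per usable bucket: (midpoint in ln x, ln of the conditional mean of y).
    points = [
        (log_lo + (k + 0.5) * width, math.log(sums[k] / counts[k]))
        for k in range(n_buckets)
        if counts[k] > 0 and sums[k] > 0
    ]
    return least_squares(points)


# Synthetic check: y = 3 * x**1.18 on a grid that is evenly spaced in ln(x).
xs = [math.exp(0.006 * i) for i in range(1500)]
ys = [3.0 * x ** 1.18 for x in xs]
slope, intercept = log_binning_fit(xs, ys)
```

On this noise-free synthetic input the recovered slope is close to the exponent 1.18 used to generate the data. Averaging y within each bucket before taking logs is what keeps a few extreme users from dominating the fit.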

Pattern 1: SuRF (2). [Figure: axes ln(tweets), ln(comments); annotated value 0.304.] Corr coeff: ++. Intuition: 2x tweets, 4x comments.

Pattern 1: SuRF (3). [Figure: axes ln(tweets), ln(mentions); annotated value 0.33.] Corr coeff: ++.

Pattern 1: SuRF (4). [Figure: axes ln(followees), ln(retweets); annotated value 0.25.] Corr coeff: ++.

Pattern 1: SuRF (5). Super-linear relationship: more calls, even more minutes. [Figure: axes ln(calls_no), ln(total_mins); annotated value 1.18.] Corr coeff: ++.
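As a worked example of what "super-linear" means here, assume the 1.18 annotated on the plot is the slope s of the fitted line in log-log space (the transcript does not label it, so this reading is an assumption). A straight line in log-log space is a power law, and doubling x multiplies y by 2^s:

$$
\ln y = s \ln x + c \;\Longrightarrow\; y = e^{c} x^{s}, \qquad \frac{y(2x)}{y(x)} = 2^{s}, \qquad 2^{1.18} \approx 2.27 .
$$

Under that reading, a user who places twice as many calls accumulates roughly 2.27 times as many total minutes, which is more than double.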

Pattern 1: SuRF (6a). [Figure: axes ln(calls_no), ln(voice_friends).] 2x friends, 3x phonecalls.

Pattern 1: SuRF (6b). [Same figure, with a group of points annotated "Telemarketers?".]

Pattern 1: SuRF (7). [Figure: axes ln(sms_friends), ln(sms_no).] 2x friends, 5x sms.

Contributions revisited (1). Patterns: power laws between competing tasks; log-logistic distributions for many tasks. Modeling: Almond-DG distribution for 2-d real datasets. Practical use: spot outliers; what-if scenarios. [Figure: axes ln(comments), ln(tweets).]

Pattern 2: log-logistic marginals (1). Marginal PDF of retweets: NOT a power law. [Figure: axes ln(retweets), ln(frequency).]

Pattern 2: log-logistic marginals (2). Marginal PDF of comments: NOT a power law. [Figure: axes ln(comments), ln(frequency).]

Pattern 2: log-logistic marginals (3). Marginal PDF of mentions: power law. [Figure: axes ln(mentions), ln(frequency).]

Contributions revisited (2). Patterns: we observe power law relationships between competing tasks and log-logistic distributions for many tasks. Modeling: we propose the Almond-DG distribution for fitting 2-d real-world datasets. Practical use: spot outliers; what-if scenarios.

Roadmap: Data; Observed Patterns; Proposed Distribution (Problem Definition; Almond-DG; Background: copulas); Goodness of Fit; Conclusions.

Problem definition. Given: a cloud of points. Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency.

Solutions in the literature? Multivariate Logistic [Malik & Abraham, 1973]; Multivariate Pareto Distribution [Mardia, 1962]; Triple Power Law [Akoglu et al., 2012], a bivariate distribution for modeling reciprocity in phonecall networks. BUT none of them captures both the 2-d marginals AND the dependency / correlation!

STEP 1: How to model the marginal distributions? A: Log-logistic! Q: Why? A: Because it mimics Pareto, captures the top concavity, and matches reality. [Figure: marginal PDF; axes ln(retweets), ln(frequency).]

Reminder: Log-logistic (1) [background]. "The longer you survive the disease, the even longer you survive." Not memoryless. 2 parameters: scale (α) and shape (β). [Figure legend: α = 1; the β values were not captured in the transcript.]

Reminder: Log-logistic (2) [background]. In log-log scales, the PDF looks like a hyperbola. Blank out the top concavity and what remains looks like a power law.

Fact: Log-logistic (3) [background]. The log-odds plot is linear. [Figure: axes ln(mentions), ln(odds), with the odds computed from Prob(X ≤ x); fitted α = 2.07, β = 1.27.]
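The linearity of the log-odds plot follows directly from the log-logistic CDF. The derivation below assumes the standard parameterization with scale α and shape β; the slides name these two parameters but the transcript does not show the formula, so the exact form is an assumption.

$$
F(x) = \Pr(X \le x) = \frac{1}{1 + (x/\alpha)^{-\beta}}, \qquad
\text{odds}(x) = \frac{F(x)}{1 - F(x)} = \left(\frac{x}{\alpha}\right)^{\beta}, \qquad
\ln \text{odds}(x) = \beta \ln x - \beta \ln \alpha .
$$

So ln(odds) is a straight line in ln(x) with slope β and intercept −β ln α. The same CDF also gives the Pareto-like tail mentioned earlier: for x much larger than α, 1 − F(x) ≈ (x/α)^{−β}.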

STEP 2: How to model the dependency? A: We borrow an idea from survival models, financial risk management, and decision analysis: the COPULA!

Copulas in a nutshell [background]. Copulas model the dependence between random variables (e.g., X and Y are the counts of two tasks). They create a multivariate distribution such that the marginals are preserved and the correlation (+, -, none) is captured.
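In formula form, the "marginals preserved, dependence captured" idea is Sklar's theorem: a bivariate CDF can be written as a copula C applied to the two marginal CDFs. The notation F_X, F_Y, and C is introduced here for illustration and does not appear in the transcript.

$$
F(x, y) = C\big(F_X(x),\, F_Y(y)\big), \qquad C : [0,1]^2 \to [0,1], \qquad C(u, 1) = u, \quad C(1, v) = v .
$$

The marginals F_X and F_Y can be chosen freely (here, log-logistic), while C alone determines how the two variables move together.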

STEP 2: Which copula? A: Among the many copulas, Gumbel's copula.

Applications of Gumbel's copula [background]. It has been used to model the dependence between loss and lawyer's fees in order to calculate reinsurance premiums; the rainfall frequency as a joint distribution of volume, peak, duration, etc.; …

Gumbel's copula: Example 1 [background]. Uniform marginals; no dependence.

Gumbel's copula: Example 2 [background]. Skewed marginals; no correlation.

Gumbel's copula: Example 3 [background]. Skewed marginals; ρ = 0.7.

Problem definition, revisited. Given: a cloud of points. Find: a 2-d PDF, f(x,y), that captures (a) the marginals ✔ and (b) the dependency ✔.

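Almond distribution [the formula itself was not captured in the transcript], where θ = (1 − ρ)^(−1) captures the dependence and ρ is Spearman's coefficient. [Figure: two rows of panels, labeled ρ = 0, 0.4, 0.7 and ρ = 0, 0.2, 0.7; the α and β values of each row appear only as "?" in the transcript.]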
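Since the formula on this slide did not survive extraction, the following is a reconstruction under stated assumptions rather than the slide's own text: it takes the standard Gumbel copula and plugs in log-logistic marginals with the standard CDF F(x) = 1 / (1 + (x/α)^(−β)). The subscripted parameters α_x, β_x, α_y, β_y are notation chosen here.

$$
C_\theta(u, v) = \exp\!\Big(-\big[(-\ln u)^{\theta} + (-\ln v)^{\theta}\big]^{1/\theta}\Big), \qquad \theta \ge 1,
$$

$$
F(x, y) = \exp\!\Big(-\Big[\big(\ln(1 + (x/\alpha_x)^{-\beta_x})\big)^{\theta} + \big(\ln(1 + (y/\alpha_y)^{-\beta_y})\big)^{\theta}\Big]^{1/\theta}\Big).
$$

With ρ = 0 the slide's mapping gives θ = 1, and the expression factors into the product of the two marginal CDFs, i.e., independence, which is consistent with the "no dependence" panels.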

Almond-DG. If (X, Y) ~ Almond, then (floor(X), floor(Y)) ~ Almond-DG, where X ≥ 1 and Y ≥ 1; i.e., we discretize the values of Almond and reject the pairs with either X = 0 or Y = 0.
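The discretize-and-reject step can be sketched end to end as below. Everything beyond "floor both coordinates and reject pairs with a zero" is an assumption of this sketch: the Gumbel copula is sampled with the Marshall-Olkin frailty construction and Kanter's representation of a positive stable variable (a standard technique, not necessarily the authors' generator), the marginals use the standard log-logistic inverse CDF, and all function names and the demo parameters are illustrative.

```python
import math
import random


def open_uniform(rng):
    """Uniform draw from the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def log_positive_stable(a, rng):
    """ln S, where S >= 0 has Laplace transform exp(-t**a), 0 < a <= 1 (Kanter)."""
    if a >= 1.0:
        return 0.0  # degenerate case S = 1, i.e. independence
    t = math.pi * open_uniform(rng)
    w = -math.log(open_uniform(rng))
    return (math.log(math.sin(a * t)) - math.log(math.sin(t)) / a
            + (1.0 - a) / a * (math.log(math.sin((1.0 - a) * t)) - math.log(w)))


def gumbel_copula_pair(theta, rng):
    """One (u, v) draw from a Gumbel copula, theta >= 1 (Marshall-Olkin)."""
    a = 1.0 / theta
    log_v = log_positive_stable(a, rng)
    draws = []
    for _ in range(2):
        log_e = math.log(-math.log(open_uniform(rng)))  # ln of an Exp(1) draw
        draws.append(math.exp(-math.exp(a * (log_e - log_v))))
    return draws[0], draws[1]


def loglogistic_ppf(u, alpha, beta):
    """Inverse CDF of the log-logistic; u is clamped just below 1 for safety."""
    u = min(u, 1.0 - 1e-12)
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)


def sample_almond_dg(n, theta, alpha_x, beta_x, alpha_y, beta_y, seed=0):
    """Draw n integer pairs: floor the continuous draw, reject if either is < 1."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u, v = gumbel_copula_pair(theta, rng)
        x = math.floor(loglogistic_ppf(u, alpha_x, beta_x))
        y = math.floor(loglogistic_ppf(v, alpha_y, beta_y))
        if x >= 1 and y >= 1:
            out.append((x, y))
    return out


def log_correlation(pairs):
    """Pearson correlation of (ln x, ln y)."""
    lx = [math.log(x) for x, _ in pairs]
    ly = [math.log(y) for _, y in pairs]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((p - mx) * (q - my) for p, q in zip(lx, ly))
    sxx = sum((p - mx) ** 2 for p in lx)
    syy = sum((q - my) ** 2 for q in ly)
    return sxy / math.sqrt(sxx * syy)


# Illustrative parameters: rho = 0.7 via theta = 1 / (1 - rho); the alpha and
# beta values are borrowed from the log-odds slide and reused for both marginals.
pairs = sample_almond_dg(2000, theta=1.0 / (1.0 - 0.7),
                         alpha_x=2.07, beta_x=1.27,
                         alpha_y=2.07, beta_y=1.27, seed=42)
```

The stable variable is handled in log space so that large values of θ do not overflow. Rejection changes the marginals slightly relative to the continuous log-logistic, since the mass below 1 is discarded.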

Contributions revisited (3). Patterns: we observe power laws between competing tasks and log-logistic distributions for many tasks. Modeling: Almond-DG distribution for 2-d real datasets. Practical use: spot outliers; what-if scenarios.

Synthetic Data Generation. Parameter estimation for the log-logistic marginals: traditionally maximum likelihood estimation (MLE) or the method of moments (MOM); proposed: the log-odd plot, which gives the 2 parameters from the intercept and slope of the fitted line. Copula-based generation: 1 parameter, the dependence θ. Evaluation is hard even for 1-d skewed distributions [Chakrabarti, 2006]. [Figure: axes ln(mentions), ln(odds).]
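The log-odd plot estimation can be sketched as a line fit on the empirical log-odds. This is a minimal illustration under assumptions: the empirical CDF uses an n + 1 denominator to keep the odds finite, the line is fitted by ordinary least squares, and the function name `fit_loglogistic_log_odds` is hypothetical. With the standard log-logistic CDF, the slope of the line is β and the intercept is −β ln α.

```python
import math
from collections import Counter


def fit_loglogistic_log_odds(samples):
    """Estimate (alpha, beta) from the line ln(odds) = beta*ln(x) - beta*ln(alpha)."""
    values = [v for v in samples if v > 0]
    n = len(values)
    counts = Counter(values)
    points = []
    seen = 0
    for v in sorted(counts):
        seen += counts[v]
        cdf = seen / (n + 1)  # empirical Prob(X <= v), kept strictly below 1
        points.append((math.log(v), math.log(cdf / (1.0 - cdf))))
    # Ordinary least squares on the (ln x, ln odds) points.
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    alpha = math.exp(-intercept / beta)
    return alpha, beta


# Deterministic check: exact log-logistic quantiles for alpha = 2.07, beta = 1.27.
n = 999
quantiles = [(i + 1) / (n + 1) for i in range(n)]
data = [2.07 * (u / (1.0 - u)) ** (1.0 / 1.27) for u in quantiles]
alpha_hat, beta_hat = fit_loglogistic_log_odds(data)
```

On these exact quantiles the fit returns the generating parameters up to floating-point error. On real count data the extreme tail points are noisy, so weighting or trimming them may be worth considering; that choice is outside what the slide specifies.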

Goodness of Fit (1a). Check 1 ✔: marginal PDFs, real data vs. synthetic data. [Figures: axes ln(comments), ln(frequency) and ln(mentions), ln(frequency).]

Goodness of Fit (1b). Checks 2 ✔ and 3 ✔: contour plots and conditional means (SuRF), real data vs. synthetic data. [Figures: axes ln(mentions), ln(comments).]

Goodness of Fit (2a). Check 1 ✔: marginal PDFs, real data vs. synthetic data. [Figures: axes ln(retweets), ln(frequency) and ln(tweets), ln(frequency).]

Goodness of Fit (2b). Checks 2 ✔ and 3 ✔: contour plots and conditional means (SuRF), real data vs. synthetic data. [Figures: axes ln(retweets), ln(tweets).]

Conclusions. Patterns: discovery of power laws between competing tasks and of log-logistic distributions for many tasks. Modeling: Almond-DG, which explains (i) super-linearity, (ii) marginals, and (iii) conditionals in real 2-d data. Practical use: anomaly detection; what-if scenarios. [Figure: axes ln(comments), ln(tweets).]

Thank you! (Almond-DG)

Backup slides. Why Gumbel? It fits, it has been used in the past, and it is parsimonious (θ, α, β).

Why are we interested in these questions? We can answer what-if scenarios and spot anomalies.

… but. Power laws: although prevalent, they do not always hold in real data. Logistic & log-logistic distributions: no earlier work provides a 2-d distribution that explains the patterns found in the real datasets, i.e., super-linearity + log-logistic marginals.

Goodness of Fit: mentions vs. comments. Evaluation is hard even for univariate skewed distributions.

STEP 2: How to model the dependency? With copulas … and specifically, Gumbel's copula!

Copulas formally [background]. [Author note on slide: add simple cases for Gumbel's copula + properties of it…]
