Published by Jacoby Fourman. Modified over 3 years ago.

1
Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model
Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos
School of Computer Science, Carnegie Mellon University; Athens University of Economics & Business
PAKDD, 15-17 April 2013, Gold Coast, Australia

2
CMU AUEB
Questions we answer (1)
Patterns: If Bob executes task x n_x times, how many times does he execute task y?
Modeling: Which 2-d distribution fits 2-d clouds of points?
[Figure: example point ‘Smith’ (100 calls, 700 sms)]
© Danai Koutra (CMU) - PAKDD'13

4
Let’s peek... at our contributions
Patterns: power laws between competing tasks; log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios
[Figure axes: ln(comments), ln(tweets)]

7
Roadmap: Data; Observed Patterns; Related Work; Proposed Distribution; Goodness of Fit; Conclusions

8
Data 1: Tencent Weibo
Micro-blogging website in China; 2.2 million users
Tasks extracted: Tweets, Retweets, Comments, Mentions, Followees

9
Data 2: Phonecall Dataset
Phone-call records; 3.1 million users
Tasks extracted: Calls, Messages, Voice friends, SMS friends, Total minutes of phonecalls

10
Roadmap: Data; Observed Patterns; Related Work; Proposed Distribution; Goodness of Fit; Conclusions

11
Pattern 1 - SuRF: Super Linear Relative Frequency (1)
[Figure axes: ln(tweets), ln(retweets); example point ‘Smith’ (1100 retweets, 7 tweets); plot annotation: 0.23]
Logarithmic Binning Fit [Akoglu’10]: 15 log buckets; E[Y|X=x] per bucket; linear regression on conditional means
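The logarithmic binning fit listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `log_binned_fit`, the use of the bucket mean of x as the representative x value, and the equal-width buckets in ln(x) are all assumptions made for the example.

```python
import math


def log_binned_fit(xs, ys, n_buckets=15):
    """Return (slope, intercept) of ln E[Y | X in bucket] against ln(mean X in bucket).

    Hypothetical helper: bucket edges are equally spaced in ln(x) (an assumption).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0]
    lo = math.log(min(x for x, _ in pairs))
    hi = math.log(max(x for x, _ in pairs))
    if hi == lo:
        raise ValueError("need at least two distinct positive x values")
    width = (hi - lo) / n_buckets

    # Accumulate sum of x, sum of y, and count per logarithmic bucket.
    acc = [[0.0, 0.0, 0] for _ in range(n_buckets)]
    for x, y in pairs:
        idx = min(int((math.log(x) - lo) / width), n_buckets - 1)
        acc[idx][0] += x
        acc[idx][1] += y
        acc[idx][2] += 1

    # One point per non-empty bucket: (ln mean x, ln conditional mean of y).
    pts = [(math.log(sx / c), math.log(sy / c))
           for sx, sy, c in acc if c > 0 and sy > 0]

    # Ordinary least squares on the bucket points.
    mu = sum(u for u, _ in pts) / len(pts)
    mv = sum(v for _, v in pts) / len(pts)
    sxx = sum((u - mu) ** 2 for u, _ in pts)
    sxy = sum((u - mu) * (v - mv) for u, v in pts)
    slope = sxy / sxx
    return slope, mv - slope * mu
```

A slope above 1 on such a fit would correspond to a super-linear relationship between the two task counts, and a slope below 1 to a sub-linear one.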

12
Pattern 1 – SuRF (2)
[Figure axes: ln(tweets), ln(comments); plot annotation: 0.304]
Corr coeff: ++
Intuition: 2x tweets, 4x comments

13
Pattern 1 – SuRF (3)
[Figure axes: ln(tweets), ln(mentions); plot annotation: 0.33]
Corr coeff: ++

14
Pattern 1 – SuRF (4)
[Figure axes: ln(followees), ln(retweets); plot annotation: 0.25]
Corr coeff: ++

15
Pattern 1 – SuRF (5)
Super-linear relationship: more calls, even more minutes
[Figure axes: ln(calls_no), ln(total_mins); plot annotation: 1.18]
Corr coeff: ++

16
Pattern 1 – SuRF (6a)
[Figure axes: ln(calls_no), ln(voice_friends)]
2x friends, 3x phonecalls

17
Pattern 1 – SuRF (6b)
[Figure axes: ln(calls_no), ln(voice_friends)]
Telemarketers?

18
Pattern 1 – SuRF (7)
[Figure axes: ln(sms_friends), ln(sms_no)]
2x friends, 5x sms

19
Contributions revisited (1)
Patterns: power laws between competing tasks; log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios
[Figure axes: ln(comments), ln(tweets)]

20
Pattern 2: log-logistic marginals (1)
[Figure: marginal PDF; axes ln(retweets), ln(frequency)]
NOT power law

21
Pattern 2: log-logistic marginals (2)
[Figure: marginal PDF; axes ln(comments), ln(frequency)]
NOT power law

22
Pattern 2: log-logistic marginals (3)
[Figure: marginal PDF; axes ln(mentions), ln(frequency)]
power law

23
Contributions revisited (2)
Patterns: We observe power law relationships between competing tasks and log-logistic distributions for many tasks
Modeling: We propose the Almond-DG distribution for fitting 2-d real world datasets
Practical Use: spot outliers; what-if scenarios

24
Roadmap: Data; Observed Patterns; Proposed Distribution (Problem Definition; Almond-DG; Background: copulas); Goodness of Fit; Conclusions

25
Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency

27
Solutions in the Literature?
Multivariate Logistic [Malik & Abraham, 1973]
Multivariate Pareto Distribution [Mardia, 1962]
Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks
BUT none of them captures the 2-d marginals AND dependency / correlation!!!

28
Roadmap: Related Work; Data; Observed Patterns; Proposed Distribution (Problem Definition; Almond-DG; Background: copulas); Goodness of Fit; Conclusions

30
STEP 1: How to model the marginal distributions?
A: Log-logistic!
Q: Why?
A: Because it mimics Pareto, captures the top concavity, and matches reality
[Figure: marginal PDF; axes ln(retweets), ln(frequency)]

31
Reminder: Log-logistic (1) (BACKGROUND)
The longer you survive the disease, the even longer you survive
Not memoryless
2 parameters: scale (α) and shape (β)
[Figure: log-logistic curves for α = 1 and varying β]

32
Reminder: Log-logistic (2a) (BACKGROUND)
In log-log scales, looks like a hyperbola
[Figure: log-logistic curves for α = 1 and varying β]

33
Reminder: Log-logistic (2b) (BACKGROUND)
In log-log scales, looks like a hyperbola
Blank out the top concavity: power law
[Figure: log-logistic curves for α = 1 and varying β]

34
Fact: Log-logistic (3) (BACKGROUND)
Linear log-odds plots, where odds(x) = Prob(X ≤ x) / Prob(X > x)
[Figure axes: ln(mentions), ln(odds); fitted α = 2.07, β = 1.27]
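Why the log-odds plot is a straight line can be worked out in a few steps. The derivation below assumes the standard parameterization of the log-logistic CDF with scale α and shape β; the slide itself does not spell out the formula, so treat the exact form as an assumption.

```latex
\begin{align*}
F(x) &= \Pr(X \le x) = \frac{1}{1 + (x/\alpha)^{-\beta}}, \qquad x > 0 \\
\mathrm{odds}(x) &= \frac{F(x)}{1 - F(x)} = \left(\frac{x}{\alpha}\right)^{\beta} \\
\ln \mathrm{odds}(x) &= \beta \ln x - \beta \ln \alpha
\end{align*}
```

Under this form, the slope of the line in the (ln x, ln odds) plane is β and the intercept is -β ln α, so both parameters can be read off a fitted line.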

35
Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency
✔ ✔

36
STEP 2: How to model the dependency?
A: we borrow an idea from survival models, financial risk management, decision analysis: COPULA!

38
Copulas in a nutshell (BACKGROUND)
Model dependence between r.v.’s (e.g., X = # of one task, Y = # of another)
Create a multivariate distribution s.t.: the marginals are preserved; the correlation (+, -, none) is captured

39
STEP 2: Which copula?
A: among the many copulas, Gumbel’s copula

40
Applications of Gumbel’s copula (BACKGROUND)
Modeling of:
the dependence between loss and lawyer’s fees in order to calculate reinsurance premiums
the rainfall frequency as a joint distribution of volume, peak, duration etc.
…

41
Gumbel’s copula: Example 1 (BACKGROUND)
Uniform marginals; no dependence

42
Gumbel’s copula: Example 2 (BACKGROUND)
Skewed marginals; no correlation

43
Gumbel’s copula: Example 3 (BACKGROUND)
Skewed marginals; ρ = 0.7

44
Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency
✔ ✔

45
where θ = (1 - ρ)^(-1) captures the dependence; ρ = Spearman’s coefficient
[Figure panels: ρ = 0, ρ = 0.4, ρ = 0.7 and ρ = 0, ρ = 0.2, ρ = 0.7; marginal parameters labeled α = ?, β = ?]
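The equation this slide refers to is not present in the transcript. As a hedged sketch of what the two steps combine into, the block below plugs log-logistic marginals into the textbook form of Gumbel's copula. The subscripted parameter names (α_x, β_x, α_y, β_y) are introduced here for illustration, and the result should be checked against the paper before being relied on. Note also that the textbook identity for Gumbel's copula, θ = 1/(1 - τ), is stated in terms of Kendall's τ, whereas the slide names Spearman's coefficient.

```latex
\begin{align*}
C_\theta(u, v) &= \exp\!\left( -\left[ (-\ln u)^{\theta} + (-\ln v)^{\theta} \right]^{1/\theta} \right),
  \qquad \theta \ge 1 \\
F_X(x) &= \frac{1}{1 + (x/\alpha_x)^{-\beta_x}}, \qquad
F_Y(y) = \frac{1}{1 + (y/\alpha_y)^{-\beta_y}} \\
F(x, y) &= C_\theta\big(F_X(x), F_Y(y)\big) \\
  &= \exp\!\left( -\left[ \Big(\ln\big(1 + (x/\alpha_x)^{-\beta_x}\big)\Big)^{\theta}
     + \Big(\ln\big(1 + (y/\alpha_y)^{-\beta_y}\big)\Big)^{\theta} \right]^{1/\theta} \right)
\end{align*}
```

With θ = 1 (that is, ρ = 0 on the slide) the copula reduces to C(u, v) = uv, so F(x, y) = F_X(x) F_Y(y) and the two task counts are independent.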

46
Almond-DG
If (X,Y) ~ Almond, then (floor(X), floor(Y)) ~ Almond-DG, where X >= 1 and Y >= 1;
i.e., we discretize the values of Almond and reject the pairs with either X = 0 or Y = 0.
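The discretize-and-reject rule above can be sketched as a sampler. This is an illustration under stated assumptions, not the authors' generator: the copula draws use the standard Marshall-Olkin frailty construction with Kanter's representation of a positive stable variable (techniques swapped in here, not taken from the slides), the marginals are inverted with the standard log-logistic quantile function, and all function names are hypothetical.

```python
import math
import random


def _open_unit(rng):
    """Uniform draw from the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def positive_stable(a, rng):
    """Kanter's representation: S > 0 with E[exp(-s*S)] = exp(-s**a), 0 < a <= 1."""
    if a >= 1.0:
        return 1.0  # degenerate case, corresponds to independence (theta = 1)
    u = math.pi * _open_unit(rng)
    w = -math.log(_open_unit(rng))  # Exp(1) draw
    return (math.sin(a * u) / math.sin(u) ** (1.0 / a)) * (
        math.sin((1.0 - a) * u) / w
    ) ** ((1.0 - a) / a)


def gumbel_uniforms(theta, rng):
    """One (u, v) pair from Gumbel's copula (intended for moderate theta >= 1)."""
    s = positive_stable(1.0 / theta, rng)
    pair = []
    for _ in range(2):
        e = -math.log(_open_unit(rng))
        pair.append(math.exp(-((e / s) ** (1.0 / theta))))
    return pair[0], pair[1]


def log_logistic_quantile(u, alpha, beta):
    """Inverse of F(x) = 1 / (1 + (x/alpha)**(-beta)); u is clamped below 1."""
    u = min(u, 1.0 - 1e-12)
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)


def sample_almond_dg(n, theta, alpha_x, beta_x, alpha_y, beta_y, seed=0):
    """Draw n integer pairs: floor the continuous pair, reject if either is 0."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u, v = gumbel_uniforms(theta, rng)
        x = math.floor(log_logistic_quantile(u, alpha_x, beta_x))
        y = math.floor(log_logistic_quantile(v, alpha_y, beta_y))
        if x >= 1 and y >= 1:
            out.append((x, y))
    return out
```

For example, `sample_almond_dg(1000, 1.0 / (1.0 - 0.7), 2.07, 1.27, 2.07, 1.27)` would draw 1000 pairs using the slide's θ = (1 - ρ)^(-1) with ρ = 0.7; reusing the α and β values from the log-odds slide for both marginals is purely illustrative.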

47
Contributions revisited (3)
Patterns: We observe power laws between competing tasks and log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios

48
Roadmap: Related Work; Data; Observed Patterns; Proposed Distribution; Goodness of Fit; Conclusions

49
Synthetic Data Generation
Parameter Estimation
Log-logistic (2 parameters): traditionally MLE, MOM; proposed: log-odd plot (intercept + slope of the line)
Copula-based generation (1 parameter): dependence θ
Evaluation is hard even for 1-d skewed distributions!!! [Chakrabarti, 2006]
[Figure axes: ln(mentions), ln(odds)]
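The proposed log-odds estimation can be sketched as a line fit. This is a minimal version under assumptions: the empirical CDF uses rank/(n+1) so that the odds stay finite, ties keep the highest rank per distinct value, and the line is fitted by unweighted least squares. The function name is hypothetical, and the slides do not say which of these choices the authors made.

```python
import math


def fit_log_logistic_by_log_odds(samples):
    """Estimate (alpha, beta) from the line ln(odds) = beta*ln(x) - beta*ln(alpha)."""
    data = sorted(s for s in samples if s > 0)
    n = len(data)
    pts = []
    for rank, x in enumerate(data, start=1):
        if rank < n and data[rank] == x:
            continue  # tie: use the highest rank for each distinct value
        p = rank / (n + 1)  # empirical Prob(X <= x), kept strictly below 1
        pts.append((math.log(x), math.log(p / (1.0 - p))))

    # Ordinary least squares: slope and intercept of the log-odds line.
    mu = sum(u for u, _ in pts) / len(pts)
    mv = sum(v for _, v in pts) / len(pts)
    sxx = sum((u - mu) ** 2 for u, _ in pts)
    sxy = sum((u - mu) * (v - mv) for u, v in pts)
    slope = sxy / sxx
    intercept = mv - slope * mu

    beta = slope                          # shape = slope of the line
    alpha = math.exp(-intercept / slope)  # scale from the intercept
    return alpha, beta
```

The appeal of this route is that both marginal parameters come from one straight-line fit, with no iterative likelihood maximization.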

50
Goodness of Fit (1a)
[Figure: marginal PDFs, real data vs. synthetic data; axes ln(comments), ln(frequency) and ln(mentions), ln(frequency)]
Check 1 ✔

51
Goodness of Fit (1b)
[Figure: contour plots and conditional means (SuRF), real data vs. synthetic data; axes ln(mentions), ln(comments)]
Check 2 ✔; Check 3 ✔

52
Goodness of Fit (2a)
[Figure: marginal PDFs, real data vs. synthetic data; axes ln(retweets), ln(frequency) and ln(tweets), ln(frequency)]
Check 1 ✔

53
Goodness of Fit (2b)
[Figure: contour plots and conditional means (SuRF), real data vs. synthetic data; axes ln(retweets), ln(tweets)]
Check 2 ✔; Check 3 ✔

54
Roadmap: Related Work; Data; Observed Patterns; Proposed Distribution; Goodness of Fit; Conclusions

55
Conclusions
Patterns: Discovery of power law between competing tasks; log-logistic distributions for many tasks
Modeling: Almond-DG, that explains (i) super-linearity, (ii) marginals and (iii) conditionals in real 2-d data
Practical Use: anomaly detection; what-if scenarios
[Figure axes: ln(comments), ln(tweets)]

56
Thank you!
Almond-DG

57
Backup slides
Likely question areas: ideas glossed over, shortcomings of methods or results, and future work
Why Gumbel? It fits, it has been used in the past + parsimonious (theta, alpha, beta)

58
Why are we interested in these questions?
We can: answer what-if scenarios & spot anomalies.

59
… but
Power Laws: although prevalent, it’s not always the case in real data
Logistic & Log-Logistic Distributions: no earlier work provides a 2-d distribution that explains the patterns found in the real datasets, i.e., super-linearity + log-logistic marginals

60
Goodness of Fit: mentions vs. comments
Evaluation is hard even for univariate skewed distributions

61
STEP 2: How to model the dependency?
with Copulas … and specifically, Gumbel’s copula!

62
Copulas formally (BACKGROUND)
Add simple cases for Gumbel’s copula + properties of it…
