Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model (School of Computer Science, Carnegie Mellon University; Athens University of Economics & Business)

Presentation transcript:

Slide 1: Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model
Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos
School of Computer Science, Carnegie Mellon University; Athens University of Economics & Business
PAKDD, April 2013, Gold Coast, Australia

Slides 2-3: Questions we answer
• Patterns: If Bob executes task x n_x times, how many times does he execute task y?
• Modeling: Which 2-d distribution fits 2-d clouds of points?
Figure: 2-d cloud of points, one per user; example point 'Smith' (100 calls, 700 sms).
© Danai Koutra (CMU) - PAKDD'13

Slides 4-6: Let's peek at our contributions
• Patterns: power laws between competing tasks; log-logistic distributions for many tasks
• Modeling: Almond-DG distribution for 2-d real datasets
• Practical use: spot outliers; what-if scenarios
Figure axes: ln(comments), ln(tweets).

Slide 7: Roadmap
Data; Observed Patterns; Related Work; Proposed Distribution; Goodness of Fit; Conclusions

Slide 8: Data 1: Tencent Weibo
Micro-blogging website in China; 2.2 million users.
Tasks extracted:
• Tweets
• Retweets
• Comments
• Mentions
• Followees

Slide 9: Data 2: Phonecall Dataset
Phone-call records; 3.1 million users.
Tasks extracted:
• Calls
• Messages
• Voice friends
• SMS friends
• Total minutes of phonecalls

Slide 10: Roadmap
Data; Observed Patterns; Related Work; Proposed Distribution; Goodness of Fit; Conclusions

Slide 11: Pattern 1 - SuRF: Super Linear Relative Frequency (1)
Figure axes: ln(tweets), ln(retweets); example point 'Smith' (1100 retweets, 7 tweets); value annotated on the plot: 0.23.
Logarithmic Binning Fit [Akoglu'10]:
• 15 log buckets
• E[Y|X=x] per bucket
• linear regression on conditional means
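The three steps of the logarithmic binning fit can be sketched as follows. This is a minimal illustration, not the procedure of [Akoglu'10] itself: the function name `log_binning_fit`, the equal-width buckets in ln(x), the use of the mean of ln(x) as each bucket's representative, and the restriction to positive counts are all assumptions made here.

```python
import math
import random


def log_binning_fit(xs, ys, n_buckets=15):
    """Fit ln E[Y | X] = slope * ln(x) + intercept using logarithmic buckets."""
    # Assumption of this sketch: only strictly positive counts are used.
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0 and y > 0]
    log_x = [math.log(x) for x, _ in pairs]
    lo, hi = min(log_x), max(log_x)
    if hi == lo:
        raise ValueError("need at least two distinct x values")
    width = (hi - lo) / n_buckets

    # Per bucket: sum of ln(x), sum of y, and number of points.
    buckets = [[0.0, 0.0, 0] for _ in range(n_buckets)]
    for lx, (_, y) in zip(log_x, pairs):
        idx = min(int((lx - lo) / width), n_buckets - 1)
        buckets[idx][0] += lx
        buckets[idx][1] += y
        buckets[idx][2] += 1

    # One point per non-empty bucket: (mean ln(x), ln of the conditional mean of y).
    pts = [(sx / n, math.log(sy / n)) for sx, sy, n in buckets if n > 0]

    # Ordinary least squares on the bucket-level points.
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic check: y = 3 * x**1.5 should give a log-log slope close to 1.5.
rng = random.Random(0)
xs = [math.exp(rng.uniform(0.0, math.log(1e4))) for _ in range(20000)]
ys = [3.0 * x ** 1.5 for x in xs]
slope, intercept = log_binning_fit(xs, ys)
print(round(slope, 2), round(intercept, 2))
```

Regressing on bucket-level conditional means rather than on raw points keeps the few heavy users from being swamped by the many light users, which is presumably why the slide bins before fitting.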

Slide 12: Pattern 1 - SuRF (2)
Figure axes: ln(tweets), ln(comments); value annotated on the plot: 0.304.
Corr coeff: ++
Intuition: 2x tweets, 4x comments

Slide 13: Pattern 1 - SuRF (3)
Figure axes: ln(tweets), ln(mentions); value annotated on the plot: 0.33.
Corr coeff: ++

Slide 14: Pattern 1 - SuRF (4)
Figure axes: ln(followees), ln(retweets); value annotated on the plot: 0.25.
Corr coeff: ++

Slide 15: Pattern 1 - SuRF (5)
Super-linear relationship: more calls, even more minutes.
Figure axes: ln(calls_no), ln(total_mins); value annotated on the plot: 1.18.
Corr coeff: ++

Slide 16: Pattern 1 - SuRF (6a)
Figure axes: ln(calls_no), ln(voice_friends).
Intuition: 2x friends, 3x phonecalls

Slide 17: Pattern 1 - SuRF (6b)
Figure axes: ln(calls_no), ln(voice_friends); annotation on the plot: Telemarketers?

Slide 18: Pattern 1 - SuRF (7)
Figure axes: ln(sms_friends), ln(sms_no).
Intuition: 2x friends, 5x sms

Slide 19: Contributions revisited (1)
• Patterns: power laws between competing tasks; log-logistic distributions for many tasks
• Modeling: Almond-DG distribution for 2-d real datasets
• Practical use: spot outliers; what-if scenarios
Figure axes: ln(comments), ln(tweets).

Slide 20: Pattern 2: log-logistic marginals (1)
Marginal PDF; figure axes: ln(retweets), ln(frequency). NOT power law.

Slide 21: Pattern 2: log-logistic marginals (2)
Marginal PDF; figure axes: ln(comments), ln(frequency). NOT power law.

Slide 22: Pattern 2: log-logistic marginals (3)
Marginal PDF; figure axes: ln(mentions), ln(frequency). Power law.

Slide 23: Contributions revisited (2)
• Patterns: we observe power law relationships between competing tasks, and log-logistic distributions for many tasks
• Modeling: we propose the Almond-DG distribution for fitting 2-d real-world datasets
• Practical use: spot outliers; what-if scenarios

Slide 24: Roadmap
Data; Observed Patterns; Proposed Distribution (Problem Definition; Almond-DG; Background: copulas); Goodness of Fit; Conclusions

Slide 25: Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency

Slides 26-27: Solutions in the Literature?
• Multivariate Logistic [Malik & Abraham, 1973]
• Multivariate Pareto Distribution [Mardia, 1962]
• Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks
BUT none of them captures the 2-d marginals AND the dependency / correlation!

Slide 28: Roadmap
Related Work; Data; Observed Patterns; Proposed Distribution (Problem Definition; Almond-DG; Background: copulas); Goodness of Fit; Conclusions

Slide 29: Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency

Slide 30: STEP 1: How to model the marginal distributions?
A: Log-logistic!
Q: Why?
A: Because it
• mimics Pareto
• captures the top concavity
• matches reality
Figure: marginal PDF; axes: ln(retweets), ln(frequency).

Slide 31: Reminder: Log-logistic (1) [Background]
• The longer you survive the disease, the even longer you survive
• Not memoryless
• 2 parameters: scale (α) and shape (β)
Figure: log-logistic curves with α = 1 and varying β.

Slides 32-33: Reminder: Log-logistic (2a, 2b) [Background]
In log-log scales, the log-logistic looks like a hyperbola.
Blank out the top concavity → power law.
Figure: log-logistic curves with α = 1 and varying β.

Slide 34: Fact: Log-logistic (3) [Background]
The log-odds plot of a log-logistic is linear.
Figure axes: ln(mentions), ln(odds), with the odds computed from the cumulative probability Prob(X ≤ x); fitted values: α = 2.07, β = 1.27.
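Why the log-odds plot is a straight line can be checked in two steps. The derivation below assumes the standard log-logistic CDF with scale α and shape β; the slide does not spell out the formula, so this parametrization is an assumption.

```latex
\begin{align}
F(x) &= \Pr(X \le x) = \frac{1}{1 + (x/\alpha)^{-\beta}}, \qquad x > 0 \\
\frac{F(x)}{1 - F(x)} &= \left(\frac{x}{\alpha}\right)^{\beta}
\quad\Longrightarrow\quad
\ln\frac{F(x)}{1 - F(x)} = \beta \ln x - \beta \ln \alpha
\end{align}
```

Under this parametrization the slope of the line is β and the intercept is −β ln α, so both parameters can be read off a fitted line.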

Slide 35: Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency ✔ ✔

Slide 36: STEP 2: How to model the dependency?
A: We borrow an idea from survival models, financial risk management, and decision analysis: the COPULA!

Slides 37-38: Copulas in a nutshell [Background]
Model the dependence between r.v.'s (e.g., X = # of one task, Y = # of another task).
Create a multivariate distribution s.t.:
• the marginals are preserved
• the correlation (+, -, none) is captured

Slide 39: STEP 2: Which copula?
A: Among the many copulas, Gumbel's copula.

Slide 40: Applications of Gumbel's copula [Background]
Modeling of:
• the dependence between loss and lawyer's fees in order to calculate reinsurance premiums
• the rainfall frequency as a joint distribution of volume, peak, duration etc.
• …

Slide 41: Gumbel's copula: Example 1 [Background]
Uniform marginals; no dependence.

Slide 42: Gumbel's copula: Example 2 [Background]
Skewed marginals; no correlation.

Slide 43: Gumbel's copula: Example 3 [Background]
Skewed marginals; ρ = 0.7.

Slide 44: Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency ✔ ✔

Slide 45: Almond distribution (the formula itself is not captured in this transcript)
where θ = (1 − ρ)^(−1) captures the dependence, and ρ = Spearman's coefficient.
Figure panels: ρ = 0, ρ = 0.4, ρ = 0.7 and ρ = 0, ρ = 0.2, ρ = 0.7; marginal parameters labeled α = ?, β = ?.
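A hedged reconstruction of the joint distribution, assembled only from the pieces the talk names: log-logistic marginals (STEP 1) joined by Gumbel's copula (STEP 2). The Gumbel copula formula is the standard one; the symbols F_X, F_Y and the subscripted α_x, β_x, α_y, β_y are notation assumed here, and the exact form on the original slide may differ.

```latex
\begin{align}
C_\theta(u, v) &= \exp\!\left(-\left[(-\ln u)^{\theta} + (-\ln v)^{\theta}\right]^{1/\theta}\right),
  \qquad \theta \ge 1 \\
F_X(x) &= \frac{1}{1 + (x/\alpha_x)^{-\beta_x}}, \qquad
F_Y(y) = \frac{1}{1 + (y/\alpha_y)^{-\beta_y}} \\
F_{X,Y}(x, y) &= C_\theta\big(F_X(x),\, F_Y(y)\big)
\end{align}
```

With the slide's mapping θ = (1 − ρ)^(−1), ρ = 0 gives θ = 1, for which the copula reduces to the product uv, i.e. independent marginals.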

Slide 46: Almond-DG
If (X, Y) ~ Almond, then (floor(X), floor(Y)) ~ Almond-DG, where X >= 1 and Y >= 1;
i.e., we discretize the values of Almond and reject the pairs with either X = 0 or Y = 0.
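The discretize-and-reject rule can be sketched as a small sampler. Everything below is an illustration under stated assumptions rather than the authors' generator: the function names are hypothetical, the Gumbel copula is sampled with the Marshall-Olkin construction (a positive stable variable drawn via Kanter's formula), the marginals use the standard log-logistic quantile function, and the marginal parameters reuse the α = 2.07, β = 1.27 values from the log-odds slide purely as example inputs.

```python
import math
import random


def sample_gumbel_copula(theta, rng):
    """Return one (u, v) pair from Gumbel's copula (theta >= 1)."""
    if theta < 1.0:
        raise ValueError("theta must be >= 1")
    if theta == 1.0:  # theta = 1 is the independence copula
        return rng.random(), rng.random()
    a = 1.0 / theta
    # Positive stable variable with Laplace transform exp(-s**a) (Kanter's formula).
    ang = math.pi * (1.0 - rng.random())        # angle in (0, pi]
    w = max(rng.expovariate(1.0), 1e-300)       # Exp(1), kept away from exactly 0
    stable = (math.sin(a * ang) / math.sin(ang) ** theta) * (
        math.sin((1.0 - a) * ang) / w
    ) ** (theta - 1.0)
    # Marshall-Olkin step: u = exp(-(E / S)**(1/theta)) for independent E ~ Exp(1).
    u = math.exp(-((rng.expovariate(1.0) / stable) ** a))
    v = math.exp(-((rng.expovariate(1.0) / stable) ** a))
    return u, v


def loglogistic_quantile(u, alpha, beta):
    """Inverse CDF of a log-logistic with scale alpha and shape beta."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # avoid division by zero at the ends
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)


def sample_almond_dg(n, theta, marg_x, marg_y, seed=0):
    """Draw n integer pairs: floor the continuous draw, reject pairs with a zero."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u, v = sample_gumbel_copula(theta, rng)
        x = math.floor(loglogistic_quantile(u, *marg_x))
        y = math.floor(loglogistic_quantile(v, *marg_y))
        if x >= 1 and y >= 1:
            out.append((x, y))
    return out


# Example inputs (assumed): rho = 0.7 mapped to theta = 1 / (1 - rho).
theta = 1.0 / (1.0 - 0.7)
points = sample_almond_dg(1000, theta, (2.07, 1.27), (2.07, 1.27))
print(points[:5])
```

Rejection keeps the code close to the slide's wording; the cost is that the effective marginals of the accepted pairs are the log-logistic marginals conditioned on both coordinates being at least 1.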

Slide 47: Contributions revisited (3)
• Patterns: we observe power laws between competing tasks, and log-logistic distributions for many tasks
• Modeling: Almond-DG distribution for 2-d real datasets
• Practical use: spot outliers; what-if scenarios

Slide 48: Roadmap
Related Work; Data; Observed Patterns; Proposed Distribution; Goodness of Fit; Conclusions

Slide 49: Synthetic Data Generation
Parameter estimation (log-logistic marginals):
• Traditionally: MLE, MOM
• Proposed: log-odds plot; 2 parameters, from the intercept and slope of the line
Copula-based generation:
• 1 parameter: dependence θ
Evaluation is hard even for 1-d skewed distributions! [Chakrabarti, 2006]
Figure axes: ln(mentions), ln(odds).
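The log-odds estimation step can be sketched as a least-squares line fit. This is an illustrative version, not the authors' code: the function name is hypothetical, the standard log-logistic CDF F(x) = 1 / (1 + (x/α)^(−β)) is assumed (so slope = β and intercept = −β ln α), and trimming the extreme 5% of each tail before fitting is a choice made here to reduce noise.

```python
import math
import random


def fit_loglogistic_log_odds(samples, lo_q=0.05, hi_q=0.95):
    """Estimate (alpha, beta) from the line ln(odds) = beta*ln(x) - beta*ln(alpha)."""
    xs = sorted(s for s in samples if s > 0)
    n = len(xs)
    pts = []
    for i, x in enumerate(xs):
        f = (i + 0.5) / n                      # empirical CDF at the i-th order statistic
        if lo_q <= f <= hi_q:                  # assumed trimming of the noisy tails
            pts.append((math.log(x), math.log(f / (1.0 - f))))

    # Ordinary least squares of ln(odds) on ln(x).
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    slope = sxy / sxx
    intercept = my - slope * mx

    beta = slope
    alpha = math.exp(-intercept / slope)
    return alpha, beta


# Synthetic check: draw from a log-logistic via its inverse CDF, then recover the parameters.
rng = random.Random(0)
true_alpha, true_beta = 2.07, 1.27
draws = []
for _ in range(20000):
    u = rng.random()
    draws.append(true_alpha * (u / (1.0 - u)) ** (1.0 / true_beta))
alpha_hat, beta_hat = fit_loglogistic_log_odds(draws)
print(round(alpha_hat, 2), round(beta_hat, 2))
```

On integer-valued task counts the same fit applies to the empirical CDF, although ties make the low end of the plot coarser than in this continuous example.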

Slide 50: Goodness of Fit (1a)
Marginal PDFs, real data vs. synthetic data; figure axes: ln(comments), ln(frequency) and ln(mentions), ln(frequency).
Check 1 ✔

Slide 51: Goodness of Fit (1b)
Contour plots and conditional means (SuRF), real data vs. synthetic data; figure axes: ln(mentions), ln(comments).
Checks 2 and 3 ✔

Slide 52: Goodness of Fit (2a)
Marginal PDFs, real data vs. synthetic data; figure axes: ln(retweets), ln(frequency) and ln(tweets), ln(frequency).
Check 1 ✔

Slide 53: Goodness of Fit (2b)
Contour plots and conditional means (SuRF), real data vs. synthetic data; figure axes: ln(retweets), ln(tweets).
Checks 2 and 3 ✔

Slide 54: Roadmap
Related Work; Data; Observed Patterns; Proposed Distribution; Goodness of Fit; Conclusions

Slide 55: Conclusions
• Patterns: discovery of power laws between competing tasks, and log-logistic distributions for many tasks
• Modeling: Almond-DG, which explains (i) super-linearity, (ii) marginals and (iii) conditionals in real 2-d data
• Practical use: anomaly detection; what-if scenarios
Figure axes: ln(comments), ln(tweets).

Slide 56: Thank you! (Almond-DG)

Slide 57: Backup slides
Likely question areas: ideas glossed over, shortcomings of methods or results, and future work.
Why Gumbel? It fits, it has been used in the past, and it is parsimonious (theta, alpha, beta).

Slide 58: Why are we interested in these questions?
We can answer what-if scenarios and spot anomalies.

Slide 59: … but
• Power laws: although prevalent, they do not always hold in real data.
• Logistic & log-logistic distributions: no earlier work provides a 2-d distribution that explains the patterns found in the real datasets, i.e., super-linearity + log-logistic marginals.

Slide 60: Goodness of Fit: mentions vs. comments
Evaluation is hard even for univariate skewed distributions.

Slide 61: STEP 2: How to model the dependency?
With copulas … and specifically, Gumbel's copula!

Slide 62: Copulas formally [Background]
Add simple cases for Gumbel's copula + properties of it…
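As a sketch of what this backup slide points at, here are the formal statement (Sklar's theorem) and the two simplest cases of Gumbel's copula. These are standard textbook results rather than content recovered from the slide; the symbols H, F, G and C are notation assumed here.

```latex
\begin{align}
&\text{Sklar: for a joint CDF } H \text{ with marginals } F, G, \text{ there is a copula } C \text{ with} \notag \\
&\qquad H(x, y) = C\big(F(x),\, G(y)\big), \quad C \text{ unique when } F, G \text{ are continuous.} \\
&\text{Gumbel: } C_\theta(u, v) = \exp\!\left(-\left[(-\ln u)^{\theta} + (-\ln v)^{\theta}\right]^{1/\theta}\right),
  \quad \theta \ge 1 \\
&\theta = 1: \quad C_1(u, v) = uv \quad \text{(independence)} \\
&\theta \to \infty: \quad C_\theta(u, v) \to \min(u, v) \quad \text{(perfect positive dependence)}
\end{align}
```

The single parameter θ therefore sweeps from no dependence to perfect positive dependence, which matches the "parsimonious" argument for choosing Gumbel given on the backup slides.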