Toyota InfoTechnology Center U.S.A, Inc.
Mixture Models of End-host Network Traffic
John Mark Agosta (Toyota-ITC), Jaideep Chandrashekar (Technicolor), Mark Crovella (Boston U.), Nina Taft (Technicolor), and Daniel Ting (Facebook)
Outline
We collected traffic at the end host, a vantage point that is rarely monitored.
Conventional distributions don’t fit heavy-tailed data.
The dense part of the distribution doesn’t look Pareto, and fitting only the Pareto tail doesn’t describe the data.
We fit mixture models (not the typical Gaussian mixtures) that combine a Pareto tail with exponentials as a proxy for the dense part.
Model selection: the best number of components is chosen under a complexity penalty, and the result is a model of the entire distribution.
Uses:
  Better tail parameter estimates than conventional measures.
  Soft clustering: assign traffic to exponential vs. Pareto components, by protocol.
  More stable threshold setting.
Data collection effort
End-host flows:
  Collected at the laptop network port.
  Collection moved with the device.
  Assembled from packet trace headers.
  On an enterprise XP build.
  Periodic server uploads.
  Logged with user and CPU activity, to eliminate off periods.
Data sets:
  270 personal machine data sets.
  90% laptops.
  5-week duration.
  400 GB of raw data in total.
  Flow initiation counts are binned in intervals from 4 to 512 seconds.
  Zero-count intervals removed.
  Median sample size: 9800 points.
  Maximum sample size: 264k points.
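The binning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bin_flow_counts`, the sample timestamps, and the choice of power-of-two window widths are hypothetical; the slides give only the 4 to 512 second range.

```python
from collections import Counter

def bin_flow_counts(start_times, width):
    """Count flow initiations per fixed-width time bin, dropping empty bins."""
    counts = Counter(int(t // width) for t in start_times)
    # Bins with no flows never appear in the Counter, so zero-count
    # intervals are removed implicitly.
    return [counts[b] for b in sorted(counts)]

# Hypothetical flow start times, in seconds since the start of the trace.
start_times = [0.5, 1.2, 3.9, 4.1, 20.0, 21.5, 22.0, 300.0]

# Assumed window widths: powers of two spanning the stated 4-512 s range.
for width in (4, 8, 16, 32, 64, 128, 256, 512):
    print(width, bin_flow_counts(start_times, width))
```

Each resulting list of counts is one sample to which a distribution can then be fitted; wider windows yield fewer, larger counts.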
Heavy-tailed data is extremely wide compared to conventional distributions.
Fitting any exponential family distribution (e.g. Gaussian, Poisson…) fails. Any exponential tail is too steep.
Fitting a mixture of exponential families requires an impractical number of components.
But just fitting the power law tail ignores most of the probability mass.
(Figure: best-fit normal.)
The distribution looks like an exponential above and a power law below.
(Figure annotations: power law fit; exponential fit; regions marked good fit, bad fit, good fit.)
Exponential – Pareto mixture models.
A mixture model is a hierarchical model in which the mixing weights determine the probability of each component model, and the chosen component in turn generates the sample point.
Since all components share the same support, any sample point could in principle have been generated by any component, with that component’s mixing probability.
We consider three models:
  Pareto (P): one power-law component.
  Exponential – Pareto (EP): one of each.
  Two exponentials, one Pareto (EEP): any more exponential components cannot be resolved.
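A minimal sketch of the EP model as a generative hierarchy and as a density. Several things here are assumptions rather than details from the slides: counts are treated as continuous, both components start at a common lower bound `x_min` (so they share the same support), and all function and parameter names are hypothetical.

```python
import math
import random

def exp_pdf(x, lam, x_min=1.0):
    """Exponential density with rate lam, shifted to start at x_min."""
    return lam * math.exp(-lam * (x - x_min)) if x >= x_min else 0.0

def pareto_pdf(x, alpha, x_min=1.0):
    """Pareto density with tail index alpha and scale x_min."""
    return alpha * x_min ** alpha / x ** (alpha + 1) if x >= x_min else 0.0

def ep_mixture_pdf(x, w_pareto, lam, alpha, x_min=1.0):
    """Weighted sum of the two component densities."""
    return ((1.0 - w_pareto) * exp_pdf(x, lam, x_min)
            + w_pareto * pareto_pdf(x, alpha, x_min))

def sample_ep(n, w_pareto, lam, alpha, x_min=1.0, rng=random):
    """Draw n points: first pick a component, then sample from it."""
    out = []
    for _ in range(n):
        if rng.random() < w_pareto:
            u = 1.0 - rng.random()               # uniform on (0, 1]
            out.append(x_min * u ** (-1.0 / alpha))  # inverse-CDF Pareto draw
        else:
            out.append(x_min + rng.expovariate(lam))
    return out

# Illustrative parameter values only (not fitted to any data).
print(ep_mixture_pdf(1.0, w_pareto=0.25, lam=0.1, alpha=1.6))
print(sample_ep(5, w_pareto=0.25, lam=0.1, alpha=1.6, rng=random.Random(1)))
```

The EEP model would add a second exponential term with its own rate and weight; the P model is the special case `w_pareto = 1`.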
By modeling the entire data set, mixture models give more accurate estimates of the tail parameter α than methods that consider only the tail data.
When tested on synthetic Pareto-tailed data, the EP mixture model estimator performs significantly better than the well-known AEST method. (AEST estimates are shown on the left, and EP-based estimates on the right, in each panel.)
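To illustrate how a tail estimate can come out of a fit to the whole sample, here is a rough sketch that fits an EP mixture to synthetic data with expectation-maximization (EM). The slides do not say which fitting procedure the authors used, so EM is an assumption, as are the shifted-exponential parameterization, the fixed `x_min`, the synthetic parameter values, and every name in the code.

```python
import math
import random

def fit_ep_em(xs, x_min=1.0, iters=200):
    """EM for a shifted-exponential + Pareto mixture; assumes all xs >= x_min."""
    n = len(xs)
    shifted = [x - x_min for x in xs]
    logs = [math.log(x / x_min) for x in xs]
    # Start from the single-component maximum-likelihood estimates.
    w, lam, alpha = 0.5, n / sum(shifted), n / sum(logs)
    for _ in range(iters):
        # E-step: probability that each point came from the Pareto component.
        r = []
        for s, lg in zip(shifted, logs):
            p_par = w * (alpha / x_min) * math.exp(-(alpha + 1.0) * lg)
            p_exp = (1.0 - w) * lam * math.exp(-lam * s)
            r.append(p_par / (p_par + p_exp))
        # M-step: weighted closed-form updates for each parameter.
        r_sum = sum(r)
        w = r_sum / n
        alpha = r_sum / sum(ri * lg for ri, lg in zip(r, logs))
        lam = (n - r_sum) / sum((1.0 - ri) * s for ri, s in zip(r, shifted))
    return w, lam, alpha

# Synthetic data from a known mixture (illustrative values only).
rng = random.Random(0)
true_w, true_lam, true_alpha = 0.25, 0.1, 1.6
data = []
for _ in range(8000):
    if rng.random() < true_w:
        data.append((1.0 - rng.random()) ** (-1.0 / true_alpha))  # Pareto, x_min = 1
    else:
        data.append(1.0 + rng.expovariate(true_lam))              # shifted exponential

w_hat, lam_hat, alpha_hat = fit_ep_em(data)
print("Pareto weight:", w_hat, "rate:", lam_hat, "alpha:", alpha_hat)
```

Because every point contributes to the likelihood, the estimate of α uses information from the dense part of the sample as well as from the few extreme values.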
Model Selection versus Goodness-of-Fit
Goodness-of-fit tests, while useful for initial characterization, don’t have an explicit acceptance criterion and, as the data set size increases, will eventually reject all models.
Model selection is a relative, pairwise criterion that derives from a comparison of likelihoods.
We use the Bayesian Information Criterion (BIC) to approximate the Bayes Factor terms. It penalizes the maximum log-likelihood by the model degrees of freedom, d, so that models with different numbers of parameters can be compared.
The Bayes Factor is the ratio of the marginal likelihood of one model (EP) to that of another (P). For instance, a log Bayes Factor of 5 indicates that the probability of the data given one model versus the other is over 100:1.
With the BIC approximation, the log Bayes Factor becomes the difference between the two models’ penalized maximum log-likelihoods.
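A small sketch of that comparison, assuming the common BIC form in which each model's maximized log-likelihood is reduced by (d/2)·ln n for a sample of n points. The log-likelihood values and the parameter counts (1 for P, 3 for EP) are hypothetical illustrations; only the median sample size of 9800 comes from the slides.

```python
import math

def bic_log_bayes_factor(loglik_a, d_a, loglik_b, d_b, n):
    """Approximate ln(Bayes Factor) of model A over model B via BIC."""
    penalized_a = loglik_a - 0.5 * d_a * math.log(n)
    penalized_b = loglik_b - 0.5 * d_b * math.log(n)
    return penalized_a - penalized_b

# Hypothetical maximized log-likelihoods for one user's sample.
loglik_ep, d_ep = -1000.0, 3   # assumed: mixing weight, exponential rate, alpha
loglik_p, d_p = -1020.0, 1     # assumed: alpha only
n = 9800                       # median sample size reported in the slides

log_bf = bic_log_bayes_factor(loglik_ep, d_ep, loglik_p, d_p, n)
print("log BF (EP vs P):", log_bf)

# A natural-log Bayes Factor of 5 corresponds to odds of e^5 to 1.
print("odds at log BF = 5:", math.exp(5))
```

A positive value favors the first model; the penalty term grows with ln n, so an extra component must buy a correspondingly larger gain in log-likelihood on larger samples.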
Pairwise BIC comparisons reveal large log BF values for EP vs. P and smaller values for EEP vs. EP.
(Figures: boxplot of BIC comparison for Pareto vs. EP mixture model; boxplot of BIC comparison for EP vs. EEP mixture model.)
Model Selection Results
Model selection results based on Bayes Factors, over all users. Each bar represents the same user set with a different binning time window. For the P, EP, and EEP models:
  P: Only a handful of users are given the Pareto-only model.
  EP: Overall, the EP model is selected for 50–85% of the users, depending on the bin size.
  EEP: Between 15% and 40% of user machines are best modeled by EEP, again depending on the bin size.
(Figure legend: P, EP, EEP.)
Histograms of Heavy-Tail Parameters’ Variation, EP Model.
The difference across users is significant.
Partitioning traffic into Exponential and Pareto ranges
Mixture fractions, as a function of the number of connections, indicate the (soft) membership of the data in each component.
In this example, bins with fewer than 82 counts are almost entirely exponential, and those with more than 82 are almost entirely Pareto.
In this way, different sources of the traffic can be characterized as heavy-tailed or not.
(Figure: Mixture Fractions, User 256; curves m Pareto and m exp; axis P(traffic).)
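The mixture fraction at a given count is the posterior probability that the count came from a particular component. The sketch below computes the Pareto fraction for an EP model and locates the count above which Pareto dominates. The parameter values are illustrative only (they are not User 256's fitted values), and the function names, the shifted-exponential form, and `x_min` are assumptions.

```python
import math

def pareto_fraction(x, w_pareto, lam, alpha, x_min=1.0):
    """Posterior probability that a count x came from the Pareto component."""
    p_par = w_pareto * alpha * x_min ** alpha / x ** (alpha + 1)
    p_exp = (1.0 - w_pareto) * lam * math.exp(-lam * (x - x_min))
    return p_par / (p_par + p_exp)

def tail_crossover(w_pareto, lam, alpha, x_min=1.0, x_max=10_000):
    """Smallest integer count from which the Pareto fraction stays >= 0.5.

    Scans downward from x_max, because the Pareto fraction can also be
    high very close to x_min before the exponential takes over.
    """
    for x in range(x_max, int(x_min), -1):
        if pareto_fraction(x, w_pareto, lam, alpha, x_min) < 0.5:
            return x + 1
    return int(x_min)

# Illustrative parameters (hypothetical, not taken from the slides).
params = dict(w_pareto=0.25, lam=0.1, alpha=1.6)
threshold = tail_crossover(**params)
print("crossover count:", threshold)
for x in (20, threshold, 300):
    print(x, round(pareto_fraction(x, **params), 3))
```

Summing these fractions over the bins attributed to each protocol is one way to obtain per-protocol exponential and Pareto traffic shares.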
Traffic Fractions, in Exponential and Pareto Components, by Protocol
Although exponential traffic dominates in all cases, the long-tail (i.e. Pareto) traffic comes largely from bursts of ICMP, DNS and web traffic flows.
In summary
1. We have modeled traffic as flow initiations from end hosts in an enterprise, using mixture models and model selection.
2. We have found strong evidence that the traffic is almost always heavy-tailed, with the Pareto component contributing about 1/4 of the probability mass and a power-law scaling parameter with mean 1.6 that varies widely, between 1.0 and 2.0.
3. DNS, ICMP and some web traffic apparently make up the tail component.
See the full paper at http://arxiv.org/abs/1212.2744