
Slide 1: A Guided Tour of Finite Mixture Models: From Pearson to the Web. ICML '01 Keynote Talk, Williams College, MA, June 29, 2001. Padhraic Smyth, Information and Computer Science, University of California, Irvine. www.datalab.uci.edu

Slide 2: Outline
- What are mixture models? Definitions and examples
- How can we learn mixture models? A brief history and illustration
- What are mixture models useful for? Applications in Web and transaction data
- Recent research in mixtures

Slide 3: Acknowledgements
- Students: Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
- Collaborators: David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert
- Funding: NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software

Slides 4-7: Finite Mixture Models (section title, built up over four slides)

Slide 8: Finite Mixture Models. [Equation, with annotations: weight α_k, component model p_k, parameters θ_k.]
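Reconstructed from the slide's annotations, the general finite mixture density has the standard form (notation mine):

```latex
% Finite mixture density: K weighted component models
p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, p_k(x \mid \theta_k),
\qquad \alpha_k \ge 0, \quad \sum_{k=1}^{K} \alpha_k = 1
```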

Slides 9-11: Example: Mixture of Gaussians
- Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance "shape" Σ_k
- E.g., K=2, 1-dim: θ = {μ1, σ1, μ2, σ2, α}
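A minimal numeric sketch of this K=2, 1-dim case; the parameter values below are invented for illustration, not taken from the talk:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, alpha, mu1, sigma1, mu2, sigma2):
    """Two-component 1-dim Gaussian mixture: alpha N(mu1, s1^2) + (1 - alpha) N(mu2, s2^2)."""
    return alpha * gaussian_pdf(x, mu1, sigma1) + (1 - alpha) * gaussian_pdf(x, mu2, sigma2)

# theta = {mu1, sigma1, mu2, sigma2, alpha}, illustrative values only
x = np.linspace(-4.0, 8.0, 200)
p = mixture_pdf(x, alpha=0.3, mu1=0.0, sigma1=1.0, mu2=4.0, sigma2=1.5)
```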

Slides 12-14: (figure-only slides)

Slides 15-17: Example: Mixture of Naïve Bayes. Conditional independence model for each component (often quite useful to first order).
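To make "conditional independence within each component" concrete, here is a sketch for binary data, where each component is a product of independent Bernoullis; the Bernoulli choice is an assumption, picked to match the term-document example on the next slides:

```python
import numpy as np

def component_loglik(x, theta_k):
    """log p(x | component k) under conditional independence:
    each binary attribute x[j] is an independent Bernoulli(theta_k[j])."""
    return np.sum(x * np.log(theta_k) + (1 - x) * np.log(1 - theta_k))

def mixture_loglik(x, weights, thetas):
    """log p(x) = log sum_k alpha_k * prod_j p(x_j | theta_k)."""
    logs = [np.log(w) + component_loglik(x, t) for w, t in zip(weights, thetas)]
    m = max(logs)  # log-sum-exp for numerical stability
    return m + np.log(sum(np.exp(l - m) for l in logs))
```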

Slides 18-19: Mixtures of Naïve Bayes. [Figure: sparse binary term-document matrix (rows = documents, columns = terms), with the rows regrouped into Component 1 and Component 2.]

Slide 20: Other Component Models
- Mixtures of rectangles: Pelleg and Moore (ICML, 2001)
- Mixtures of trees: Meila and Jordan (2000)
- Mixtures of curves: Quandt and Ramsey (1978)
- Mixtures of sequences: Poulsen (1990)

Slides 21-23: Interpretation of Mixtures
1. C has a direct (physical) interpretation, e.g., C = age of fish, or C = {male, female}
2. C might have an interpretation, e.g., clusters of Web surfers
3. C is just a convenient latent variable, e.g., for flexible density estimation

Slides 24-26: Graphical Models for Mixtures. E.g., mixtures of Naïve Bayes: [diagram: node C (discrete, hidden) with arrows to X_1, X_2, X_3 (observed)].

Slides 27-28: Sequential Mixtures. [Diagram: C with children X_1, X_2, X_3, replicated at times t-1, t, t+1, with C linked across time.]
- Markov mixtures = C has Markov dependence = hidden Markov model (here with naïve Bayes observations)
- C is a discrete state that couples the observables through time

Slide 29: Dynamic Mixtures
- Computer vision: mixtures of Kalman filters for tracking
- Atmospheric science: mixtures of curves and dynamical models for cyclones
- Economics: mixtures of switching regressions for the US economy

Slide 30: Limitations of Mixtures
- Discrete state space: not always appropriate, e.g., in modeling dynamical systems
- Training: no closed-form solution, can be tricky
- Interpretability: many different mixture solutions may explain the same data

Slide 31: Learning of Mixture Models (section title)

Slide 32: Learning Mixtures from Data. Consider fixed K, e.g., K=2, 1-dim, with unknown parameters θ = {μ1, σ1, μ2, σ2, α}. Given data D = {x_1, ..., x_N}, we want to find the parameters θ that "best fit" the data.

Slide 33: Early Attempts. Weldon's data, 1893: n = 1000 crabs from the Bay of Naples; ratio of forehead to body length; suspected existence of 2 separate species.

Slide 34: Early Attempts. Karl Pearson, 1894: JRSS paper proposing a mixture of 2 Gaussians with 5 parameters θ = {μ1, σ1, μ2, σ2, α}; parameter estimation by the method of moments, which involved solving an equation of the 9th degree! (See Chapter 10 of Stigler (1986), The History of Statistics.)

Slide 35: "The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task... But I fear he will have few successors..." Charlier (1906)

Slides 36-37: Maximum Likelihood Principle. Fisher, 1922:
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely

Slide 38: 1977: The EM Algorithm. Dempster, Laird, and Rubin: a general framework for likelihood-based parameter estimation with missing data.
- Start with initial guesses of the parameters
- E-step: estimate memberships given the parameters
- M-step: estimate parameters given the memberships
- Repeat until convergence
- Converges to a (local) maximum of the likelihood
- The E-step and M-step are often computationally simple
- Generalizes to maximum a posteriori estimation (with priors)
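A minimal sketch of the E/M loop for the two-Gaussian, 1-dim case from the earlier slides; the initialization and variable names are mine, not from the talk:

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for a 2-component 1-dim Gaussian mixture; returns (alpha, mus, sigmas)."""
    alpha = 0.5                                   # crude initial guesses
    mus = np.array([x.min(), x.max()], dtype=float)
    sigmas = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: membership probabilities given the current parameters
        w = np.array([alpha, 1.0 - alpha])
        dens = np.stack([
            np.exp(-0.5 * ((x - mus[k]) / sigmas[k]) ** 2) / (sigmas[k] * np.sqrt(2 * np.pi))
            for k in range(2)
        ])                                        # shape (2, N)
        resp = w[:, None] * dens
        resp /= resp.sum(axis=0)
        # M-step: weighted maximum-likelihood estimates given memberships
        nk = resp.sum(axis=1)
        mus = (resp @ x) / nk
        sigmas = np.sqrt((resp * (x - mus[:, None]) ** 2).sum(axis=1) / nk)
        alpha = nk[0] / len(x)
    return alpha, mus, sigmas
```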

Slide 39: (figure-only slide)

Slide 40: Example of a Log-Likelihood Surface. [Surface plot; axes: mean 2 vs. log scale for sigma 2.]

Slides 41-49: (figure-only slides)

Slide 50: [Figure: Anemia Group vs. Control Group.]

Slide 51: Data for an Individual Patient. [Figure: healthy state vs. anemic state.] (Cadez et al., Machine Learning, in press)

Slide 52: Alternatives to EM
- Method of moments: EM is more efficient
- Direct optimization, e.g., gradient descent, Newton methods: EM is simpler to implement
- Sampling (e.g., MCMC)
- Minimum distance, e.g., [equation on slide not recovered]

Slide 53: How Many Components? Two general approaches:
1. Best density estimator, e.g., what predicts best on new data
2. True number of components: cannot be answered from data alone

Slide 54: [Diagram: the data-generating process ("truth") vs. the K=2 model class, and the "closest" model in the class in terms of KL distance.]

Slide 55: [Diagram: the data-generating process ("truth") vs. the K=10 model class.]

Slide 56: Prescriptions for Model Selection. Minimize distance to "truth":
- Maximize the predictive "logp score": gives an estimate of KL(model, truth); pick the model that predicts best on validation data
- Bayesian techniques: p(K | data) is impossible to compute exactly; closed-form approximations (BIC, AutoClass, etc.); sampling / Monte Carlo techniques, quite tricky for mixtures
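One way to implement the "predict best on validation data" prescription, sketched here with scikit-learn's GaussianMixture; the data split and candidate range are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pick_k_by_validation(x_train, x_val, k_max=10):
    """Pick K that maximizes the held-out mean log-likelihood (the "logp score").
    x_train, x_val: arrays of shape (n_samples, n_features)."""
    best_k, best_score = 1, -np.inf
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(x_train)
        score = gm.score(x_val)  # mean per-sample log-likelihood on validation data
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```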

Slide 57: Mixture Model Applications (section title)

Slide 58: What Are Mixtures Used For?
- Modeling heterogeneity, e.g., inferring multiple species in biology
- Handling missing data, e.g., variables and cases missing during model-building
- Density estimation, e.g., as flexible priors in Bayesian statistics
- Clustering: components as clusters
- Model averaging: combining density models

Slide 59: Mixtures of Non-Vector Data. Example: N individuals, with sets of sequences for each, e.g., Web session data. How to cluster the N individuals?
- Vectorize the data and apply vector methods?
- Pairwise distances between sets of sequences?
- A parameter estimate for each individual, then cluster the estimates?

Slides 60-61: Mixtures of {Sequences, Curves, ...}. Generative model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
[Note: given p(D_i | c_k), we can define an EM algorithm.]

Slide 62: Application 1: Web Log Visualization (Cadez, Heckerman, Meek, Smyth, KDD 2000)
- MSNBC Web logs: 2 million individuals per day; different session lengths per individual; a difficult visualization and clustering problem
- WebCanvas: uses mixtures of SFSMs to cluster individuals based on their observed sequences; a software tool combining EM mixture modeling with visualization

Slide 63: (figure-only slide)

Slide 64: Example: Mixtures of SFSMs
- A simple model for traversal on a Web site (equivalent to a first-order Markov chain with an end state)
- A generative model for large sets of Web users: different behaviors give a mixture of SFSMs
- The EM algorithm is quite simple: weighted counts (see the sketch below)
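A rough sketch of the weighted-counts EM for a mixture of first-order Markov chains; this is a reconstruction of the general idea, not the WebCanvas code, and it omits the explicit end state for brevity:

```python
import numpy as np

def em_markov_mixture(seqs, n_states, K, n_iter=50, seed=0):
    """EM for a mixture of K first-order Markov chains.
    seqs: list of integer sequences (page-category indices in [0, n_states))."""
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)                                   # mixture weights
    init = rng.dirichlet(np.ones(n_states), size=K)               # initial-state probabilities
    trans = rng.dirichlet(np.ones(n_states), size=(K, n_states))  # transition matrices
    for _ in range(n_iter):
        # E-step: responsibility of each component for each whole sequence
        logr = np.zeros((len(seqs), K))
        for i, s in enumerate(seqs):
            s = np.asarray(s)
            for k in range(K):
                logr[i, k] = (np.log(alpha[k]) + np.log(init[k, s[0]])
                              + np.log(trans[k, s[:-1], s[1:]]).sum())
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted transition counts (with a little smoothing)
        alpha = r.mean(axis=0)
        init = np.full((K, n_states), 1e-3)
        trans = np.full((K, n_states, n_states), 1e-3)
        for i, s in enumerate(seqs):
            s = np.asarray(s)
            for k in range(K):
                init[k, s[0]] += r[i, k]
                np.add.at(trans[k], (s[:-1], s[1:]), r[i, k])
        init /= init.sum(axis=1, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)
    return alpha, init, trans
```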

Slide 65: WebCanvas (Cadez, Heckerman, et al., KDD 2000). [Screenshot.]

Slide 66: (figure-only slide)

Slide 67: Timing Results. [Figure.]

Slide 68: Transaction Data (section title)

Slide 69: Profiling for Transaction Data
- Profiling: given transactions for a set of individuals, infer a predictive model for the future transactions of individual i
- Typical applications: automated recommender systems (e.g., Amazon.com); personalization (e.g., wireless information services)
- Existing techniques (collaborative filtering, association rules) are unsuited for prediction, e.g., for including seasonality

Slide 70: Application 2: Transaction Data (Cadez, Smyth, Mannila, KDD 2001)
- Retail data set: 200,000 individuals; all market baskets over 2 years; 50 departments, 50k items
- Problem: predict what an individual will purchase in the future; want to generalize across products and to allow heterogeneity

Slide 71: Transaction Data. [Figure.]

Slide 72: Examples of Mixture Components. Components "encode" typical combinations of clothes. [Figure.]

Slide 73: Mixture-Based Profiles. [Equation, with annotations: the predictive profile for individual i combines basket models for each component k ("basis functions"), weighted by the probability that individual i engages in "behavior" k, given that they enter the store.]

Slide 74: Hierarchical Bayes Model. [Diagram: an empirical prior on the mixture weights, over individuals 1..N with varying numbers of baskets.] Individuals with little data get "shrunk" to the prior; individuals with a lot of data are more data-driven.
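Slides 73-74 together suggest the following structure; the Dirichlet-style shrinkage formula below is my assumption about how the empirical prior enters, not necessarily the exact form in the paper:

```latex
% Predictive profile for individual i: a mixture of shared basket models
p(y \mid i) = \sum_{k=1}^{K} w_{ik} \, p(y \mid \theta_k)
% Shrinkage of the individual weights toward the population prior \alpha_k,
% with prior strength \gamma (both forms assumed for illustration):
\hat{w}_{ik} = \frac{n_{ik} + \gamma \, \alpha_k}{n_i + \gamma}
```

Here n_ik is the (expected) number of individual i's baskets attributed to behavior k and n_i their total number of baskets, so a small n_i pulls w_ik toward the prior α_k, matching the shrinkage described on slide 74.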

Slides 75-79: Data and Profile Example. [Figures, built up over five slides; the final slide notes: no training data for 14, no purchases above Dept 25.]

Slides 80-81: (figure-only slides)

Slide 82: Transaction Mixtures
- Mixture-based profiles: interpretable and flexible; more accurate than non-mixture approaches; training time linear in the number of items
- Applications: early detection of high-value customers; visualization and exploration; forecasting customer behavior

Slide 83: Extensions of Mixtures (section title)

Slides 84-85: Extensions: Multiple Causes
- Single "cause" variable: [diagram: C with child X]
- Multiple causes ("factors"): [diagram: C1, C2, C3 with common child X]; examples include Hofmann (1999, 2000) for text and ICA for signals

Slides 86-89: Extensions: High Dimensions. Density estimation is difficult in high dimensions.
- Global PCA
- Mixtures of PCA (Tipping and Bishop, 1999)

Slides 90-91: Extensions: Predictive Mixtures
- Standard mixture model
- Conditional mixtures, with input-dependent weights: e.g., mixtures of experts, Jordan and Jacobs (1994)

Slide 92: Extensions: Learning Algorithms
- Fast algorithms: kd-trees (Moore, 1998); caching (Bradley et al.); random projections (Dasgupta, 2000)
- Mean-squared-error criteria: Scott (2000)
- Bayesian techniques: reversible-jump MCMC (Green et al.)

Slide 93: Classic References
- Titterington, Smith, and Makov, Statistical Analysis of Finite Mixture Distributions, Wiley, 1985
- McLachlan and Peel, Finite Mixture Models, Wiley, 2000

Slide 94: Conclusions
- Mixtures: a flexible "tool" in the machine learner's toolbox
- Beyond mixtures of Gaussians: mixtures of sequences, curves, trees, etc.
- Applications: numerous and broad

Slides 95-96: (figure-only slides)

Slide 97: Concavity of Likelihood (Cadez and Smyth, NIPS 2000). [Figure.]

Slide 98: Application 3: Model Averaging. Bayesian model averaging for p(x): since we don't know which model (if any) is the true one, average out this uncertainty. [Equation: the prediction of x given data D sums each model k's prediction times that model's weight, i.e., p(x | D) = Σ_k p(x | M_k, D) p(M_k | D).]

Slide 99: Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
- Simple idea: use cross-validation to estimate the weights, rather than a Bayesian scheme
- Two-phase learning: 1. learn each model M_k on D_train; 2. learn the mixture weights on D_validation (the components are fixed; EM just learns the weights, as sketched below)
- Outperforms any single model-selection technique; even outperforms "cheating"
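A small sketch of phase 2: once each component model's predictive density has been evaluated on the validation points, EM with fixed components reduces to re-estimating only the weights (the names here are mine):

```python
import numpy as np

def stack_weights(P, n_iter=200):
    """Learn stacking weights by EM with the component models held fixed.
    P: array of shape (N, K), where P[n, k] = p(x_n | model k) on validation data."""
    N, K = P.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        resp = w * P                            # E-step: unnormalized responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                   # M-step: update the weights only
    return w
```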

Slides 100-102: Query Approximation (Pavlov and Smyth, KDD 2001). [Diagram: a query generator issues queries; approximate models stand between it and a large database.]
- Construct probability models offline, e.g., mixtures, belief networks, etc.
- Provide fast approximate query answers online

Slide 103: Stacking for Query Model Combining. [Figure: conjunctive queries on Microsoft Web data, 32k records, 294 attributes; data available online at the UCI KDD Archive.]

Slide 104: [Figure: sparse binary data matrix (rows = observation vectors, columns = attributes).]

Slide 105: [Figure: the same matrix with component labels C1, C2 treated as missing.]

Slide 106: E-step: estimate the component membership probabilities P(C1 | x_i), P(C2 | x_i), given the current parameter estimates.

Slide 107: M-step: use the "fractionally" weighted data to get new estimates of the parameters.
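Putting slides 104-107 into code: a sketch of one EM iteration for a mixture of independent Bernoullis over binary attribute vectors (the smoothing constant is my addition):

```python
import numpy as np

def em_step_bernoulli(X, weights, thetas, eps=1e-6):
    """One EM iteration for a K-component mixture of independent Bernoullis.
    X: (N, d) binary matrix; weights: (K,); thetas: (K, d)."""
    # E-step: P(C_k | x_n) from the current parameters
    logp = (X @ np.log(thetas.T + eps)
            + (1 - X) @ np.log(1 - thetas.T + eps)
            + np.log(weights + eps))            # shape (N, K)
    logp -= logp.max(axis=1, keepdims=True)
    resp = np.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: fractionally weighted counts give the new parameters
    nk = resp.sum(axis=0)
    new_thetas = (resp.T @ X) / nk[:, None]
    new_weights = nk / len(X)
    return new_weights, new_thetas
```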

