
Slide 1: A Guided Tour of Finite Mixture Models: From Pearson to the Web. ICML '01 Keynote Talk, Williams College, MA, June 29, 2001. Padhraic Smyth, Information and Computer Science, University of California, Irvine. www.datalab.uci.edu

Slide 2: Outline
- What are mixture models? Definitions and examples
- How can we learn mixture models? A brief history and illustration
- What are mixture models useful for? Applications in Web and transaction data
- Recent research in mixtures

Slide 3: Acknowledgements
- Students: Igor Cadez, Scott Gaffney, Xianping Ge, Dima Pavlov
- Collaborators: David Heckerman, Chris Meek, Heikki Mannila, Christine McLaren, Geoff McLachlan, David Wolpert
- Funding: NSF, NIH, NIST, KLA-Tencor, UCI Cancer Center, Microsoft Research, IBM Research, HNC Software

Slides 4-7: Finite Mixture Models (section title, built up over four slides)

Slide 8: Finite Mixture Models. [Equation, with annotations: weight α_k, component model p_k, parameters θ_k.]
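Reconstructed from the slide's annotations, the general finite mixture density has the standard form (notation mine):

```latex
% Finite mixture density: K weighted component models
p(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, p_k(x \mid \theta_k),
\qquad \alpha_k \ge 0, \quad \sum_{k=1}^{K} \alpha_k = 1
```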

Slides 9-11: Example: Mixture of Gaussians
- Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance "shape" Σ_k
- E.g., K=2, 1-dim: θ = {μ1, σ1, μ2, σ2, α}
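A minimal numeric sketch of this K=2, 1-dim case; the parameter values below are invented for illustration, not taken from the talk:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, alpha, mu1, sigma1, mu2, sigma2):
    """Two-component 1-dim Gaussian mixture: alpha N(mu1, s1^2) + (1 - alpha) N(mu2, s2^2)."""
    return alpha * gaussian_pdf(x, mu1, sigma1) + (1 - alpha) * gaussian_pdf(x, mu2, sigma2)

# theta = {mu1, sigma1, mu2, sigma2, alpha}, illustrative values only
x = np.linspace(-4.0, 8.0, 200)
p = mixture_pdf(x, alpha=0.3, mu1=0.0, sigma1=1.0, mu2=4.0, sigma2=1.5)
```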

Slides 12-14: (figure-only slides)

Slides 15-17: Example: Mixture of Naïve Bayes. Conditional independence model for each component (often quite useful to first order).
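To make "conditional independence within each component" concrete, here is a sketch for binary data, where each component is a product of independent Bernoullis; the Bernoulli choice is an assumption, picked to match the term-document example on the next slides:

```python
import numpy as np

def component_loglik(x, theta_k):
    """log p(x | component k) under conditional independence:
    each binary attribute x[j] is an independent Bernoulli(theta_k[j])."""
    return np.sum(x * np.log(theta_k) + (1 - x) * np.log(1 - theta_k))

def mixture_loglik(x, weights, thetas):
    """log p(x) = log sum_k alpha_k * prod_j p(x_j | theta_k)."""
    logs = [np.log(w) + component_loglik(x, t) for w, t in zip(weights, thetas)]
    m = max(logs)  # log-sum-exp for numerical stability
    return m + np.log(sum(np.exp(l - m) for l in logs))
```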

Slides 18-19: Mixtures of Naïve Bayes. [Figure: sparse binary term-document matrix (rows = documents, columns = terms), with the rows regrouped into Component 1 and Component 2.]

Slide 20: Other Component Models
- Mixtures of rectangles: Pelleg and Moore (ICML, 2001)
- Mixtures of trees: Meila and Jordan (2000)
- Mixtures of curves: Quandt and Ramsey (1978)
- Mixtures of sequences: Poulsen (1990)

Slides 21-23: Interpretation of Mixtures
1. C has a direct (physical) interpretation, e.g., C = age of fish, or C = {male, female}
2. C might have an interpretation, e.g., clusters of Web surfers
3. C is just a convenient latent variable, e.g., for flexible density estimation

Slides 24-26: Graphical Models for Mixtures. E.g., mixtures of Naïve Bayes: [diagram: node C (discrete, hidden) with arrows to X_1, X_2, X_3 (observed)].

Slides 27-28: Sequential Mixtures. [Diagram: C with children X_1, X_2, X_3, replicated at times t-1, t, t+1, with C linked across time.]
- Markov mixtures = C has Markov dependence = hidden Markov model (here with naïve Bayes observations)
- C is a discrete state that couples the observables through time

Slide 29: Dynamic Mixtures
- Computer vision: mixtures of Kalman filters for tracking
- Atmospheric science: mixtures of curves and dynamical models for cyclones
- Economics: mixtures of switching regressions for the US economy

Slide 30: Limitations of Mixtures
- Discrete state space: not always appropriate, e.g., in modeling dynamical systems
- Training: no closed-form solution, can be tricky
- Interpretability: many different mixture solutions may explain the same data

Slide 31: Learning of Mixture Models (section title)

Slide 32: Learning Mixtures from Data. Consider fixed K, e.g., K=2, 1-dim, with unknown parameters θ = {μ1, σ1, μ2, σ2, α}. Given data D = {x_1, ..., x_N}, we want to find the parameters θ that "best fit" the data.

Slide 33: Early Attempts. Weldon's data, 1893: n = 1000 crabs from the Bay of Naples; ratio of forehead to body length; suspected existence of 2 separate species.

Slide 34: Early Attempts. Karl Pearson, 1894: JRSS paper proposing a mixture of 2 Gaussians with 5 parameters θ = {μ1, σ1, μ2, σ2, α}; parameter estimation by the method of moments, which involved solving an equation of the 9th degree! (See Chapter 10 of Stigler (1986), The History of Statistics.)

Slide 35: "The solution of an equation of the ninth degree, where almost all powers, to the ninth, of the unknown quantity are existing, is, however, a very laborious task. Mr. Pearson has indeed possessed the energy to perform his heroic task... But I fear he will have few successors..." Charlier (1906)

Slides 36-37: Maximum Likelihood Principle. Fisher, 1922:
- assume a probabilistic model
- likelihood = p(data | parameters, model)
- find the parameters that make the data most likely

Slide 38: 1977: The EM Algorithm. Dempster, Laird, and Rubin: a general framework for likelihood-based parameter estimation with missing data.
- Start with initial guesses of the parameters
- E-step: estimate memberships given the parameters
- M-step: estimate parameters given the memberships
- Repeat until convergence
- Converges to a (local) maximum of the likelihood
- The E-step and M-step are often computationally simple
- Generalizes to maximum a posteriori estimation (with priors)
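A minimal sketch of the E/M loop for the two-Gaussian, 1-dim case from the earlier slides; the initialization and variable names are mine, not from the talk:

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for a 2-component 1-dim Gaussian mixture; returns (alpha, mus, sigmas)."""
    alpha = 0.5                                   # crude initial guesses
    mus = np.array([x.min(), x.max()], dtype=float)
    sigmas = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: membership probabilities given the current parameters
        w = np.array([alpha, 1.0 - alpha])
        dens = np.stack([
            np.exp(-0.5 * ((x - mus[k]) / sigmas[k]) ** 2) / (sigmas[k] * np.sqrt(2 * np.pi))
            for k in range(2)
        ])                                        # shape (2, N)
        resp = w[:, None] * dens
        resp /= resp.sum(axis=0)
        # M-step: weighted maximum-likelihood estimates given memberships
        nk = resp.sum(axis=1)
        mus = (resp @ x) / nk
        sigmas = np.sqrt((resp * (x - mus[:, None]) ** 2).sum(axis=1) / nk)
        alpha = nk[0] / len(x)
    return alpha, mus, sigmas
```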

Slide 39: (figure-only slide)

Slide 40: Example of a Log-Likelihood Surface. [Surface plot; axes: mean 2 vs. log scale for sigma 2.]

Slides 41-49: (figure-only slides)

Slide 50: [Figure: Anemia Group vs. Control Group.]

Slide 51: Data for an Individual Patient. [Figure: healthy state vs. anemic state.] (Cadez et al., Machine Learning, in press)

Slide 52: Alternatives to EM
- Method of moments: EM is more efficient
- Direct optimization, e.g., gradient descent, Newton methods: EM is simpler to implement
- Sampling (e.g., MCMC)
- Minimum distance, e.g., [equation on slide not recovered]

Slide 53: How Many Components? Two general approaches:
1. Best density estimator, e.g., what predicts best on new data
2. True number of components: cannot be answered from data alone

Slide 54: [Diagram: the data-generating process ("truth") vs. the K=2 model class, and the "closest" model in the class in terms of KL distance.]

Slide 55: [Diagram: the data-generating process ("truth") vs. the K=10 model class.]

Slide 56: Prescriptions for Model Selection. Minimize distance to "truth":
- Maximize the predictive "logp score": gives an estimate of KL(model, truth); pick the model that predicts best on validation data
- Bayesian techniques: p(K | data) is impossible to compute exactly; closed-form approximations (BIC, AutoClass, etc.); sampling / Monte Carlo techniques, quite tricky for mixtures
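One way to implement the "predict best on validation data" prescription, sketched here with scikit-learn's GaussianMixture; the data split and candidate range are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pick_k_by_validation(x_train, x_val, k_max=10):
    """Pick K that maximizes the held-out mean log-likelihood (the "logp score").
    x_train, x_val: arrays of shape (n_samples, n_features)."""
    best_k, best_score = 1, -np.inf
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(x_train)
        score = gm.score(x_val)  # mean per-sample log-likelihood on validation data
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```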

Slide 57: Mixture Model Applications (section title)

Slide 58: What Are Mixtures Used For?
- Modeling heterogeneity, e.g., inferring multiple species in biology
- Handling missing data, e.g., variables and cases missing during model-building
- Density estimation, e.g., as flexible priors in Bayesian statistics
- Clustering: components as clusters
- Model averaging: combining density models

Slide 59: Mixtures of Non-Vector Data. Example: N individuals, with sets of sequences for each, e.g., Web session data. How to cluster the N individuals?
- Vectorize the data and apply vector methods?
- Pairwise distances between sets of sequences?
- A parameter estimate for each individual, then cluster the estimates?

Slides 60-61: Mixtures of {Sequences, Curves, ...}. Generative model:
- select a component c_k for individual i
- generate data according to p(D_i | c_k)
- p(D_i | c_k) can be very general, e.g., sets of sequences, spatial patterns, etc.
[Note: given p(D_i | c_k), we can define an EM algorithm.]

Slide 62: Application 1: Web Log Visualization (Cadez, Heckerman, Meek, Smyth, KDD 2000)
- MSNBC Web logs: 2 million individuals per day; different session lengths per individual; a difficult visualization and clustering problem
- WebCanvas: uses mixtures of SFSMs to cluster individuals based on their observed sequences; a software tool combining EM mixture modeling with visualization

Slide 63: (figure-only slide)

Slide 64: Example: Mixtures of SFSMs
- A simple model for traversal on a Web site (equivalent to a first-order Markov chain with an end state)
- A generative model for large sets of Web users: different behaviors give a mixture of SFSMs
- The EM algorithm is quite simple: weighted counts (see the sketch below)
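A rough sketch of the weighted-counts EM for a mixture of first-order Markov chains; this is a reconstruction of the general idea, not the WebCanvas code, and it omits the explicit end state for brevity:

```python
import numpy as np

def em_markov_mixture(seqs, n_states, K, n_iter=50, seed=0):
    """EM for a mixture of K first-order Markov chains.
    seqs: list of integer sequences (page-category indices in [0, n_states))."""
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)                                   # mixture weights
    init = rng.dirichlet(np.ones(n_states), size=K)               # initial-state probabilities
    trans = rng.dirichlet(np.ones(n_states), size=(K, n_states))  # transition matrices
    for _ in range(n_iter):
        # E-step: responsibility of each component for each whole sequence
        logr = np.zeros((len(seqs), K))
        for i, s in enumerate(seqs):
            s = np.asarray(s)
            for k in range(K):
                logr[i, k] = (np.log(alpha[k]) + np.log(init[k, s[0]])
                              + np.log(trans[k, s[:-1], s[1:]]).sum())
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted transition counts (with a little smoothing)
        alpha = r.mean(axis=0)
        init = np.full((K, n_states), 1e-3)
        trans = np.full((K, n_states, n_states), 1e-3)
        for i, s in enumerate(seqs):
            s = np.asarray(s)
            for k in range(K):
                init[k, s[0]] += r[i, k]
                np.add.at(trans[k], (s[:-1], s[1:]), r[i, k])
        init /= init.sum(axis=1, keepdims=True)
        trans /= trans.sum(axis=2, keepdims=True)
    return alpha, init, trans
```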

Slide 65: WebCanvas (Cadez, Heckerman, et al., KDD 2000). [Screenshot.]

Slide 66: (figure-only slide)

Slide 67: Timing Results. [Figure.]

Slide 68: Transaction Data (section title)

Slide 69: Profiling for Transaction Data
- Profiling: given transactions for a set of individuals, infer a predictive model for the future transactions of individual i
- Typical applications: automated recommender systems (e.g., Amazon.com); personalization (e.g., wireless information services)
- Existing techniques (collaborative filtering, association rules) are unsuited for prediction, e.g., for including seasonality

Slide 70: Application 2: Transaction Data (Cadez, Smyth, Mannila, KDD 2001)
- Retail data set: 200,000 individuals; all market baskets over 2 years; 50 departments, 50k items
- Problem: predict what an individual will purchase in the future; want to generalize across products and to allow heterogeneity

Slide 71: Transaction Data. [Figure.]

Slide 72: Examples of Mixture Components. Components "encode" typical combinations of clothes. [Figure.]

Slide 73: Mixture-Based Profiles. [Equation, with annotations: the predictive profile for individual i combines basket models for each component k ("basis functions"), weighted by the probability that individual i engages in "behavior" k, given that they enter the store.]

Slide 74: Hierarchical Bayes Model. [Diagram: an empirical prior on the mixture weights, over individuals 1..N with varying numbers of baskets.] Individuals with little data get "shrunk" to the prior; individuals with a lot of data are more data-driven.
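Slides 73-74 together suggest the following structure; the Dirichlet-style shrinkage formula below is my assumption about how the empirical prior enters, not necessarily the exact form in the paper:

```latex
% Predictive profile for individual i: a mixture of shared basket models
p(y \mid i) = \sum_{k=1}^{K} w_{ik} \, p(y \mid \theta_k)
% Shrinkage of the individual weights toward the population prior \alpha_k,
% with prior strength \gamma (both forms assumed for illustration):
\hat{w}_{ik} = \frac{n_{ik} + \gamma \, \alpha_k}{n_i + \gamma}
```

Here n_ik is the (expected) number of individual i's baskets attributed to behavior k and n_i their total number of baskets, so a small n_i pulls w_ik toward the prior α_k, matching the shrinkage described on slide 74.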

Slides 75-79: Data and Profile Example. [Figures, built up over five slides; the final slide notes: no training data for 14, no purchases above Dept 25.]

Slides 80-81: (figure-only slides)

Slide 82: Transaction Mixtures
- Mixture-based profiles: interpretable and flexible; more accurate than non-mixture approaches; training time linear in the number of items
- Applications: early detection of high-value customers; visualization and exploration; forecasting customer behavior

Slide 83: Extensions of Mixtures (section title)

Slides 84-85: Extensions: Multiple Causes
- Single "cause" variable: [diagram: C with child X]
- Multiple causes ("factors"): [diagram: C1, C2, C3 with common child X]; examples include Hofmann (1999, 2000) for text and ICA for signals

Slides 86-89: Extensions: High Dimensions. Density estimation is difficult in high dimensions.
- Global PCA
- Mixtures of PCA (Tipping and Bishop, 1999)

Slides 90-91: Extensions: Predictive Mixtures
- Standard mixture model
- Conditional mixtures, with input-dependent weights: e.g., mixtures of experts, Jordan and Jacobs (1994)

Slide 92: Extensions: Learning Algorithms
- Fast algorithms: kd-trees (Moore, 1998); caching (Bradley et al.); random projections (Dasgupta, 2000)
- Mean-squared-error criteria: Scott (2000)
- Bayesian techniques: reversible-jump MCMC (Green et al.)

Slide 93: Classic References
- Titterington, Smith, and Makov, Statistical Analysis of Finite Mixture Distributions, Wiley, 1985
- McLachlan and Peel, Finite Mixture Models, Wiley, 2000

Slide 94: Conclusions
- Mixtures: a flexible "tool" in the machine learner's toolbox
- Beyond mixtures of Gaussians: mixtures of sequences, curves, trees, etc.
- Applications: numerous and broad

Slides 95-96: (figure-only slides)

Slide 97: Concavity of Likelihood (Cadez and Smyth, NIPS 2000). [Figure.]

Slide 98: Application 3: Model Averaging. Bayesian model averaging for p(x): since we don't know which model (if any) is the true one, average out this uncertainty. [Equation: the prediction of x given data D sums each model k's prediction times that model's weight, i.e., p(x | D) = Σ_k p(x | M_k, D) p(M_k | D).]

Slide 99: Stacked Mixtures (Smyth and Wolpert, Machine Learning, 1999)
- Simple idea: use cross-validation to estimate the weights, rather than a Bayesian scheme
- Two-phase learning: 1. learn each model M_k on D_train; 2. learn the mixture weights on D_validation (the components are fixed; EM just learns the weights, as sketched below)
- Outperforms any single model-selection technique; even outperforms "cheating"
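A small sketch of phase 2: once each component model's predictive density has been evaluated on the validation points, EM with fixed components reduces to re-estimating only the weights (the names here are mine):

```python
import numpy as np

def stack_weights(P, n_iter=200):
    """Learn stacking weights by EM with the component models held fixed.
    P: array of shape (N, K), where P[n, k] = p(x_n | model k) on validation data."""
    N, K = P.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        resp = w * P                            # E-step: unnormalized responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                   # M-step: update the weights only
    return w
```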

Slides 100-102: Query Approximation (Pavlov and Smyth, KDD 2001). [Diagram: a query generator issues queries; approximate models stand between it and a large database.]
- Construct probability models offline, e.g., mixtures, belief networks, etc.
- Provide fast approximate query answers online

Slide 103: Stacking for Query Model Combining. [Figure: conjunctive queries on Microsoft Web data, 32k records, 294 attributes; data available online at the UCI KDD Archive.]

Slide 104: [Figure: sparse binary data matrix (rows = observation vectors, columns = attributes).]

Slide 105: [Figure: the same matrix with component labels C1, C2 treated as missing.]

Slide 106: E-step: estimate the component membership probabilities P(C1 | x_i), P(C2 | x_i), given the current parameter estimates.

Slide 107: M-step: use the "fractionally" weighted data to get new estimates of the parameters.
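Putting slides 104-107 into code: a sketch of one EM iteration for a mixture of independent Bernoullis over binary attribute vectors (the smoothing constant is my addition):

```python
import numpy as np

def em_step_bernoulli(X, weights, thetas, eps=1e-6):
    """One EM iteration for a K-component mixture of independent Bernoullis.
    X: (N, d) binary matrix; weights: (K,); thetas: (K, d)."""
    # E-step: P(C_k | x_n) from the current parameters
    logp = (X @ np.log(thetas.T + eps)
            + (1 - X) @ np.log(1 - thetas.T + eps)
            + np.log(weights + eps))            # shape (N, K)
    logp -= logp.max(axis=1, keepdims=True)
    resp = np.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: fractionally weighted counts give the new parameters
    nk = resp.sum(axis=0)
    new_thetas = (resp.T @ X) / nk[:, None]
    new_weights = nk / len(X)
    return new_weights, new_thetas
```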

