
1 Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學). Advisor: Dr. Hsu. Graduate: Yu Cheng Chen. Authors: Lynette Hunt, Murray Jorgensen. Mixture model clustering for mixed data with missing information. Computational Statistics & Data Analysis, 2002.

2 Outline: Motivation; Objective; Introduction; The Mixture approach to Clustering Data; Application; Discussion; Personal Opinion

3 Motivation: Missing observations are frequently seen in data sets. Specimens may be damaged, resulting in missing measurements. An expensive test may be administered only to a random sub-sample of the items.

4 Objective: Some technique is needed when the data to be clustered are incomplete. The paper extends the mixture likelihood approach to analyse data with mixed categorical and continuous attributes where some of the data are missing at random.

5 Introduction: Data are described as 'missing at random' when the probability that a variable is missing for a particular individual may depend on the values of the observed variables, but not on the value of the missing variable. That is, the distribution of the missingness pattern does not depend on the missing values themselves.

6 Introduction: Rubin (1976) showed that the process that causes the missing data can be ignored when making likelihood-based inferences about the parameters of the data if the data are 'missing at random'. The EM algorithm of Dempster et al. is a general iterative procedure for maximum likelihood estimation in incomplete data problems. Little and Schluchter (1985) present a maximum likelihood procedure using the EM algorithm for the general location model with missing data.

7 The Mixture approach to Clustering Data: Suppose p attributes are measured on n individuals. Let x_1,…,x_n be the observed values of a random sample from a mixture of K populations in unknown proportions π_1,…,π_K. Let the density of x_i in the kth group be f_k(x_i; θ_k), where θ_k is the parameter vector for group k. Let ψ = (θ', π')', where π = (π_1,…,π_K)' and θ = (θ_1,…,θ_K)'.
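Written out, this setup corresponds to the usual finite mixture density and its log-likelihood. The block below is a sketch of that standard form in the slide's notation; the symbols f (overall density) and L (likelihood) are assumptions introduced here for illustration.

```latex
f(x_i; \psi) = \sum_{k=1}^{K} \pi_k \, f_k(x_i; \theta_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
\qquad
\log L(\psi) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f_k(x_i; \theta_k).
```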

8 The Mixture approach to Clustering Data: In the EM algorithm of Dempster et al., the 'missing' data are the unobserved indicators of group membership. Let z_i = (z_i1,…,z_iK) be the vector of indicator variables, with z_ik = 1 if x_i comes from group k and 0 otherwise, for k = 1,…,K. x_i is assigned to group k if the estimated z_ik > z_ik' for all k' ≠ k.
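The assignment rule can be illustrated on a toy example. The sketch below assumes a two-component univariate normal mixture with made-up parameters (none of the numbers come from the paper); the function names `normal_pdf` and `posterior_membership` are hypothetical. It estimates each z_ik as a posterior probability and assigns every point to the group with the largest value.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    # Univariate normal density, broadcast over arrays.
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_membership(x, pi, means, sds):
    # weighted[i, k] = pi_k * f_k(x_i); normalising each row gives the estimated z_ik.
    weighted = pi * normal_pdf(x[:, None], means, sds)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Illustrative values only (assumed, not taken from the paper).
x = np.array([-2.1, -1.9, 0.1, 2.0, 2.2])
pi = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
sds = np.array([1.0, 1.0])

z_hat = posterior_membership(x, pi, means, sds)
labels = z_hat.argmax(axis=1)  # x_i goes to the group with the largest estimated z_ik
```

Each row of `z_hat` sums to one, so the rows can be read as membership probabilities rather than hard 0/1 indicators.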

9 The Mixture approach to Clustering Data: The latent class model is a finite mixture model for data where each of the p attributes is discrete. Suppose that the jth attribute can take levels 1,…,M_j, and let λ_kjm be the probability that, for individuals from group k, the jth attribute has level m. Then the density of individual i belonging to group k is the product, over the p attributes, of the probabilities λ_kjm of the levels observed for that individual.
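In symbols, the latent class density is commonly written as below. This is a sketch of the standard form; the indicator δ_ijm is an assumed piece of notation, equal to 1 when attribute j of individual i has level m and 0 otherwise.

```latex
f_k(x_i; \theta_k) = \prod_{j=1}^{p} \prod_{m=1}^{M_j} \lambda_{kjm}^{\delta_{ijm}},
\qquad \sum_{m=1}^{M_j} \lambda_{kjm} = 1 \quad \text{for each } k, j.
```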

10 Multimix: Jorgensen and Hunt (1996) and Hunt and Jorgensen (1999) proposed a general class of mixture models to include data having both continuous and categorical attributes. The observation vector x_i is partitioned into cells x_il such that, if individual i belongs to group k, the cells are independent of one another and the density of x_i can be written as a product of the cell densities.

11 Multimix: Discrete distribution: x_il is a one-dimensional discrete attribute taking values 1,…,M_l with probabilities λ_klm. Multivariate normal distribution: x_il is a p_l-dimensional vector with an N_pl(μ_kl, Σ_kl) distribution.
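The Multimix structure described on these two slides can be sketched in formulas as follows. The number of cells L, the cell parameters θ_kl and the indicator δ_ilm are assumed notation; the forms themselves are the standard product-of-cells density, the categorical density and the multivariate normal density.

```latex
f_k(x_i; \theta_k) = \prod_{l=1}^{L} f_{kl}(x_{il}; \theta_{kl}),
\qquad
f_{kl}(x_{il}; \theta_{kl}) =
\begin{cases}
\displaystyle \prod_{m=1}^{M_l} \lambda_{klm}^{\delta_{ilm}}
  & \text{discrete cell}, \\[2ex]
\displaystyle (2\pi)^{-p_l/2} \, |\Sigma_{kl}|^{-1/2}
  \exp\!\left\{ -\tfrac{1}{2} (x_{il} - \mu_{kl})' \Sigma_{kl}^{-1} (x_{il} - \mu_{kl}) \right\}
  & \text{multivariate normal cell}.
\end{cases}
```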

12 Graphical models: An alternative way of looking at these multivariate models is within the framework of graphical models. The graph of a model contains a vertex corresponding to each variable, and edges between vertices. The absence of an edge between two vertices indicates conditional independence of the corresponding variables. Latent class models for p variables are represented by a graph on p+1 vertices, corresponding to the p variables plus 1 categorical variable indicating the cluster.

13 Missing data: We put forward a method for mixture model clustering based on the assumption that the data are missing at random. We write the observation vector x_i in the form (x_obs,i, x_miss,i), where x_obs,i denotes the observed attributes for observation i and x_miss,i denotes the missing attributes for observation i.

14 Missing data: The E step of the EM algorithm requires the calculation of Q(ψ, ψ^(t)) = E{L_C(ψ) | x_obs; ψ^(t)}, the expectation of the complete data log-likelihood conditional on the observed data and the current value of the parameters. We calculate Q(ψ, ψ^(t)) by replacing z_ik with its conditional expectation given the observed data, that is, the posterior probability under ψ^(t) that individual i belongs to group k.
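A minimal sketch of this part of the E step, under a simplifying assumption: every group is a single multivariate normal cell (the paper's Multimix model also allows discrete cells), and missing entries are coded as NaN. The posterior membership for each individual is computed from the marginal density of the observed attributes only. The function names and all numbers are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mean, cov):
    # Log density of a multivariate normal evaluated at x.
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + quad)

def e_step_memberships(X, pi, means, covs):
    # Posterior probability of each group given only the observed attributes.
    n, K = X.shape[0], len(pi)
    log_w = np.zeros((n, K))
    for i in range(n):
        obs = ~np.isnan(X[i])  # which attributes are observed for individual i
        for k in range(K):
            log_w[i, k] = np.log(pi[k])
            if obs.any():
                # Marginal of a normal: keep the observed part of mean and covariance.
                log_w[i, k] += log_normal_pdf(
                    X[i, obs], means[k][obs], covs[k][np.ix_(obs, obs)]
                )
    log_w -= log_w.max(axis=1, keepdims=True)  # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

# Illustrative values only (assumed, not taken from the paper).
X = np.array([[-2.0, np.nan], [np.nan, 2.1], [np.nan, np.nan], [1.9, 2.0]])
pi = np.array([0.4, 0.6])
means = np.array([[-2.0, -2.0], [2.0, 2.0]])
covs = np.array([np.eye(2), np.eye(2)])

z_hat = e_step_memberships(X, pi, means, covs)
```

An individual with no observed attributes carries no information about its group, so its row of `z_hat` falls back to the mixing proportions.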

15 Missing data: The remaining calculations in the E step require the calculation of the expected value of the complete data sufficient statistics for each partition cell l.

16 Missing data: For multivariate normal partition cells: Eliminating one cluster at a time. Calculate the between-cluster entropy based on the remaining clusters.

17 Missing data: The sweep operator is useful in maximum likelihood estimation for multivariate missing data problems. We form the augmented covariance matrix A_l using the current estimates of the parameters for group k in cell l.

18 Missing data: Sweeping on the elements of A_l corresponding to the observed x_ij in cell l yields the conditional distribution of the missing x_ij' given the observed x_ij in the cell.
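The sweep step can be sketched as follows. This assumes one common layout and sign convention for the augmented matrix, with -1 in the top-left corner, the mean in the first row and column, and the covariance matrix in the remaining block; the slides do not show the layout, so treat it as an assumption. The function names and numbers are hypothetical.

```python
import numpy as np

def sweep(G, k):
    # Sweep the symmetric matrix G on pivot k; the pivot entry becomes -1/g_kk.
    G = np.asarray(G, dtype=float)
    pivot = G[k, k]
    H = G - np.outer(G[:, k], G[k, :]) / pivot
    H[:, k] = G[:, k] / pivot
    H[k, :] = G[k, :] / pivot
    H[k, k] = -1.0 / pivot
    return H

def conditional_from_sweep(mu, sigma, x, obs):
    # Build the augmented matrix [[-1, mu'], [mu, sigma]] (assumed layout).
    p = mu.size
    A = np.empty((p + 1, p + 1))
    A[0, 0] = -1.0
    A[0, 1:] = mu
    A[1:, 0] = mu
    A[1:, 1:] = sigma
    obs_idx = np.flatnonzero(obs)
    mis_idx = np.flatnonzero(~obs)
    for j in obs_idx:              # sweep on every observed attribute
        A = sweep(A, j + 1)
    rows, cols = obs_idx + 1, mis_idx + 1
    # Row 0 holds the regression intercepts, the observed rows hold the slopes.
    cond_mean = A[0, cols] + x[obs_idx] @ A[np.ix_(rows, cols)]
    cond_cov = A[np.ix_(cols, cols)]  # residual covariance of the missing part
    return cond_mean, cond_cov

# Illustrative values only (assumed, not taken from the paper).
mu = np.array([1.0, 2.0, 3.0])
sigma = np.array([[4.0, 1.0, 0.5], [1.0, 3.0, 0.2], [0.5, 0.2, 2.0]])
x = np.array([1.5, np.nan, 2.0])
obs = ~np.isnan(x)

cond_mean, cond_cov = conditional_from_sweep(mu, sigma, x, obs)
```

The result agrees with the textbook conditional normal formulas; sweeping is attractive here because the same routine serves every pattern of observed and missing attributes.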

19 Missing data: The new parameter estimates θ^(t+1) are computed from the expected complete data sufficient statistics. The M step updates the mixing proportions and the discrete distribution parameters.

20 Missing data: The multivariate normal parameters (the mean vectors and covariance matrices) are likewise updated from the expected complete data sufficient statistics.
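The M-step updates typically take the weighted-average form sketched below. This is the standard form for mixture models with missing values, stated as an assumption rather than a transcription of the slides. Here ẑ_ik is the posterior membership probability from the E step, δ_ilm indicates that the discrete attribute in cell l has level m, and E_k(·) is shorthand (assumed notation) for the expectation given x_obs,i and membership of group k under ψ^(t).

```latex
\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ik},
\qquad
\hat{\lambda}_{klm}^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ik} \, E_k(\delta_{ilm})}{\sum_{i=1}^{n} \hat{z}_{ik}},
\\[1ex]
\hat{\mu}_{kl}^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ik} \, E_k(x_{il})}{\sum_{i=1}^{n} \hat{z}_{ik}},
\qquad
\hat{\Sigma}_{kl}^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ik} \, E_k(x_{il} x_{il}')}{\sum_{i=1}^{n} \hat{z}_{ik}}
  - \hat{\mu}_{kl}^{(t+1)} \hat{\mu}_{kl}^{(t+1)\prime}.
```

When an attribute is observed, the expectation is simply the observed value; when it is missing, it is replaced by its conditional expectation from the E step.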

21 Application: Prostate cancer clinical trial data of Byar and Green (1980). The data were obtained from a randomized clinical trial comparing 4 treatments for 506 patients with prostatic cancer. There are 12 pre-trial covariates measured on each patient: 7 variables may be taken to be continuous, 4 to be discrete, and 1 variable (SG) is an index. We treat SG as a continuous variable.

22 Application: 31 individuals have at least one of the pre-trial covariates missing, giving a total of 62 missing values. As only approximately 1% of the data are missing, additional missing values were created by assigning each attribute of each individual a random digit generated from a discrete distribution on {0, 1}, with missingness probabilities of .10, .15, .20, .25 and .30, respectively.
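The way the artificial missing values are generated can be sketched as follows. The example uses synthetic normal data with the same shape as the trial data (506 by 12), not the prostate cancer data itself, and the function name `add_missing_at_rate` is hypothetical. Each entry is independently flagged as missing with the given probability.

```python
import numpy as np

def add_missing_at_rate(X, rate, rng):
    # Flag each entry as missing independently with probability `rate`.
    X = np.asarray(X, dtype=float)
    mask = rng.random(X.shape) < rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 12))  # synthetic stand-in, NOT the real trial data

missing_counts = {}
for rate in (0.10, 0.15, 0.20, 0.25, 0.30):
    X_missing, mask = add_missing_at_rate(X, rate, rng)
    missing_counts[rate] = int(mask.sum())
```

With 506 × 12 = 6072 entries, a rate of .30 gives roughly 1800 missing values on average.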

23 Application: The data set reported in detail here had 1870 values recorded as missing. The data are separated into two clusters. We regard the data as a random sample from a two-component mixture distribution.

24 Application

25 Discussion: The Multimix approach allows clustering of mixed data containing both categorical and continuous variables. The finite mixture model lends itself well to coping with missing values. The approach implemented in this paper works well for a mixed data set that had a very large amount of missing data.

26 Personal Opinion ……

