A Novel Approach for Imputation of Missing Values for Mining Medical Datasets

IEEE International Conference on Computational Intelligence and Computing Research (2015 IEEE ICCIC), Madurai

UshaRani .Y, Dept of Information Technology, VNR Vignana Jyothi Institute of Engineering & Technology, Hyderabad.

Abstract
Real-world medical datasets predominantly contain numeric (continuous) and categorical (nominal) attributes with missing values. In this paper, we propose a novel imputation approach for fixing missing values. The approach is based on clustering and aims at dimensionality reduction of the records. The same lower-dimensional records can then be used for clustering and classification of medical records to arrive at accurate decision predictions.

Missing values can also cause misleading results.
Missing values usually arise for reasons such as: the value was lost (deleted or erased), not recorded, measured incorrectly, or affected by equipment errors. Missing values (MV) also affect the classification accuracy of classifiers. To overcome these issues, we propose a novel imputation approach for fixing missing values. The approach is based on clustering and aims at dimensionality reduction of the records.

RESEARCH ISSUES IN MINING MEDICAL DATA
- Handling medical datasets
- Handling and imputing missing values
- Choice of prediction and classification algorithms
- Finding the nearest medical record and identifying the class label
- Deciding on medical attributes
- Removal of noise

Importance of Present Approach
- The method may be used to find missing attribute values from medical records.
- The same approach for finding missing values may be used to classify medical records.
- Disease prediction may be achieved using the proposed approach without the need to adopt a separate procedure.
- Handles all attribute types.
- Preserves attribute information.
- May be applied to datasets with and without class labels, which is the unique feature of the current approach.

Proposed Approach
Fig 1. Generating clusters from medical records in group G1, equal in number to the class labels in G

The framework for missing value imputation consists of the following steps.
1. Generating Clusters from Group G1
This step involves finding the number of class labels and generating a number of clusters equal to the number of class labels. The clusters may be generated using the k-means algorithm by setting k to the number of class labels. Alternatively, any clustering algorithm that can generate k clusters may be applied.
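The clustering step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `kmeans`, the sample records, the class labels, and the choice to seed the centers with the first k records are all assumptions made here so that the run is deterministic.

```python
import math

def kmeans(records, k, iterations=100):
    """Plain k-means over numeric records; returns (centers, assignments)."""
    # Assumption: the first k records seed the centers (deterministic run).
    centers = [list(r) for r in records[:k]]
    assignments = [0] * len(records)
    for _ in range(iterations):
        # Assign every record to its closest center (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(r, centers[c]))
            for r in records
        ]
        # Recompute each center as the per-attribute mean of its members.
        for c in range(k):
            members = [r for r, a in zip(records, new_assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return centers, assignments

# Hypothetical complete records (group G1) and their class labels.
records = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
labels = ["yes", "no", "yes", "no"]

k = len(set(labels))  # number of clusters = number of class labels
centers, assignments = kmeans(records, k)
```

Any other clustering routine that accepts a target number of clusters could be substituted for `kmeans` here, as the step itself allows.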

2. Computing distance of normal records to Cluster Centers
- Obtain the mean of each cluster. This is the cluster center.
- Obtain the distance of each medical record to each cluster center.
- Sum all distances obtained.
The result is that every medical record is mapped to a single value, achieving dimensionality reduction.
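The mapping of a record to a single value can be sketched as follows. The names `squared_distance` and `map_record` and the sample centers are hypothetical, and the distance is assumed here to be the sum of squared attribute differences; another distance measure could be plugged in the same way.

```python
def squared_distance(record, center):
    """Sum of squared attribute differences between a record and a center."""
    return sum((x - c) ** 2 for x, c in zip(record, center))

def map_record(record, centers):
    """Map a record to one value: the sum of its distances to all centers."""
    return sum(squared_distance(record, center) for center in centers)

# Hypothetical cluster centers and one complete record.
centers = [[0.0, 0.0], [1.0, 1.0]]
record = [1.0, 0.0]

mapped = map_record(record, centers)  # (1 + 0) + (0 + 1) = 2.0
```

Each n-attribute record is thereby reduced to one number, which is what the step calls dimensionality reduction.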

Fig 2. Computation of distance of medical record, R1 to each cluster center from clusters formed

3. Computing distance of missing records to Cluster Centers
- Obtain the distance of each medical record having missing values to each cluster center, discarding the attributes with missing values.
- Sum all distances obtained.
The result is that every medical record is mapped to a single value, achieving dimensionality reduction.

4. Find Nearest Record to Impute Missing Values
- Consider each missing record in group G2 one by one.
- Find the distance of this record to all the records in group G1. The record at minimal distance is the nearest neighbor.
- Impute the missing attribute value with the corresponding attribute value of the nearest record in that class.
- If there is more than one nearest neighbor, the frequency of values may also be considered for imputation.
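Steps 3 and 4 can be sketched together as below. Several choices here are assumptions rather than details confirmed by the slides: missing values are represented by `None`, the distance is the sum of squared attribute differences over observed attributes only, and "distance between records" is taken to be the absolute difference of their single mapped values. The frequency-based tie-break is not implemented; on a tie, the first nearest record is used.

```python
MISSING = None  # assumed marker for a missing attribute value

def observed_distance(record, center):
    """Squared distance to a center, skipping attributes that are missing."""
    return sum((x - c) ** 2 for x, c in zip(record, center) if x is not MISSING)

def map_value(record, centers):
    """Single mapped value: sum of distances to every cluster center."""
    return sum(observed_distance(record, center) for center in centers)

def impute(record, complete_records, centers):
    """Fill missing attributes from the complete record with the closest mapped value."""
    target = map_value(record, centers)
    nearest = min(
        complete_records,
        key=lambda r: abs(map_value(r, centers) - target),
    )
    return [n if x is MISSING else x for x, n in zip(record, nearest)]

# Hypothetical data: two cluster centers, two complete records (G1),
# and one record with a missing second attribute (G2).
centers = [[0.0, 0.0], [1.0, 1.0]]
g1 = [[0.0, 0.5], [1.0, 0.25]]
incomplete = [0.5, MISSING]

filled = impute(incomplete, g1, centers)
```

With these numbers the mapped values are 1.5 and 1.625 for the two complete records and 0.5 for the incomplete one, so the first complete record is nearest and supplies the value 0.5.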

Proposed Algorithm
Input: Medical Records with Missing Values
Output: Imputation of Missing Values
Notations adopted:
- $R_i$: $i^{th}$ medical record
- $R_i(A_k)$: $k^{th}$ attribute value of the $i^{th}$ medical record
- $G_c$: $c^{th}$ group
- $i, k$: index of medical records and attributes
- $\emptyset$: missing record or empty record value
- $c$: number of decision classes in the medical dataset
- $D_d$: $d^{th}$ decision class
- $m$: total number of medical records
- $n$: number of attributes in each record
- $\mu_d$: cluster center of the $d^{th}$ cluster
- $\mu_{dn}$: mean value of the $n^{th}$ attribute
- $h$: number of records in group $G_2$
- $z$: number of records in group $G_1$, equal to $(m-h)$
Begin of Algorithm Procedure

Step-1: Read Medical Dataset
Read the medical dataset consisting of medical records. Find records with and without missing values. Classify the records into two groups, say $G_1$ and $G_2$. The first group, $G_1$, is the set of all medical records with no missing values. The second group, $G_2$, is the set of all medical records having missing values.

$$G_1 = U \{ R_i \mid R_i(A_k) \neq \emptyset,\ \forall\, i, k \} \quad (3)$$

$$G_2 = U \{ R_i \mid R_i(A_k) = \emptyset,\ \exists\, i, k \} \quad (4)$$

where $i \in (1, m-h)$ and $k \in (1, n)$. We may consider group $G_1$ as the training set of medical records, while group $G_2$ is considered as the testing set in this case.

Step-2: Cluster Medical Records with No Missing Values
Let $g = |D_d|$ be the number of decision classes. Determine the maximum number of decision classes available in the medical dataset being considered. Cluster the medical records in group $G_1$ into a number of clusters equal to $g$, i.e. $|D_d|$.
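The partition of Step-1 and the cluster count of Step-2 can be sketched as follows. The function name `split_records`, the use of `None` as the missing-value marker, and the sample data are assumptions made for illustration.

```python
MISSING = None  # assumed marker for a missing attribute value

def split_records(records):
    """Split records into G1 (no missing values) and G2 (at least one missing)."""
    g1 = [r for r in records if all(v is not MISSING for v in r)]
    g2 = [r for r in records if any(v is MISSING for v in r)]
    return g1, g2

# Hypothetical dataset: m = 4 records with n = 2 attributes each.
records = [[0.1, 0.2], [0.9, MISSING], [0.2, 0.1], [MISSING, 0.7]]
decision_classes = ["yes", "no", "yes", "no"]

g1, g2 = split_records(records)
h = len(g2)                      # records with missing values
z = len(g1)                      # records without missing values, m - h
g = len(set(decision_classes))   # number of clusters to request in Step-2
```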

This may be achieved using the K-means clustering algorithm, because K-means requires the number of clusters to be specified before the clustering process is carried out. The output of Step-2 is a set of clusters, i.e. the number of output clusters is equal to 'g'. This is shown in Fig-1, where a set of medical records represented by $G_1$ is clustered into 'd' clusters.

Step-3: Obtain Cluster Center for each Cluster formed
This step involves finding the cluster center of each cluster generated by the k-means clustering algorithm. We can obtain the cluster center by finding the mean of each attribute from the attribute set, $A_k$, of medical attributes.

Let cluster $C_d$ denote the $d^{th}$ cluster, having the records R1, R6, R8 and R9 with a single attribute. Then the cluster center is given by

$$\mu_d = \frac{R_1(A_1) + R_6(A_1) + R_8(A_1) + R_9(A_1)}{4} \quad (5)$$

In general, the cluster center of the $g^{th}$ cluster may be obtained using the generalized equation given below.

$$\mu_g = U_k \left[ \frac{\left\{ \sum R_{lk} \mid l \in \{1, q\} \text{ for each } k \in \{1, n\} \right\}}{|l|} \right] \quad (6)$$

$\mu_g$ is hence a sequence of 'n' values indicating the cluster center over 'n' attributes. The notation $U_k$ is used to denote the set of all values, each separated by a comma. The cluster center may hence be formally represented as

$$\mu_g = \langle \mu_{g1}, \mu_{g2}, \mu_{g3}, \mu_{g4}, \dots, \mu_{gn} \rangle \quad (7)$$

Here 'n' indicates the total number of attributes in each medical record and $|g|$ indicates the number of clusters.

[...] computed are summed to obtain a single distance value. This distance is called the Type-1 distance value, given by equation (8) below.

$$\mathrm{Dist}_d(R_i, \mu_d) = (R_{i1} - \mu_{d1})^2 + (R_{i2} - \mu_{d2})^2 + \dots + (R_{in} - \mu_{dn})^2 \quad (8)$$

$\forall\, i \in (1, n),\ \forall\, d$

At the end of Step-4 we have the distance value from each record, $R_i$, to each cluster center, denoted by $\mu_d$.
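The cluster center of equations (5) to (7) is a per-attribute mean, which can be sketched as below. The function name `cluster_center` and the attribute values are made up for illustration; they are not taken from the case study tables.

```python
def cluster_center(cluster):
    """Return the center of a cluster as the mean of each attribute."""
    q = len(cluster)  # number of records in the cluster
    return [sum(column) / q for column in zip(*cluster)]

# Hypothetical cluster holding four records with two attributes each,
# standing in for records such as R1, R6, R8 and R9.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

center = cluster_center(cluster)  # [16 / 4, 20 / 4]
```

The returned list has one entry per attribute, matching the form of equation (7).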

Fig 4 G1 Before Clustering
Fig 5 Before and After Clustering

Case Study
In Section V, we discuss a case study that finds missing attribute values of medical records using the proposed approach. For this, we consider a sample dataset. Table I, shown below, consists of a sample dataset of medical records having categorical and numerical values. Table II shows medical records without missing values after normalizing the sample dataset. Table III denotes records with and without missing values. Table IV denotes all records without missing values and Table V shows records with missing attribute values. Table VI depicts the clusters generated from group G1, which consists of medical records with no missing values, after applying the k-means algorithm. Two clusters are generated, C1 and C2. C1 contains the medical records {R1, R4, R6, R9} and C2 contains the medical records {R2, R7, R8}. Table VII gives the distances of records in group G1 to the cluster center of the first cluster. Similarly, Table VIII gives the distances of records in group G2 to the cluster center of the second cluster.

Table IX depicts the computed values of the mapping function for records of group G1. The mapping function $Map(R_i)$ is the mapping distance of the $i^{th}$ record, which is the sum of all distances from record $R_i$ to each of the cluster centers generated by the clustering algorithm. Table X gives the distances of medical records in group G2 to each of the cluster centers. Table XI depicts the computed values of the mapping function for medical records of group G2, which contain missing values. The mapping function $Map_r(R_j)$ is the mapping distance of the $j^{th}$ record.
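Written out, the verbal definition of the mapping function amounts to the sum below. This is a hedged reading of the text, assuming $g$ cluster centers $\mu_1, \dots, \mu_g$ and the per-cluster distance $\mathrm{Dist}_d$ used in the algorithm; the summation form itself does not appear on the slides.

```latex
Map(R_i) = \sum_{d=1}^{g} \mathrm{Dist}_d(R_i, \mu_d)
```

For a record with missing values, the same sum would be taken with each $\mathrm{Dist}_d$ restricted to the attributes that are present.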

TABLE I. NORMALIZED SAMPLE DATASET OF MEDICAL RECORDS
TABLE II. MEDICAL RECORDS WITH AND WITHOUT MISSING VALUES

TABLE VII. DISTANCE OF MEDICAL RECORDS TO CLUSTER-1
TABLE VIII. DISTANCE OF MEDICAL RECORDS TO CLUSTER-2

TABLE XIV. DISTANCE OF MEDICAL RECORD R5 TO OTHER RECORDS
TABLE XV. NEAREST MEDICAL RECORD FOR RECORD R5

Conclusion
In this paper we address the first challenge of handling missing values in medical datasets. We also show how dimensionality reduction of medical datasets may be achieved with a simple approach. We discuss a new approach to finding missing values in datasets, not addressed in the literature, that aims at a single dimension. The approach does not lose any attribute information while carrying out dimensionality reduction, which is its main strength. The proposed approach to imputing missing values in medical records is feasible for both categorical and numerical attributes, as discussed in the case study.

