A Novel Approach for Imputation of Missing Values for Mining Medical Datasets


A Novel Approach for Imputation of Missing Values for Mining Medical Datasets. IEEE International Conference on Computational Intelligence and Computing Research (2015 IEEE ICCIC), Madurai. UshaRani Y., Dept. of Information Technology, VNR Vignana Jyothi Institute of Engineering & Technology, Hyderabad. E-mail: usharani_y@vnrvjiet.in ICIP

Abstract. Real-world medical datasets predominantly contain numeric (continuous) and categorical (nominal) attributes with missing values. In this paper, we propose a novel imputation approach for fixing missing values. The approach is based on clustering and aims at dimensionality reduction of the records; the same lower-dimensional records are then used for clustering and classification of medical records to arrive at accurate decision prediction.

Missing values usually arise because a value was lost (deleted or erased), was not recorded, was measured incorrectly, or was affected by equipment error. Missing values (MV) can cause misleading results and reduce the classification accuracy of classifiers. To overcome these issues, we propose a novel imputation approach for fixing missing values. The approach is based on clustering and aims at dimensionality reduction of the records.

Research Issues in Mining Medical Data: handling medical datasets; handling and imputing missing values; choice of prediction and classification algorithms; finding the nearest medical record and identifying the class label; deciding on medical attributes; removal of noise.

Importance of the Present Approach: the method may be used to find missing attribute values in medical records; the same approach may be used to classify medical records; disease prediction may be achieved with the proposed approach without adopting a separate procedure; it handles all attribute types; it preserves attribute information; and, uniquely, it may be applied to datasets with or without class labels.

Proposed Approach. Fig 1. Generating clusters from the medical records in group G1, with the number of clusters equal to the number of class labels in G.

The framework for missing value imputation consists of the following steps. 1. Generating clusters from group G_1: This step involves finding the number of class labels and generating a number of clusters equal to the number of class labels. The clusters may be generated using the k-means algorithm by setting k to the number of class labels. Alternatively, we may apply any clustering algorithm that can generate k clusters.
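The clustering step can be sketched in plain Python as follows. This is only an illustration of Lloyd's k-means with k set to the number of class labels; the function name `kmeans`, the seeded random initialisation, the Euclidean metric, and the toy records and labels are all assumptions, not details given in the slides.

```python
import math
import random


def kmeans(records, k, iterations=100, seed=0):
    """Lloyd's k-means over a list of equal-length numeric tuples (a sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(records, k)  # assumed initialisation: k records picked at random
    assignment = [0] * len(records)
    for _ in range(iterations):
        # Assignment step: attach every record to its closest center.
        assignment = [
            min(range(k), key=lambda d: math.dist(r, centers[d]))
            for r in records
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for d in range(k):
            members = [r for r, a in zip(records, assignment) if a == d]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[d])  # keep the center of an empty cluster
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, assignment


# Hypothetical complete records (group G1) and their class labels.
g1 = [(0.10, 0.20), (0.15, 0.10), (0.90, 0.80), (0.85, 0.95)]
class_labels = ["sick", "sick", "healthy", "healthy"]

k = len(set(class_labels))  # number of clusters = number of class labels
centers, assignment = kmeans(g1, k)
```

Any library implementation that accepts a fixed number of clusters could be substituted for this hand-written loop.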

2. Computing the distance of normal records to cluster centers: Obtain the mean of each cluster; this is the cluster center. Obtain the distance of each medical record to each cluster center. Sum all the distances obtained. The result is that every medical record is mapped to a single value, achieving dimensionality reduction.
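A minimal sketch of this mapping, assuming Euclidean distance (the slide only says "distance") and using hypothetical helper names and toy clusters:

```python
import math


def cluster_center(cluster):
    """Attribute-wise mean of the records in one cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def map_value(record, centers):
    """Sum of the distances from one record to every cluster center."""
    return sum(math.dist(record, c) for c in centers)


# Two hypothetical clusters of two-attribute records.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(3.0, 4.0), (3.0, 6.0)],
]
centers = [cluster_center(c) for c in clusters]  # (0, 1) and (3, 5)

# The record (0, 1) is 0 away from the first center and 5 away from the second.
mapped = map_value((0.0, 1.0), centers)
```

Each n-attribute record is reduced to the single number returned by `map_value`, which is the dimensionality reduction the step describes.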

Fig 2. Computation of the distance of medical record R1 to each cluster center of the clusters formed.

3. Computing the distance of missing records to cluster centers: Obtain the distance of each medical record having missing values to each cluster center, discarding the attributes whose values are missing. Sum all the distances obtained. The result is that every medical record is mapped to a single value, achieving dimensionality reduction.

4. Finding the nearest record to impute missing values: Consider each missing record in group G2 one by one. Find the distance of this record to all the records in group G1. The record at minimal distance is the nearest neighbor. Impute the missing attribute value with the corresponding attribute value of the nearest record in that class. If there is more than one nearest neighbor, the frequency of the values may also be considered for imputation.
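Steps 3 and 4 can be sketched as below. Several choices here are assumptions rather than details stated in the slides: missing values are represented by `None`, the distance is Euclidean over the observed attributes only, and "nearest" is taken to mean the smallest absolute difference between the mapped single values. Ties are resolved by taking the first match instead of by frequency, and the helper names and data are hypothetical.

```python
import math

MISSING = None  # assumed marker for a missing attribute value


def partial_distance(record, center):
    """Euclidean distance that skips attributes missing in the record."""
    return math.sqrt(
        sum((x - c) ** 2 for x, c in zip(record, center) if x is not MISSING)
    )


def map_value(record, centers):
    """Sum of the (partial) distances from a record to every cluster center."""
    return sum(partial_distance(record, c) for c in centers)


def impute(record, complete_records, centers):
    """Fill missing attributes from the complete record with the closest mapped value."""
    target = map_value(record, centers)
    nearest = min(
        complete_records,
        key=lambda r: abs(map_value(r, centers) - target),
    )
    return tuple(n if x is MISSING else x for x, n in zip(record, nearest))


centers = [(0.0, 0.0), (3.0, 0.0)]            # hypothetical cluster centers
g1 = [(0.0, 4.0), (3.0, 0.0), (6.0, 8.0)]     # complete records
incomplete = (2.0, MISSING)                   # a record from group G2

imputed = impute(incomplete, g1, centers)
```

Here the incomplete record maps to 2 + 1 = 3, which matches the mapped value of (3.0, 0.0), so its second attribute is copied and the observed first attribute is kept.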

Proposed Algorithm
Input: medical records with missing values
Output: imputation of missing values
Notations adopted:
R_i - ith medical record
R_i(A_K) - kth attribute value of the ith medical record
G_c - cth group
i, k - indices of medical records and attributes
∅ - missing record or empty record value
c - number of decision classes in the medical dataset
D_d - dth decision class
m - total number of medical records
n - number of attributes in each record
μ_d - cluster center of the dth cluster
μ_dn - mean value of the nth attribute
h - number of records in group G_2
z - number of records in group G_1, equal to (m - h)
Begin of Algorithm Procedure

Step-1: Read Medical Dataset. Read the medical dataset consisting of medical records. Find the records with and without missing values. Classify the records into two groups, G1 and G2. The first group, G1, is the set of all medical records with no missing values. The second group, G2, is the set of all medical records having missing values.

$$G_1 = \bigcup \{ R_i \mid R_i(A_K) \neq \emptyset,\ \forall\, i, k \} \quad (3)$$

$$G_2 = \bigcup \{ R_i \mid R_i(A_K) = \emptyset,\ \exists\, i, k \} \quad (4)$$

where $i \in (1, m-h)$ and $k \in (1, n)$. We may consider group G1 as the training set of medical records, while group G2 is considered as the testing set in this case.

Step-2: Cluster Medical Records with No Missing Values. Let g = |D_d| be the number of decision classes. Determine the maximum number of decision classes available in the medical dataset being considered. Cluster the medical records in group G1 into a number of clusters equal to g, i.e. |D_d|.
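Step-1 and the cluster count of Step-2 can be illustrated with a short sketch. The record identifiers, the `None` marker for a missing value, and the toy data are assumptions made for the example.

```python
MISSING = None  # assumed marker for a missing attribute value


def split_records(records):
    """Split records into G1 (no missing values) and G2 (at least one missing value)."""
    g1 = {rid: r for rid, r in records.items() if all(v is not MISSING for v in r)}
    g2 = {rid: r for rid, r in records.items() if any(v is MISSING for v in r)}
    return g1, g2


# Hypothetical records with one numeric and one categorical attribute.
records = {
    "R1": (0.2, "yes"),
    "R2": (MISSING, "no"),
    "R3": (0.7, "no"),
}
decision_classes = ["D1", "D2", "D1"]  # hypothetical class label per record

g1, g2 = split_records(records)
g = len(set(decision_classes))  # number of clusters to request in Step-2
```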

This may be achieved using the k-means clustering algorithm, because k-means requires the number of clusters to be specified before the clustering process is carried out. The output of Step-2 is a set of clusters, i.e. the number of output clusters is equal to g. This is shown in Fig 1, where the set of medical records in G1 is clustered into g clusters, indexed by d.

Step-3: Obtain Cluster Center for Each Cluster Formed. This step involves finding the cluster center of each cluster generated by the k-means clustering algorithm. We obtain the cluster center by finding the mean of each attribute in the attribute set A_K of medical attributes.

Step-4: The distances computed are summed to obtain a single distance value. This distance is called the Type-1 distance value, given by the equation below.

$$\mathrm{Dist}_d(R_i, \mu_d) = (R_{i1} - \mu_{d1})^2 + (R_{i2} - \mu_{d2})^2 + \dots + (R_{in} - \mu_{dn})^2 \quad (8)$$

$\forall\, i \in (1, n),\ \forall\, d$. At the end of Step-4 we have the distance value from each record R_i to each cluster center μ_d.

Let cluster C_d denote the dth cluster, containing the records R1, R6, R8 and R9, each with a single attribute. Then the cluster center is given by

$$\mu_d = \frac{R_1(A_1) + R_6(A_1) + R_8(A_1) + R_9(A_1)}{4} \quad (5)$$

In general, the cluster center of the gth cluster may be obtained using the generalized equation below:

$$\mu_g = \bigcup_k \left[ \frac{\left\{ \sum R_l^k \mid l \in \{1, q\} \text{ for each } k \in \{1, n\} \right\}}{|l|} \right] \quad (6)$$

μ_g is hence a sequence of n values giving the cluster center over n attributes. The notation $\bigcup_k$ denotes the set of all values, each separated by a comma. The cluster center may hence be formally represented as

$$\mu_g = \langle \mu_{g1}, \mu_{g2}, \mu_{g3}, \mu_{g4}, \dots, \mu_{gn} \rangle \quad (7)$$

Here n indicates the total number of attributes in each medical record and |g| indicates the number of clusters.
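The cluster-center computation of equations (5) to (7) amounts to an attribute-wise mean, which a short sketch can make concrete. The attribute values below are hypothetical, and `cluster_center` is an illustrative helper name.

```python
from statistics import fmean


def cluster_center(records):
    """Attribute-wise mean of a cluster: <mu_g1, mu_g2, ..., mu_gn>."""
    return tuple(fmean(column) for column in zip(*records))


# Single-attribute cluster holding R1, R6, R8 and R9, as in equation (5).
cluster_d = {"R1": (2.0,), "R6": (4.0,), "R8": (6.0,), "R9": (8.0,)}
mu_d = cluster_center(list(cluster_d.values()))  # (2 + 4 + 6 + 8) / 4

# A cluster of two-attribute records yields one mean per attribute, as in equation (7).
mu_g = cluster_center([(1.0, 10.0), (3.0, 30.0)])
```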

Fig 4. G1 before clustering. Fig 5. Before and after clustering.

Case study. In Section V, we present a case study that finds missing attribute values of medical records using the proposed approach. For this, we consider a sample dataset. Table I shows the sample dataset of medical records with categorical and numerical values. Table II shows the medical records without missing values after normalizing the sample dataset. Table III lists records with and without missing values. Table IV lists all records without missing values, and Table V shows records with missing attribute values. Table VI depicts the clusters generated from group G1, which consists of the medical records with no missing values, after applying the k-means algorithm. Two clusters are generated, C1 and C2: C1 contains the medical records {R1, R4, R6, R9} and C2 contains the medical records {R2, R7, R8}. Table VII gives the distances of records in group G1 to the cluster center of the first cluster. Similarly, Table VIII gives the distances of records in group G2 to the cluster center of the second cluster.

Table IX depicts the computed values of the mapping function for the records of group G1. The mapping function Map(R_i) is the mapping distance of the ith record, which is the sum of all distances from record R_i to each of the cluster centers generated by the clustering algorithm. Table X gives the distances of the medical records in group G2 to each of the cluster centers. Table XI depicts the computed values of the mapping function for the medical records of group G2, which contain missing values. The mapping function Map_r(R_j) is the mapping distance of the jth record.

TABLE I. NORMALIZED SAMPLE DATASET OF MEDICAL RECORDS TABLE II. MEDICAL RECORDS WITH AND WITHOUT MISSING VALUES

TABLE VII. DISTANCE OF MEDICAL RECORDS TO CLUSTER-1 TABLE VIII. DISTANCE OF MEDICAL RECORDS TO CLUSTER-2

TABLE XIV. DISTANCE OF MEDICAL RECORD R5 TO OTHER RECORDS TABLE XV. NEAREST MEDICAL RECORD FOR RECORD R5

Conclusion. In this paper we address the first challenge of handling missing values in medical datasets. We also show how dimensionality reduction of medical datasets may be achieved with a simple approach. We discuss a new approach to finding missing values in datasets, not addressed in the literature, that reduces each record to a single dimension. The importance of this approach is that it does not lose any attribute information while carrying out dimensionality reduction. The proposed approach to imputing missing values in medical records is feasible for both categorical and numerical attributes, as discussed in the case study.

