1 CSE 300 Data mining and its application and usage in medicine By Radhika.

1 CSE 300 Data mining and its application and usage in medicine By Radhika

2 CSE 300 Data Mining and Medicine  History  Past 20 years with relational databases  More dimensions to database queries  earliest and most successful area of data mining  Mid 1800s in London hit by infectious disease  Two theories –Miasma theory  Bad air propagated disease –Germ theory  Water-borne  Advantages –Discover trends even when we don’t understand reasons –Discover irrelevant patterns that confuse than enlighten –Protection against unaided human inference of patterns provide quantifiable measures and aid human judgment  Data Mining  Patterns persistent and meaningful  Knowledge Discovery of Data

3 CSE 300 The future of data mining  10 biggest killers in the US  Data mining = Process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data

4 CSE 300 Major Issues in Medical Data Mining  Heterogeneity of medical data  Volume and complexity  Physician’s interpretation  Poor mathematical categorization  Canonical Form  Solution: Standard vocabularies, interfaces between different sources of data integrations, design of electronic patient records  Ethical, Legal and Social Issues  Data Ownership  Lawsuits  Privacy and Security of Human Data  Expected benefits  Administrative Issues

5 CSE 300 Why Data Preprocessing?  Patient records consist of clinical, lab parameters, results of particular investigations, specific to tasks  Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data  Noisy: containing errors or outliers  Inconsistent: containing discrepancies in codes or names  Temporal chronic diseases parameters  No quality data, no quality mining results!  Data warehouse needs consistent integration of quality data  Medical Domain, to handle incomplete, inconsistent or noisy data, need people with domain knowledge

6 CSE 300 What is Data Mining? The KDD Process Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation

7 CSE 300 From Tables and Spreadsheets to Data Cubes  A data warehouse is based on a multidimensional data model that views data in the form of a data cube  A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions  Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of related dimension tables  W. H. Inmon: “ A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management ’ s decision-making process. ”

8 CSE 300 Data Warehouse vs. Heterogeneous DBMS  Data warehouse: update-driven, high performance  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis  Do not contain most current information  Query processing does not interfere with processing at local sources  Store and integrate historical information  Support complex multidimensional queries

9 CSE 300 Data Warehouse vs. Operational DBMS  OLTP (on-line transaction processing)  Major task of traditional relational DBMS  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.  OLAP (on-line analytical processing)  Major task of data warehouse system  Data analysis and decision making  Distinct features (OLTP vs. OLAP):  User and system orientation: customer vs. market  Data contents: current, detailed vs. historical, consolidated  Database design: ER + application vs. star + subject  View: current, local vs. evolutionary, integrated  Access patterns: update vs. read-only but complex queries

10 CSE 300

11 CSE 300 Why Separate Data Warehouse?  High performance for both systems  DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery  Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation  Different functions and different data:  Missing data: Decision support requires historical data which operational DBs do not typically maintain  Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources  Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

12 CSE 300

13 CSE 300

14 CSE 300 Typical OLAP Operations  Roll up (drill-up): summarize data  by climbing up hierarchy or by dimension reduction  Drill down (roll down): reverse of roll-up  from higher level summary to lower level summary or detailed data, or introducing new dimensions  Slice and dice:  project and select  Pivot (rotate):  reorient the cube, visualization, 3D to series of 2D planes.  Other operations  drill across: involving (across) more than one fact table  drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

15 CSE 300

16 CSE 300

17 CSE 300 Multi-Tiered Architecture Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Monitor & Integrator Metadata Data Sources Front-End Tools Serve Data Marts Operational DBs other sources Data Storage OLAP Server

18 CSE 300 Steps of a KDD Process  Learning the application domain:  relevant prior knowledge and goals of application  Creating a target data set: data selection  Data cleaning and preprocessing: (may take 60% of effort!)  Data reduction and transformation:  Find useful features, dimensionality/variable reduction, invariant representation.  Choosing functions of data mining  summarization, classification, regression, association, clustering.  Choosing the mining algorithm(s)  Data mining: search for patterns of interest  Pattern evaluation and knowledge presentation  visualization, transformation, removing redundant patterns, etc.  Use of discovered knowledge

19 CSE 300 Common Techniques in Data Mining  Predictive Data Mining  Most important  Classification: Relate one set of variables in data to response variables  Regression: estimate some continuous value  Descriptive Data Mining  Clustering: Discovering groups of similar instances  Association rule extraction  Variables/Observations  Summarization of group descriptions

20 CSE 300Leukemia  Different types of cells look very similar  Given a number of samples (patients)  can we diagnose the disease accurately?  Predict the outcome of treatment?  Recommend best treatment based of previous treatments?  Solution: Data mining on micro-array data  38 training patients, 34 testing patients ~ 7000 patient attributes  2 classes: Acute Lymphoblastic Leukemia(ALL) vs Acute Myeloid Leukemia (AML)

21 CSE 300 Clustering/Instance Based Learning  Uses specific instances to perform classification than general IF THEN rules  Nearest Neighbor classifier  Most studied algorithms for medical purposes  Clustering– Partitioning a data set into several groups (clusters) such that  Homogeneity: Objects belonging to the same cluster are similar to each other  Separation: Objects belonging to different clusters are dissimilar to each other.  Three elements  The set of objects  The set of attributes  Distance measure

22 CSE 300 Measure the Dissimilarity of Objects  Find best matching instance  Distance function  Measure the dissimilarity between a pair of data objects  Things to consider  Usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables  Weights should be associated with different variables based on applications and data semantic  Quality of a clustering result depends on both the distance measure adopted and its implementation

23 CSE 300 Minkowski Distance  Minkowski distance: a generalization  If q = 2, d is Euclidean distance  If q = 1, d is Manhattan distance xixi xjxj q=2q=1 6 6 12 8.48 X i (1,7) X j (7,1)

24 CSE 300 Binary Variables  A contingency table for binary data  Simple matching coefficient Object i Object j

25 CSE 300 Dissimilarity between Binary Variables  Example A1A2A3A4A5A6A7 Object 1 1011100 Object 2 1110001 Object 1 Object 210sum1224 0213 sum437

26 CSE 300 K-nearest neighbors algorithm  Initialization  Arbitrarily choose k objects as the initial cluster centers (centroids)  Iteration until no change  For each object O i  Calculate the distances between O i and the k centroids  (Re)assign O i to the cluster whose centroid is the closest to O i  Update the cluster centroids based on current assignment

27 CSE 300 k-Means Clustering Method cluster mean current clusters new clusters objects relocated

28 CSE 300Dataset  Data set from UCI repository  http://kdd.ics.uci.edu/ http://kdd.ics.uci.edu/  768 female Pima Indians evaluated for diabetes  After data cleaning 392 data entries

29 CSE 300 Hierarchical Clustering  Groups observations based on dissimilarity  Compacts database into “labels” that represent the observations  Measure of similarity/Dissimilarity  Euclidean Distance  Manhattan Distance  Types of Clustering  Single Link  Average Link  Complete Link

30 CSE 300 Hierarchical Clustering: Comparison Average-link Centroid distance 1 2 3 4 5 6 1 2 5 3 4 Single-linkComplete-link 1 2 3 4 5 6 1 2 5 3 4 1 2 3 4 5 6 1 2 5 3 4 1 2 3 4 5 6 1 2 3 4 5

31 CSE 300 Compare Dendrograms 1 2 5 3 6 4 2 5 3 6 4 1 Average-link Centroid distance Single-link Complete-link

32 CSE 300 Which Distance Measure is Better?  Each method has both advantages and disadvantages; application-dependent  Single-link  Can find irregular-shaped clusters  Sensitive to outliers  Complete-link, Average-link, and Centroid distance  Robust to outliers  Tend to break large clusters  Prefer spherical clusters

33 CSE 300 Dendrogram from dataset  Minimum spanning tree through the observations  Single observation that is last to join the cluster is patient whose blood pressure is at bottom quartile, skin thickness is at bottom quartile and BMI is in bottom half  Insulin was however largest and she is 59-year old diabetic

34 CSE 300 Dendrogram from dataset  Maximum dissimilarity between observations in one cluster when compared to another

35 CSE 300 Dendrogram from dataset  Average dissimilarity between observations in one cluster when compared to another

36 CSE 300 Supervised versus Unsupervised Learning  Supervised learning (classification)  Supervision: Training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations  New data is classified based on training set  Unsupervised learning (clustering)  Class labels of training data are unknown  Given a set of measurements, observations, etc., need to establish existence of classes or clusters in data

37 CSE 300  Derive models that can use patient specific information, aid clinical decision making  Apriori decision on predictors and variables to predict  No method to find predictors that are not present in the data  Numeric Response  Least Squares Regression  Categorical Response  Classification trees  Neural Networks  Support Vector Machine  Decision models  Prognosis, Diagnosis and treatment planning  Embed in clinical information systems Classification and Prediction

38 CSE 300 Least Squares Regression  Find a linear function of predictor variables that minimize the sum of square difference with response  Supervised learning technique  Predict insulin in our dataset :glucose and BMI

39 CSE 300 Decision Trees  Decision tree  Each internal node tests an attribute  Each branch corresponds to attribute value  Each leaf node assigns a classification  ID3 algorithm  Based on training objects with known class labels to classify testing objects  Rank attributes with information gain measure  Minimal height  least number of tests to classify an object  Used in commercial tools eg: Clementine  ASSISTANT  Deal with medical datasets  Incomplete data  Discretize continuous variables  Prune unreliable parts of tree  Classify data

40 CSE 300 Decision Trees

41 CSE 300 Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm)  Attributes are categorical (if continuous-valued, they are discretized in advance)  Tree is constructed in a top-down recursive divide-and-conquer manner  At start, all training examples are at the root  Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain)  Examples are partitioned recursively based on selected attributes

42 CSE 300 Training Dataset AgeBMIHereditaryVision Risk of Condition X P1<=30highnofairno P2<=30highnoexcellentno P3>40highnofairyes P4 31 … 40 mediumnofairyes P5 lowyesfairyes P6 lowyesexcellentno P7>40lowyesexcellentyes P8<=30mediumnofairno P9<=30lowyesfairyes P10 mediumyesfairyes P11<=30mediumyesexcellentyes P12>40mediumnoexcellentyes P13>40highyesfairyes P14 mediumnoexcellentno

43 CSE 300 Construction of A Decision Tree for “ Condition X ” Age? >40 30…40 <=30 [P1,…P14] Yes: 9, No:5 [P1,P2,P8,P9,P11] Yes: 2, No:3 [P3,P7,P12,P13] Yes: 4, No:0 [P4,P5,P6,P10,P14] Yes: 3, No:2 History noyes YES [P1,P2,P8] Yes: 0, No:3 [P9,P11] Yes: 2, No:0 Vision fairexcellent NOYES NO YES [P6,P14] Yes: 0, No:2 [P4,P5,P10] Yes: 3, No:0

44 CSE 300 Entropy and Information Gain  S contains s i tuples of class C i for i = {1,..., m}  Information measures info required to classify any arbitrary tuple  Entropy of attribute A with values {a 1,a 2, …,a v }  Information gained by branching on attribute A

45 CSE 300 Entropy and Information Gain  Select attribute with the highest information gain (or greatest entropy reduction)  Such attribute minimizes information needed to classify samples

46 CSE 300 Rule Induction  IF conditions THEN Conclusion  Eg: CN2  Concept description:  Characterization: provides a concise and succinct summarization of given collection of data  Comparison: provides descriptions comparing two or more collections of data  Training set, testing set  Imprecise  Predictive Accuracy  P/P+N

47 CSE 300 Example used in a Clinic  Hip arthoplasty trauma surgeon predict patient’s long- term clinical status after surgery  Outcome evaluated during follow-ups for 2 years  2 modeling techniques  Naïve Bayesian classifier  Decision trees  Bayesian classifier  P(outcome=good) = 0.55 (11/20 good)  Probability gets updated as more attributes are considered  P(timing=good|outcome=good) = 9/11 (0.846)  P(outcome = bad) = 9/20 P(timing=good|outcome=bad) = 5/9

48 CSE 300 Nomogram

49 CSE 300 Bayesian Classification  Bayesian classifier vs. decision tree  Decision tree: predict the class label  Bayesian classifier: statistical classifier; predict class membership probabilities  Based on Bayes theorem; estimate posterior probability  Na ï ve Bayesian classifier:  Simple classifier that assumes attribute independence  High speed when applied to large databases  Comparable in performance to decision trees

50 CSE 300 Bayes Theorem  Let X be a data sample whose class label is unknown  Let H i be the hypothesis that X belongs to a particular class C i  P(H i ) is class prior probability that X belongs to a particular class C i  Can be estimated by n i /n from training data samples  n is the total number of training data samples  n i is the number of training data samples of class C i Formula of Bayes Theorem

51 CSE 300 More classification Techniques  Neural Networks  Similar to pattern recognition properties of biological systems  Most frequently used  Multi-layer perceptrons –Input with bias, connected by weights to hidden, output  Backpropagation neural networks  Support Vector Machines  Separate database to mutually exclusive regions  Transform to another problem space  Kernel functions (dot product)  Output of new points predicted by position  Comparison with classification trees  Not possible to know which features or combination of features most influence a prediction

52 CSE 300 Multilayer Perceptrons  Non-linear transfer functions to weighted sums of inputs  Werbos algorithm  Random weights  Training set, Testing set

53 CSE 300 Support Vector Machines  3 steps  Support Vector creation  Maximal distance between points found  Perpendicular decision boundary  Allows some points to be misclassified  Pima Indian data with X1(glucose) X2(BMI)

54 CSE 300 What is Association Rule Mining?  Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories Example of Association Rules {High LDL, Low HDL}  {Heart Failure} PatientIDConditions 1 High LDL Low HDL, High BMI, Heart Failure 2 High LDL Low HDL, Heart Failure, Diabetes 3Diabetes 4 High LDL Low HDL, Heart Failure 5 High BMI, High LDL Low HDL, Heart Failure   People who have high LDL (“bad” cholesterol), low HDL (“good cholesterol”) are at higher risk of heart failure.

55 CSE 300 Association Rule Mining  Market Basket Analysis  Same groups of items bought placed together  Healthcare  Understanding among association among patients with demands for similar treatments and services  Goal : find items for which joint probability of occurrence is high  Basket of binary valued variables  Results form association rules, augmented with support and confidence

56 CSE 300 Association Rule Mining  Association Rule   An implication expression of the form X  Y, where X and Y are itemsets and X  Y=   Rule Evaluation Metrics   Support (s): Fraction of transactions that contain both X and Y   Confidence (c): Measures how often items in Y appear in transactions that contain X Trans containing Y Trans containing both X and Y Trans containing X D

57 CSE 300 The Apriori Algorithm  Starts with most frequent 1-itemset  Include only those “items” that pass threshold  Use 1-itemset to generate 2-itemsets  Stop when threshold not satisfied by any itemset  L 1 = {frequent items}; for (k = 1; L k !=  ; k++) do  Candidate Generation: C k+1 = candidates generated from L k ;  Candidate Counting: for each transaction t in database do increment the count of all candidates in C k+1 that are contained in t  L k+1 = candidates in C k+1 with min_sup return  k L k ;

58 CSE 300 Apriori-based Mining

59 CSE 300 Principle Component Analysis  Principle Components  In cases of large number of variables, highly possible that some subsets of the variables are very correlated with each other. Reduce variables but retain variability in dataset  Linear combinations of variables in the database  Variance of each PC maximized –Display as much spread of the original data  PC orthogonal with each other –Minimize the overlap in the variables  Each component normalized sum of square is unity –Easier for mathematical analysis  Number of PC < Number of variables  Associations found  Small number of PC explain large amount of variance  Example 768 female Pima Indians evaluated for diabetes  Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, Diastolic blood pressure, Triceps skin fold thickness, Two-hour serum insulin, BMI, Diabetes pedigree function, Age, Diabetes onset within last 5 years

60 CSE 300 PCA Example

61 CSE 300 National Cancer Institute  CancerNet http://www.nci.nih.gov http://www.nci.nih.gov  CancerNet for Patients and the Public  CancerNet for Health Professionals  CancerNet for Basic Reasearchers  CancerLit

62 CSE 300Conclusion  About ¾ billion of people’s medical records are electronically available  Data mining in medicine distinct from other fields due to nature of data: heterogeneous, with ethical, legal and social constraints  Most commonly used technique is classification and prediction with different techniques applied for different cases  Associative rules describe the data in the database  Medical data mining can be the most rewarding despite the difficulty

63 CSE 300 Thank you !!!

1 CSE 300 Data mining and its application and usage in medicine By Radhika.

Similar presentations

Presentation on theme: "1 CSE 300 Data mining and its application and usage in medicine By Radhika."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 CSE 300 Data mining and its application and usage in medicine By Radhika.

Similar presentations

Presentation on theme: "1 CSE 300 Data mining and its application and usage in medicine By Radhika."— Presentation transcript:

Similar presentations

About project

Feedback