
1 Data Mining: Discovering Information From Bio-Data Presented by: Hongli Li & Nianya Liu, University of Massachusetts Lowell

2 Introduction
Data Mining Background: Process, Functionalities, Techniques
Two Examples: Short Peptides, Clinical Records
Conclusion

3 Data Mining Background - Process

4 Functionalities: Classification, Cluster Analysis, Outlier Analysis, Trend Analysis, Association Analysis

5 Techniques: Decision Trees, Bayesian Classification, Hidden Markov Models, Support Vector Machines, Artificial Neural Networks

6 Technique 1 – Decision Tree

7 Technique 2 – Bayesian Classification Based on Bayes' theorem – simple, but comparable to decision tree and neural network classifiers in many applications.
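The slide gives no worked example, so here is a minimal sketch of Bayes-theorem-based classification, assuming scikit-learn is available; the toy feature vectors and labels below are invented for illustration, not taken from the slides.

```python
# Minimal naive Bayes sketch (illustrative only; toy data is invented).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy feature vectors (two measurements per sample) with binary class labels.
X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)             # fits per-class Gaussians, applies Bayes' theorem
print(clf.predict([[1.1, 2.0]]))         # -> [0]
print(clf.predict_proba([[1.1, 2.0]]))   # posterior P(class | features)
```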

8 Technique 3 – Hidden Markov Model

9 Technique 4 – Support Vector Machines
SVMs find the maximum-margin hyperplane that separates the classes:
– The hyperplane can be represented as a linear combination of training points.
– The algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space.
– A separating hyperplane can therefore be located in the feature space, and points classified in that space, simply by defining a kernel function.
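To make the kernel idea concrete, here is a minimal sketch assuming scikit-learn; the toy points are invented. The kernel stands in for dot products in the feature space, so the separating hyperplane never has to be computed in explicit feature-space coordinates.

```python
# Minimal SVM sketch (illustrative only; toy data is invented).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="rbf", C=1.0).fit(X, y)  # maximum-margin separator via a kernel function
print(clf.predict([[3, 3], [7, 7]]))      # -> [0 1]
print(clf.support_vectors_)               # the training points that define the hyperplane
```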

10 Example 1 – Short Peptides Problem: identify T-cell epitopes from melanoma antigens. Training set: 602 HLA-DR4 binding peptides, 713 non-binding peptides. Solution: neural networks.

11 Neural Networks – Single Computing Element

12 Neural Networks Classifier Sparse coding: each amino acid is encoded as a 20-bit vector with a single 1 (e.g., alanine = 10000000000000000000), so a 9-residue peptide gives 9 × 20 = 180 bits per input.
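A minimal sketch of this sparse (one-hot) coding: each of the 20 standard amino acids becomes a 20-bit vector with a single 1, so a 9-residue peptide yields 180 input bits. The example peptide is invented.

```python
# Sparse coding of a 9-residue peptide, as on the slide: 9 x 20 = 180 bits.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode_peptide(peptide: str) -> list[int]:
    bits = []
    for residue in peptide:
        vec = [0] * 20
        vec[AMINO_ACIDS.index(residue)] = 1  # alanine ('A') -> 1 followed by nineteen 0s
        bits.extend(vec)
    return bits

code = encode_peptide("ALDEFGHIK")  # hypothetical 9-mer
print(len(code))                    # 180
```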

13 Neural Networks – Error Back-Propagation Squared error: E = (1/2)(t − y)^2. Weight adjustment: Δw = η·δ·x, where x is the output of the computing element of the first layer, δ = t − y is the difference between the correct output t and the actual output y, and η is a fixed learning rate.
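A minimal sketch of the update rule reconstructed above, for a single linear computing element with invented toy numbers; a full multi-layer back-propagation implementation would chain this step through the layers.

```python
# One gradient step of the reconstructed rule: delta = t - y, dw = eta * delta * x.
def sgd_step(w, x, t, eta=0.1):
    y = sum(wi * xi for wi, xi in zip(w, x))   # linear computing element
    delta = t - y                              # difference between target and output
    w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
    return w, 0.5 * delta ** 2                 # updated weights, squared error

w = [0.0, 0.0]
for _ in range(50):
    w, err = sgd_step(w, x=[1.0, 2.0], t=1.0)
print(w, err)  # weights converge so that w . x is approximately 1.0
```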

14 Result & Remarks Success rate: 60%. A systematic experimental study is very expensive; a highly accurate prediction method can reduce the cost. Other alternatives exist.

15 Data Mining: Discovering Information from Clinical Records

16 Problem Problem: use already-known data (clinical records) to predict unknown data. How to analyze the known data? – as training data. How to test unknown data? – by prediction.

17 Problem The data has many attributes – e.g., 2300 combinations of attributes with 8 attributes for one class. It is impossible to calculate all of them manually.

18 One example: eight attributes for diabetic patients:
(1) Number of times pregnant
(2) Plasma glucose
(3) Diastolic blood pressure
(4) Triceps skin fold thickness
(5) Two-hour serum insulin
(6) Body mass index
(7) Diabetes pedigree
(8) Age

19 CAEP – Classification by Aggregating Emerging Patterns A classification (known data) and prediction (unknown data) algorithm.

20 CAEP – Classification by Aggregating Emerging Patterns
Definition:
(1) Training data: discover all the emerging patterns.
(2) Training data: sum and normalize the differentiating weights of these emerging patterns.
(3) Training data: choose the class with the largest normalized score as the winner.
(4) Test data: compute the score of the test instance and make a prediction.

21 CAEP: Emerging Pattern
Definition: an emerging pattern is a pattern of attribute values whose frequency increases significantly from one class to another.
Example:
Mushroom      Poisonous   Edible
Smell         odor        none
Surface       wrinkles    smooth
Ring number   1           3

22 CAEP: Classification Definition: (1) Discover the factors that differentiate the two groups. (2) Find a way to use these factors to predict to which group a new patient should belong.

23 CAEP: Method Method: discretize the dataset into a binary one. Item: an (attribute, interval) pair, e.g., (age, >45). Instance: a set of items such that an item (A, v) is in instance t if and only if the value of attribute A of t is within the interval v.
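A minimal sketch of this discretization step; the cut points below are invented for illustration, not the paper's actual intervals.

```python
# Each (attribute, interval) pair is a binary item; an instance is the set of
# items whose intervals contain the record's values. Intervals are invented.
INTERVALS = {
    "age": [(0, 45), (45, 200)],              # e.g., the slide's item (age, >45)
    "plasma_glucose": [(0, 120), (120, 999)],
}

def to_items(record: dict) -> set:
    items = set()
    for attr, value in record.items():
        for lo, hi in INTERVALS[attr]:
            if lo <= value < hi:
                items.add((attr, (lo, hi)))   # item (A, v): value of A falls in v
    return items

print(to_items({"age": 52, "plasma_glucose": 98}))
# {('age', (45, 200)), ('plasma_glucose', (0, 120))}
```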

24 Clinical Record: 768 women. Diabetic instances: 161 (21%). Non-diabetic instances: 546 (71%).

25 CAEP: Support Support of X. Definition: the ratio of the number of instances in a class that contain X to the total number of instances in that class. Formula: supp_D(X) = count_D(X) / |D|. Meaning: if supp_D(X) is high, attribute X occurs in many instances of this class. Example: how many people in the diabetic class are older than 60 (item: age > 60)? 148/161 = 91%.

26 CAEP: Growth The growth rate of X. Definition: the ratio of the supports of the same attribute in two classes. Formula: growth_D(X) = supp_D(X) / supp_D'(X). Meaning: if growth_D(X) is high, X is much more likely to occur in class D than in class D'. Example: 91% of patients in the diabetic class are older than 60, versus 10% in the non-diabetic class, so growth(>60) = 91% / 10% = 9.1.

27 CAEP: Likelihood likelihood_D(X)
Definition: the ratio of the number of instances with attribute X in one class to the total number of instances with attribute X in both classes.
Formula 1: likelihood_D(X) = supp_D(X)·|D| / (supp_D(X)·|D| + supp_D'(X)·|D'|)
Formula 2 (if D and D' are roughly equal in size): likelihood_D(X) = supp_D(X) / (supp_D(X) + supp_D'(X))
Example (Formula 1): (91% × 223) / (91% × 223 + 10% × 545) = 203 / 257 = 78.99%
Example (Formula 2): 91% / (91% + 10%) = 91% / 101% = 90.10%
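A minimal sketch computing the three quantities from slides 25-27, checked against the slides' own worked numbers (the counts are taken from the slides, not recomputed).

```python
# Support, growth rate, and likelihood as defined on slides 25-27.
def support(count_with_x: int, class_size: int) -> float:
    return count_with_x / class_size                          # supp_D(X)

def growth(supp_d: float, supp_d2: float) -> float:
    return supp_d / supp_d2 if supp_d2 else float("inf")      # growth_D(X)

def likelihood(supp_d: float, size_d: int, supp_d2: float, size_d2: int) -> float:
    num = supp_d * size_d                                     # expected count of X in D
    return num / (num + supp_d2 * size_d2)

print(support(148, 161))                 # ~0.92 (slide: 91%), diabetic patients over 60
print(growth(0.91, 0.10))                # ~9.1
print(likelihood(0.91, 223, 0.10, 545))  # ~0.79 (slide: 78.99%)
```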

28 CAEP: Evaluation
Sensitivity: the ratio of the number of correctly predicted diabetic instances to the number of diabetic instances. Example: 60 correctly predicted / 100 diabetic = 60%.
Specificity: the ratio of the number of correctly predicted diabetic instances to the number of predicted instances. Example: 60 correctly predicted / 120 predicted = 50%.
Accuracy: the percentage of instances correctly classified. Example: 60 correctly predicted / 180 = 33%.
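A small sketch implementing the metrics exactly as the slide defines them (note that the slide's "specificity" divides by the number of predicted positives, which is usually called precision elsewhere); the numbers are the slide's worked example.

```python
# Evaluation metrics following the slide's definitions.
def sensitivity(correct_pos: int, actual_pos: int) -> float:
    return correct_pos / actual_pos          # correct positives / actual positives

def specificity_slide(correct_pos: int, predicted_pos: int) -> float:
    return correct_pos / predicted_pos       # the slide's definition (precision-like)

def accuracy(correct: int, total: int) -> float:
    return correct / total

print(sensitivity(60, 100))        # 0.60
print(specificity_slide(60, 120))  # 0.50
print(accuracy(60, 180))           # ~0.33
```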

29 CAEP: Evaluation Using one attribute for class prediction gives high accuracy but low sensitivity: only 30% of diabetic instances are identified.

30 CAEP: Prediction Consider all attributes: accumulate the scores of all emerging patterns that instance t contains for class D. Formula: score(t, D) = Σ_X likelihood_D(X) · supp_D(X), summed over the emerging patterns X of D contained in t. Prediction: score(t, D) > score(t, D') implies t belongs to class D. (A combined code sketch of scoring and normalization follows the next slide.)

31 CAEP: Normalize If the numbers of emerging patterns differ significantly – one class D has many more emerging patterns than another class D' – then an instance of D tends to receive a higher score than an instance of D'. Normalize the score: norm_score(t, D) = score(t, D) / base_score(D). Prediction: if norm_score(t, D) > norm_score(t, D'), then t belongs to class D.
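A hedged sketch of the scoring, normalization, and prediction steps from slides 30-31. The pattern set, its likelihood and support values, and the base scores are invented placeholders; CAEP derives base_score(D) from the scores of training instances of D, which this sketch simply takes as a given parameter.

```python
# CAEP scoring, normalization, and prediction (slides 30-31); values are invented.
def score(t_items: set, patterns: dict) -> float:
    # patterns maps each emerging pattern (a frozenset of items) to
    # (likelihood_D, supp_D); only patterns contained in the instance count.
    return sum(lik * supp
               for pat, (lik, supp) in patterns.items()
               if pat <= t_items)

def norm_score(t_items: set, patterns: dict, base_score: float) -> float:
    return score(t_items, patterns) / base_score   # divide out per-class score bias

def predict(t_items, pats_d, base_d, pats_d2, base_d2) -> str:
    return ("D" if norm_score(t_items, pats_d, base_d)
                   > norm_score(t_items, pats_d2, base_d2)
            else "D'")

pats_d = {frozenset({("age", ">60")}): (0.90, 0.91)}    # one invented pattern
print(predict({("age", ">60")}, pats_d, 1.0, {}, 1.0))  # -> "D"
```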

32 CAEP: Comparison with C4.5 and CBA
Method   Sensitivity (diabetic / non-diabetic)   Specificity (diabetic / non-diabetic)   Accuracy
C4.5     –                                       –                                       71.1%
CBA      –                                       –                                       73.0%
CAEP     70.5% / 63.3%                           77.4% / 83.1%                           75%

33 CAEP: Modify Problem: CAEP produces a very large number of emerging patterns. Example: with 8 attributes, 2300 emerging patterns.

34 CAEP: Modify Reduce the number of emerging patterns. Method: prefer strong emerging patterns over their weaker relatives. Example: X1 has infinite growth but very small support; X2 has less growth but much larger support, say 30 times that of X1. In such a case X2 is preferred because it covers many more cases than X1. There is no loss in prediction performance from this reduction of emerging patterns.
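One possible reading of this reduction, offered as an assumption rather than the paper's exact procedure: among emerging patterns related by set inclusion, drop a pattern when a more general relative (a subset pattern) already has at least its support, since the general pattern covers more cases.

```python
# Hypothetical pattern-reduction sketch: keep only patterns not dominated by a
# more general (subset) relative with support at least as large.
def reduce_patterns(patterns: dict) -> dict:
    # patterns: frozenset of items -> (growth, support)
    kept = {}
    for pat, (g, s) in patterns.items():
        dominated = any(other <= pat and s2 >= s
                        for other, (g2, s2) in patterns.items()
                        if other != pat)
        if not dominated:
            kept[pat] = (g, s)
    return kept

pats = {frozenset({("age", ">60")}): (9.1, 0.91),
        frozenset({("age", ">60"), ("bmi", ">30")}): (float("inf"), 0.03)}
print(reduce_patterns(pats))  # keeps only the more general, higher-support pattern
```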

35 CAEP: Variations JEP: uses exclusively emerging patterns whose supports increase from zero to nonzero, called jump emerging patterns; performs well when there are many jump emerging patterns. DeEP: has more training phases – training is customized for each test instance; slightly better accuracy, and it incorporates new training data easily.

36 Relevance analysis Data mining algorithms are in general exponential in complexity. Relevance analysis excludes the attributes that do not contribute to the classification process, allowing much higher-dimensional datasets to be handled. It is not always useful for lower-ranking dimensions.

37 Conclusion This talk covered the classification and prediction aspects of data mining. Methods include decision trees, mathematical formulae, artificial neural networks, and emerging patterns. They are applicable in a large variety of classification applications. CAEP has good predictive accuracy on all data sets.

